IaC-Eval: A code generation benchmark for Infrastructure-as-Code programs. In NeurIPS 2024

While LLMs show promise in general-purpose code generation, their effectiveness in IaC development remains largely unexplored. To address this, we developed the first dataset and benchmark for evaluating LLM-based IaC code generation. Our dataset comprises 458 human-curated scenarios spanning a wide range of AWS services, representing over 1,720 hours of human effort. Our results reveal significant gaps between LLMs' general code-generation abilities and their ability to produce correct IaC programs.
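To make the task concrete, here is a minimal, hypothetical sketch of the kind of check an IaC generation benchmark might apply to model output, assuming Terraform as the IaC language and a locally installed Terraform CLI. The scenario prompt, the generated configuration, and the validation logic below are illustrative only; they are not taken from the IaC-Eval dataset or its actual evaluation harness.

```python
import subprocess
import tempfile
from pathlib import Path

# Hypothetical model output for an illustrative scenario such as
# "Create a private S3 bucket with versioning enabled".
GENERATED_TF = """
provider "aws" {
  region = "us-east-1"
}

resource "aws_s3_bucket" "example" {
  bucket = "iac-eval-demo-bucket"
}

resource "aws_s3_bucket_versioning" "example" {
  bucket = aws_s3_bucket.example.id
  versioning_configuration {
    status = "Enabled"
  }
}
"""

def terraform_validates(config: str) -> bool:
    """Write a generated Terraform config to a scratch directory and run
    `terraform init` + `terraform validate` as a basic syntactic and
    schema-level correctness check. Requires the Terraform CLI on PATH."""
    with tempfile.TemporaryDirectory() as workdir:
        Path(workdir, "main.tf").write_text(config)
        for cmd in (["terraform", "init", "-backend=false"],
                    ["terraform", "validate"]):
            result = subprocess.run(cmd, cwd=workdir,
                                    capture_output=True, text=True)
            if result.returncode != 0:
                print(result.stderr)
                return False
    return True

if __name__ == "__main__":
    print("valid" if terraform_validates(GENERATED_TF) else "invalid")
```

Note that `terraform validate` only catches syntax and schema errors; judging whether generated code actually satisfies the user's intent, as a benchmark must, requires deeper checks than this sketch performs.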

September 2024 · Patrick Tser Jern Kon, Jiachen Liu, Yiming Qiu, Weijun Fan, Ting He, Lei Lin, Haoran Zhang, Owen M. Park, George Sajan Elengikal, Yuxin Kang, Ang Chen, Mosharaf Chowdhury, Myungjin Lee, and Xinyu Wang