Framework for evaluating code generation ability of large language models

Bibliographic Details
Published in	ETRI Journal, Vol. 46, no. 1, pp. 106-117
Main Authors	Yeo, Sangyeop; Ma, Yu‐Seung; Kim, Sang Cheol; Jun, Hyungkook; Kim, Taeho
Format	Journal Article
Language	English
Published	Electronics and Telecommunications Research Institute (ETRI; 한국전자통신연구원), 01.02.2024
Subjects	code generation; evaluation metric; large language model; natural language processing; software engineering; electronics/information and communication engineering (전자/정보통신공학)
ISSN	1225-6463 2233-7326
DOI	10.4218/etrij.2023-0357

Abstract	Large language models (LLMs) have revolutionized various applications in natural language processing and exhibited proficiency in generating programming code. We propose a framework for evaluating the code generation ability of LLMs and introduce a new metric, pass‐ratio@n, which captures the granularity of accuracy according to the pass rate of test cases. The framework is intended to be fully automatic to handle the repetitive work involved in generating prompts, conducting inferences, and executing the generated codes. A preliminary evaluation focusing on the prompt detail, problem publication date, and difficulty level demonstrates the successful integration of our framework with the LeetCode coding platform and highlights the applicability of the pass‐ratio@n metric.
Author	Yeo, Sangyeop (University of Science and Technology); Ma, Yu‐Seung (Electronics and Telecommunications Research Institute; ORCID 0000-0002-4168-5515; ysma@etri.re.kr); Kim, Sang Cheol (Electronics and Telecommunications Research Institute; ORCID 0000-0002-1925-2588); Jun, Hyungkook (Electronics and Telecommunications Research Institute); Kim, Taeho (Electronics and Telecommunications Research Institute; ORCID 0000-0002-5061-206X)
BackLink	https://www.kci.go.kr/kciportal/ci/sereArticleSearch/ciSereArtiView.kci?sereArticleSearchBean.artiId=ART003054740 (access content in National Research Foundation of Korea (NRF))
CitedBy_id	crossref_primary_10_4218_etr2_12666 crossref_primary_10_1145_3714464 crossref_primary_10_3390_app142110048
Cites_doi	10.4218/etrij.2019-0396 10.3115/1073083.1073135 10.4218/etrij.2021-0269 10.4218/etrij.2020-0282 10.1126/science.abq1158 10.1109/COMPSAC57700.2023.00117 10.1109/ICSE48619.2023.00035 10.1145/3558489.3559072 10.1145/3524842.3528470 10.1145/3580305.3599790
ContentType	Journal Article
Copyright	1225‐6463/$ © 2024 ETRI
Discipline	Engineering
EISSN	2233-7326
EndPage	117
Genre	article
GrantInformation	National Research Council of Science & Technology (NST), funder ID Global‐23‐001; Institute of Information & communications Technology Planning & Evaluation, funder ID 2022‐0‐00995
IsDoiOpenAccess	true
IsOpenAccess	true
IsPeerReviewed	true
IsScholarly	true
Issue	1
Notes	Funding information: This work was supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant (2022‐0‐00995, automated reliable source code generation from natural language descriptions, 95%) and a National Research Council of Science & Technology (NST) grant (Global‐23‐001, SeCode: Collaborative intelligent model for secure program code generator, 5%) funded by the Korea government (MSIT).
ORCID	0000-0002-1925-2588 0000-0002-4168-5515 0000-0002-5061-206X
OpenAccessLink	https://doaj.org/article/0149d44d4b2944788e506b1366390abf
PageCount	12
PublicationDate	2024-02-01 (February 2024)
PublicationTitle	ETRI journal
PublicationYear	2024
Publisher	Electronics and Telecommunications Research Institute (ETRI) 한국전자통신연구원
StartPage	106
Title	Framework for evaluating code generation ability of large language models
URI	https://onlinelibrary.wiley.com/doi/abs/10.4218%2Fetrij.2023-0357 https://doaj.org/article/0149d44d4b2944788e506b1366390abf https://www.kci.go.kr/kciportal/ci/sereArticleSearch/ciSereArtiView.kci?sereArticleSearchBean.artiId=ART003054740
Volume	46
ispartofPNX	ETRI Journal, 2024, 46(1), pp. 106-117