KFinEval-Pilot: A Comprehensive Benchmark Suite for Korean Financial Language Understanding

Bibliographic Details
Main Authors	Hwang, Bokwang; Lim, Seonkyu; Kim, Taewoong; Geun, Yongjae; Bang, Sunghyun; Park, Sohyun; Park, Jihyun; Lee, Myeonggyu; Lee, Jinwoo; Kim, Yerin; Yoo, Jinsun; Hong, Jingyeong; Park, Jina; Kim, Yongchan; Kim, Suhyun; Hahm, Younggyun; Lee, Yiseul; Kang, Yejee; Yoon, Chanhyuk; Lee, Chansu; Jeong, Heeyewon; Lee, Jiyeon; Gu, Seonhye; Kang, Hyebin; Cho, Yousang; Yoo, Hangyeol; Lim, KyungTae
Format	Journal Article
Language	English
Published	16.04.2025
Subjects	Computer Science - Artificial Intelligence; Computer Science - Computation and Language; Computer Science - Learning
DOI	10.48550/arxiv.2504.13216

Summary:	We introduce KFinEval-Pilot, a benchmark suite specifically designed to evaluate large language models (LLMs) in the Korean financial domain. Addressing the limitations of existing English-centric benchmarks, KFinEval-Pilot comprises over 1,000 curated questions across three critical areas: financial knowledge, legal reasoning, and financial toxicity. The benchmark is constructed through a semi-automated pipeline that combines GPT-4-generated prompts with expert validation to ensure domain relevance and factual accuracy. We evaluate a range of representative LLMs and observe notable performance differences across models, with trade-offs between task accuracy and output safety across different model families. These results highlight persistent challenges in applying LLMs to high-stakes financial applications, particularly in reasoning and safety. Grounded in real-world financial use cases and aligned with the Korean regulatory and linguistic context, KFinEval-Pilot serves as an early diagnostic tool for developing safer and more reliable financial AI systems.