KFinEval-Pilot: A Comprehensive Benchmark Suite for Korean Financial Language Understanding
Main Authors | |
---|---|
Format | Journal Article |
Language | English |
Published | 16.04.2025 |
Subjects | |
DOI | 10.48550/arxiv.2504.13216 |
Summary: | We introduce KFinEval-Pilot, a benchmark suite specifically designed to
evaluate large language models (LLMs) in the Korean financial domain.
Addressing the limitations of existing English-centric benchmarks,
KFinEval-Pilot comprises over 1,000 curated questions across three critical
areas: financial knowledge, legal reasoning, and financial toxicity. The
benchmark is constructed through a semi-automated pipeline that combines
GPT-4-generated prompts with expert validation to ensure domain relevance and
factual accuracy. We evaluate a range of representative LLMs and observe
notable performance differences across models, with trade-offs between task
accuracy and output safety across different model families. These results
highlight persistent challenges in applying LLMs to high-stakes financial
applications, particularly in reasoning and safety. Grounded in real-world
financial use cases and aligned with the Korean regulatory and linguistic
context, KFinEval-Pilot serves as an early diagnostic tool for developing safer
and more reliable financial AI systems. |
---|---|