Khayyam Challenge (PersianMMLU): Is Your LLM Truly Wise to The Persian Language?
Main Authors | , , , , , , , |
---|---|
Format | Journal Article |
Language | English |
Published | 09.04.2024 |
DOI | 10.48550/arxiv.2404.06644 |
Summary: | Evaluating Large Language Models (LLMs) is challenging due to their
generative nature, necessitating precise evaluation methodologies.
Additionally, non-English LLM evaluation lags behind English, resulting in the
absence or weakness of LLMs for many languages. In response to this necessity,
we introduce Khayyam Challenge (also known as PersianMMLU), a meticulously
curated collection comprising 20,192 four-choice questions sourced from 38
diverse tasks extracted from Persian examinations, spanning a wide spectrum of
subjects, complexities, and ages. The primary objective of the Khayyam
Challenge is to facilitate the rigorous evaluation of LLMs that support the
Persian language. Distinctive features of the Khayyam Challenge are (i) its
comprehensive coverage of various topics, including literary comprehension,
mathematics, sciences, logic, and intelligence testing, aimed at assessing
different facets of LLMs such as language comprehension, reasoning, and
information retrieval across educational stages from lower primary
school to upper secondary school; (ii) its inclusion of rich metadata such as
human response rates, difficulty levels, and descriptive answers; (iii) its
use of new data to avoid the data contamination issues prevalent in
existing frameworks; (iv) its use of original, non-translated data tailored for
Persian speakers, ensuring the framework is free from translation challenges
and errors while encompassing cultural nuances; and (v) its inherent scalability
for future data updates and evaluations without requiring special human effort.
Previous works lacked an evaluation framework that combined all of these
features into a single comprehensive benchmark. Furthermore, we evaluate a wide
range of existing LLMs that support the Persian language, with statistical
analyses and interpretations of their outputs. |
---|---|
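
The summary describes four-choice questions that carry metadata such as difficulty levels. As a minimal sketch of how a model's answers on such a set might be scored per difficulty level, the Python below uses a hypothetical `Question` record and a hypothetical `accuracy_by_difficulty` helper; the field names, difficulty labels, and placeholder questions are assumptions made for illustration and are not taken from the benchmark's actual data format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Question:
    """Hypothetical record for one four-choice question (assumed layout)."""
    text: str
    choices: tuple[str, str, str, str]
    answer: int       # index (0-3) of the correct choice
    difficulty: str   # assumed metadata label, e.g. "easy" or "hard"


def accuracy_by_difficulty(questions, predictions):
    """Return the fraction of correct predictions for each difficulty label."""
    if len(questions) != len(predictions):
        raise ValueError("questions and predictions must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for question, predicted in zip(questions, predictions):
        total[question.difficulty] += 1
        if predicted == question.answer:
            correct[question.difficulty] += 1
    return {level: correct[level] / total[level] for level in total}


# Placeholder questions, invented purely for this illustration.
questions = [
    Question("q1", ("a", "b", "c", "d"), 0, "easy"),
    Question("q2", ("a", "b", "c", "d"), 2, "easy"),
    Question("q3", ("a", "b", "c", "d"), 1, "hard"),
]
predictions = [0, 1, 1]  # choice indices a model might have selected
scores = accuracy_by_difficulty(questions, predictions)
print(scores)
```

Grouping by a metadata field in this way is one simple option for comparing model accuracy across difficulty levels; the same pattern would apply to any other per-question label, such as subject or educational stage.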