Is GPT-4 fair? An empirical analysis in automatic short answer grading
Published in | Computers and Education: Artificial Intelligence, Vol. 8, p. 100428
---|---
Main Authors | , , , ,
Format | Journal Article
Language | English
Published | Elsevier Ltd, 01.06.2025
ISSN | 2666-920X
DOI | 10.1016/j.caeai.2025.100428
Summary: Short open-ended questions are a central resource in formative and summative assessments, in both face-to-face and online settings, from elementary to higher education. However, grading these questions remains challenging for instructors, drawing attention to the field of Automatic Short Answer Grading (ASAG). While ASAG has yielded valuable contributions to learning analytics, it often faces generalizability issues. Accordingly, the rapid advancement of Large Language Models (LLMs) has motivated their adoption to empower ASAG systems. Despite that, previous research has not investigated whether LLMs are fair graders in the context of ASAG. Therefore, this paper presents an empirical analysis aimed at understanding LLMs' fairness in ASAG: using human grades as a baseline, comparing them to GPT-4's grades, and investigating whether the LLM's grades are equivalent across answers from varied groups of humans. Our results demonstrate that, while GPT-4 tended to be more lenient in its grading, it maintained consistent evaluation standards when assessing responses from different groups of students: it remained consistent across questions from different subjects and levels of Bloom's taxonomy, and across students with different demographics. These findings suggest GPT-4 is a fair grader, supporting its potential to empower educators and developers in using and designing ASAG systems. Nevertheless, we recommend further research to validate these findings and to understand how to optimize GPT-4's grades.