Is GPT-4 fair? An empirical analysis in automatic short answer grading

Bibliographic Details
Published in: Computers and Education: Artificial Intelligence, Vol. 8, p. 100428
Main Authors: Rodrigues, Luiz; Xavier, Cleon; Costa, Newarney; Gasevic, Dragan; Mello, Rafael Ferreira
Format: Journal Article
Language: English
Published: Elsevier Ltd, 01.06.2025
ISSN: 2666-920X
DOI: 10.1016/j.caeai.2025.100428

More Information
Summary: Short open-ended questions are a central resource in formative and summative assessments, in both face-to-face and online settings, from elementary to higher education. However, grading these questions remains challenging for instructors, drawing attention to the field of Automatic Short Answer Grading (ASAG). While ASAG has yielded valuable contributions to learning analytics, it often faces generalizability issues. Accordingly, the rapid advancement of Large Language Models (LLMs) has motivated their adoption to empower ASAG systems. Despite that, previous research has not investigated whether LLMs are fair graders in the context of ASAG. Therefore, this paper presents an empirical analysis aimed at understanding LLMs' fairness in ASAG by using human grades as a baseline, comparing them to GPT-4's grades, and investigating whether the LLM grades answers from varied groups of humans equivalently. Our results demonstrate that, while GPT-4 tended to be more lenient in its grading, it maintained consistent evaluation standards when assessing responses from different groups of students: GPT-4 remained consistent across questions from different subjects and levels of Bloom's taxonomy, and across students with different demographics. These findings suggest GPT-4 is a fair grader, supporting its potential to empower educators and developers in using and designing ASAG systems. Nevertheless, we recommend further research to investigate these findings further and to understand how to optimize GPT-4's grades.
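
The fairness check the abstract describes amounts to comparing LLM grades against a human baseline and asking whether the gap is stable across student groups. As a minimal illustrative sketch (not the authors' code; the column names human_grade, gpt4_grade, and group, and the simple per-group leniency metric, are assumptions for illustration), one could compute the average GPT-4-minus-human grade gap per group:

    # Hypothetical sketch: per-group leniency of an LLM grader relative to humans.
    # Column names (human_grade, gpt4_grade, group) are illustrative assumptions.
    import pandas as pd

    def leniency_by_group(df: pd.DataFrame, group_col: str = "group") -> pd.Series:
        """Mean (GPT-4 grade - human grade) per group; roughly equal gaps
        across groups would be consistent with group-invariant grading."""
        gap = df["gpt4_grade"] - df["human_grade"]
        return gap.groupby(df[group_col]).mean()

    # Toy usage: both groups show the same average leniency (~+0.67),
    # i.e., lenient but consistent, matching the pattern the paper reports.
    toy = pd.DataFrame({
        "human_grade": [3, 4, 2, 5, 1, 4],
        "gpt4_grade":  [4, 4, 3, 5, 2, 5],
        "group":       ["A", "A", "A", "B", "B", "B"],
    })
    print(leniency_by_group(toy))

A uniform positive gap across groups would indicate leniency without group bias; a gap that varies by group would be the fairness concern the study set out to test.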