ChatGPT vs. Human Annotators: A Comprehensive Analysis of ChatGPT for Text Annotation

Bibliographic Details
Published in: Proceedings (IEEE International Conference on Machine Learning and Applications), pp. 602 - 609
Main Authors: Aldeen, Mohammed; Luo, Joshua; Lian, Ashley; Zheng, Venus; Hong, Allen; Yetukuri, Preethika; Cheng, Long
Format: Conference Proceeding
Language: English
Published: IEEE, 15.12.2023
ISSN: 1946-0759
DOI: 10.1109/ICMLA58977.2023.00089

Summary: In recent years, the field of Natural Language Processing (NLP) has witnessed a groundbreaking transformation with the emergence of large language models (LLMs). ChatGPT stands out among these models, attracting considerable public interest due to its impressive language generation capabilities. Researchers have been exploring the potential of using ChatGPT for data annotation tasks, aiming to discover more time-saving and cost-effective approaches. In this paper, we present a comprehensive evaluation of ChatGPT's data annotation capabilities across ten diverse datasets covering various subject areas and varying numbers of classes. To ensure the quality of our evaluation, we leveraged datasets that were previously annotated by human experts, providing a reliable benchmark for comparison. Through rigorous experimentation, we assessed the impact of different prompt strategies and model configurations on annotation performance. Our findings highlight ChatGPT's capability to handle most data annotation tasks, achieving an average accuracy of 78.2% across tasks. The banking-queries dataset stands out with an impressive 95.9% accuracy, while emotion classification presents challenges, yielding an accuracy of 57.5%. Our evaluation also shows the impact of prompt strategies on annotation performance and reveals significant performance differences between GPT models, with "gpt-4" achieving a higher average accuracy of 79.2% compared to 74.6% for "gpt-3.5". Our research provides valuable insights into the capabilities and limitations of ChatGPT in automating data annotation tasks.
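The paper does not reproduce its prompts or evaluation code in this record. As a purely illustrative sketch of the general setup the abstract describes (prompting a GPT model to assign one label per text and scoring the outputs against prior human annotations), one might write something like the following. The label set, prompt wording, and sample data here are hypothetical placeholders, not the authors' actual configuration; only the model identifiers "gpt-4" and "gpt-3.5-turbo" and the OpenAI Chat Completions API are standard.

```python
# Illustrative sketch only: the prompt, label set, and samples below are
# hypothetical placeholders, not the configuration used in the paper.
from openai import OpenAI  # pip install openai (v1+ client)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LABELS = ["joy", "sadness", "anger", "fear"]  # hypothetical emotion classes

def annotate(text: str, model: str = "gpt-4") -> str:
    """Ask the model to assign exactly one label to a text sample."""
    prompt = (
        f"Classify the following text into exactly one of these labels: "
        f"{', '.join(LABELS)}.\nRespond with the label only.\n\nText: {text}"
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # minimize sampling variance for annotation
    )
    return resp.choices[0].message.content.strip().lower()

# Score model annotations against previously collected human labels,
# mirroring the accuracy-against-expert-benchmark comparison in the paper.
samples = [("I can't stop smiling today!", "joy"),
           ("Everything feels hopeless.", "sadness")]
correct = sum(annotate(text) == gold for text, gold in samples)
print(f"Accuracy: {correct / len(samples):.1%}")
```

Swapping the `model` argument between "gpt-4" and "gpt-3.5-turbo" and varying the prompt wording is one simple way to reproduce the kind of model and prompt-strategy comparison the abstract reports.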