TOSWT: A Dataset for Tracing the Origins of Students' Writing Texts Polished by Large Language Models

Bibliographic Details
Published in: 2025 13th International Conference on Information and Education Technology (ICIET), pp. 456-461
Main Authors: Lu, Pinren; Lin, Zhifeng; Zhang, Lin; Liu, Jiawen; Qu, Shaojie; Li, Kan
Format: Conference Proceeding
Language: English
Published: IEEE, 18.04.2025
DOI: 10.1109/ICIET66371.2025.11046280

Summary: In recent years, generative large language models (LLMs) have undergone rapid development, producing content that is nearly indistinguishable from human-written text. While this advancement has found widespread application across various fields, it has also raised significant concerns among educators regarding the authenticity of student submissions. Consequently, addressing the misuse of AI-generated text (AIGT) in the educational sector has become an urgent priority. Current detection strategies primarily focus on whole documents, which do not fully satisfy practical requirements. Due to the likelihood that students may modify AI-generated content to some extent before incorporating it into their essays, fine-grained detection, particularly at the sentence level, is of paramount importance. Consequently, the task of tracing text provenance has increasingly garnered attention. In light of this, this study innovatively proposes the task of text provenance tracing within the educational domain and constructs a corresponding dataset named TOSWT (Tracing the Origins of Students' Writing Texts). This dataset, which comprises texts generated by five outstanding large language models, is based on argumentative essays written by students and contains a total of 53,328 document-level and 147,976 sentence-level data samples. The study evaluates multiple deep learning detection models through experimental assessments on both document-level and sentence-level data. The results indicate that the task of text provenance tracing is highly challenging, with the sentence-level task proving particularly difficult.