TOSWT: A Dataset for Tracing the Origins of Students' Writing Texts Polished by Large Language Models

Bibliographic Details
Published in	2025 13th International Conference on Information and Education Technology (ICIET), pp. 456-461
Main Authors	Lu, Pinren; Lin, Zhifeng; Zhang, Lin; Liu, Jiawen; Qu, Shaojie; Li, Kan
Format	Conference Proceeding
Language	English
Published	IEEE 18.04.2025
Subjects	Adaptation models; Benchmark testing; Data models; Deep learning; detect; Education; Graph neural networks; large language model; Large language models; Text detection; text provenance; tracing origin; Transfer learning; Writing
DOI	10.1109/ICIET66371.2025.11046280

Summary:	In recent years, generative large language models (LLMs) have undergone rapid development, producing content that is nearly indistinguishable from human-written text. While this advancement has found widespread application across various fields, it has also raised significant concerns among educators regarding the authenticity of student submissions. Consequently, addressing the misuse of AI-generated text (AIGT) in the educational sector has become an urgent priority. Current detection strategies primarily focus on whole documents, which do not fully satisfy practical requirements. Due to the likelihood that students may modify AI-generated content to some extent before incorporating it into their essays, fine-grained detection, particularly at the sentence level, is of paramount importance. Consequently, the task of tracing text provenance has increasingly garnered attention. In light of this, this study innovatively proposes the task of text provenance tracing within the educational domain and constructs a corresponding dataset named TOSWT (Tracing the Origins of Students' Writing Texts). This dataset, which comprises texts generated by five outstanding large language models, is based on argumentative essays written by students and contains a total of 53,328 document-level and 147,976 sentence-level data samples. The study evaluates multiple deep learning detection models through experimental assessments on both document-level and sentence-level data. The results indicate that the task of text provenance tracing is highly challenging, with the sentence-level task proving particularly difficult.