Automatic Essay Scoring Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses

Bibliographic Details
Published in	Dialogue and Discourse, Vol. 14, no. 1, pp. 1-33
Main Authors	Kumar, Yaman; Parekh, Swapnil; Singh, Somesh; Li, Junyi Jessy; Shah, Rajiv Ratn; Chen, Changyou
Format	Journal Article
Language	English
Published	Chatham: Dialogue & Discourse, 2023
Subjects	Algorithms; Deep learning; Morphology; Tests
ISSN	2152-9620
DOI	10.5210/dad.2023.101

Abstract	Deep-learning-based Automatic Essay Scoring (AES) systems are actively used in various high-stakes applications in education and testing. However, little research has gone into understanding and interpreting the black-box nature of deep-learning-based scoring algorithms. While previous studies indicate that scoring models can be easily fooled, in this paper we explore the reason behind their surprising adversarial brittleness. We use recent advances in interpretability to determine the extent to which features such as coherence, content, vocabulary, and relevance matter for automated scoring mechanisms. We use this to investigate the oversensitivity (i.e., a large change in output score with a small change in input essay content) and overstability (i.e., little change in output score with large changes in input essay content) of AES. Our results indicate that autoscoring models, despite being trained as “end-to-end” models with rich contextual embeddings such as BERT, behave like bag-of-words models. A few words determine the essay score without requiring any context, making the model largely overstable. This is in stark contrast to recent probing studies on pre-trained representation learning models, which show that they encode rich linguistic features such as parts of speech and morphology. Further, we find that the models have learnt dataset biases, making them oversensitive. The presence of a few words with high co-occurrence with a certain score class makes the model associate the essay sample with that score. This causes score changes in ∼95% of samples with the addition of only a few words. To deal with these issues, we propose detection-based protection models that can detect oversensitivity and samples causing overstability with high accuracy. We find that our proposed models are able to detect unusual attribution patterns and flag adversarial samples successfully.
Copyright	Dialogue & Discourse 2023
Open Access	Yes
Peer Reviewed	Yes
License	CC BY
Open Access Link	https://proxy.k.utb.cz/login?url=https://journals.uic.edu/ojs/index.php/dad/article/download/12448/11012