Automatic Essay Scoring Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses

Bibliographic Details
Published in Dialogue and discourse Vol. 14; no. 1; pp. 1-33
Main Authors Kumar, Yaman; Parekh, Swapnil; Singh, Somesh; Li, Junyi Jessy; Shah, Rajiv Ratn; Chen, Changyou
Format Journal Article
Language English
Published Chatham Dialogue & Discourse 2023
Subjects
ISSN 2152-9620
DOI 10.5210/dad.2023.101

Abstract Deep-learning-based Automatic Essay Scoring (AES) systems are actively used in high-stakes applications in education and testing. However, little research has been devoted to understanding and interpreting the black-box nature of deep-learning-based scoring algorithms. While previous studies indicate that scoring models can be easily fooled, in this paper we explore the reason behind their surprising adversarial brittleness. We use recent advances in interpretability to determine the extent to which features such as coherence, content, vocabulary, and relevance matter to automated scoring mechanisms. We use this to investigate the oversensitivity (i.e., a large change in output score from a small change in input essay content) and overstability (i.e., little change in output score despite large changes in input essay content) of AES. Our results indicate that autoscoring models, despite being trained as “end-to-end” models with rich contextual embeddings such as BERT, behave like bag-of-words models. A few words determine the essay score without requiring any context, making the model largely overstable. This is in stark contrast to recent probing studies on pre-trained representation learning models, which show that they encode rich linguistic features such as parts of speech and morphology. Further, we find that the models have learnt dataset biases, making them oversensitive. The presence of a few words with high co-occurrence with a certain score class makes the model associate the essay sample with that score. This causes score changes in ∼95% of samples with the addition of only a few words. To deal with these issues, we propose detection-based protection models that can detect oversensitivity and samples causing overstability with high accuracy. We find that our proposed models are able to detect unusual attribution patterns and flag adversarial samples successfully.
Author Kumar, Yaman
Parekh, Swapnil
Li, Junyi Jessy
Singh, Somesh
Shah, Rajiv Ratn
Chen, Changyou
Author_xml – sequence: 1
  givenname: Yaman
  surname: Kumar
  fullname: Kumar, Yaman
– sequence: 2
  givenname: Swapnil
  surname: Parekh
  fullname: Parekh, Swapnil
– sequence: 3
  givenname: Somesh
  surname: Singh
  fullname: Singh, Somesh
– sequence: 4
  givenname: Junyi Jessy
  surname: Li
  fullname: Li, Junyi Jessy
– sequence: 5
  givenname: Rajiv Ratn
  surname: Shah
  fullname: Shah, Rajiv Ratn
– sequence: 6
  givenname: Changyou
  surname: Chen
  fullname: Chen, Changyou
CitedBy_id crossref_primary_10_1016_j_csl_2024_101700
crossref_primary_10_1145_3609468_3609474
crossref_primary_10_1016_j_compeleceng_2024_109308
ContentType Journal Article
Copyright Copyright Dialogue & Discourse 2023
Copyright_xml – notice: Copyright Dialogue & Discourse 2023
DBID AAYXX
CITATION
7T9
ADTOC
UNPAY
DOI 10.5210/dad.2023.101
DatabaseName CrossRef
Linguistics and Language Behavior Abstracts (LLBA)
Unpaywall for CDI: Periodical Content
Unpaywall
DatabaseTitle CrossRef
Linguistics and Language Behavior Abstracts (LLBA)
DatabaseTitleList CrossRef
Linguistics and Language Behavior Abstracts (LLBA)
Database_xml – sequence: 1
  dbid: UNPAY
  name: Unpaywall
  url: https://proxy.k.utb.cz/login?url=https://unpaywall.org/
  sourceTypes: Open Access Repository
DeliveryMethod fulltext_linktorsrc
Discipline Languages & Literatures
EISSN 2152-9620
EndPage 33
ExternalDocumentID 10.5210/dad.2023.101
10_5210_dad_2023_101
GroupedDBID 5VS
AAYXX
ADBBV
ALMA_UNASSIGNED_HOLDINGS
BCNDV
CITATION
GROUPED_DOAJ
H13
KWQ
M~E
OK1
TR2
7T9
ADTOC
C1A
UNPAY
IEDL.DBID UNPAY
ISSN 2152-9620
IngestDate Wed Oct 01 16:38:26 EDT 2025
Mon Jun 30 06:23:00 EDT 2025
Thu Apr 24 23:10:13 EDT 2025
Tue Jul 01 01:22:48 EDT 2025
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Issue 1
Language English
License cc-by
LinkModel DirectLink
Notes ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 14
OpenAccessLink https://proxy.k.utb.cz/login?url=https://journals.uic.edu/ojs/index.php/dad/article/download/12448/11012
PQID 2805527006
PQPubID 2037706
PageCount 33
ParticipantIDs unpaywall_primary_10_5210_dad_2023_101
proquest_journals_2805527006
crossref_citationtrail_10_5210_dad_2023_101
crossref_primary_10_5210_dad_2023_101
ProviderPackageCode CITATION
AAYXX
PublicationCentury 2000
PublicationDate 2023-00-00
PublicationDateYYYYMMDD 2023-01-01
PublicationDate_xml – year: 2023
  text: 2023-00-00
PublicationDecade 2020
PublicationPlace Chatham
PublicationPlace_xml – name: Chatham
PublicationTitle Dialogue and discourse
PublicationYear 2023
Publisher Dialogue & Discourse
Publisher_xml – name: Dialogue & Discourse
SSID ssj0000397989
Score 2.2862344
SourceID unpaywall
proquest
crossref
SourceType Open Access Repository
Aggregation Database
Enrichment Source
Index Database
StartPage 1
SubjectTerms Algorithms
Deep learning
Morphology
Tests
Title Automatic Essay Scoring Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses
URI https://www.proquest.com/docview/2805527006
https://journals.uic.edu/ojs/index.php/dad/article/download/12448/11012
UnpaywallVersion publishedVersion
Volume 14
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
journalDatabaseRights – providerCode: PRVAON
  databaseName: DOAJ Directory of Open Access Journals
  customDbUrl:
  eissn: 2152-9620
  dateEnd: 99991231
  omitProxy: true
  ssIdentifier: ssj0000397989
  issn: 2152-9620
  databaseCode: DOA
  dateStart: 20100101
  isFulltext: true
  titleUrlDefault: https://www.doaj.org/
  providerName: Directory of Open Access Journals
– providerCode: PRVHPJ
  databaseName: ROAD: Directory of Open Access Scholarly Resources
  customDbUrl:
  eissn: 2152-9620
  dateEnd: 99991231
  omitProxy: true
  ssIdentifier: ssj0000397989
  issn: 2152-9620
  databaseCode: M~E
  dateStart: 20100101
  isFulltext: true
  titleUrlDefault: https://road.issn.org
  providerName: ISSN International Centre