Automatic Essay Scoring Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses
| Published in | Dialogue and discourse Vol. 14; no. 1; pp. 1 - 33 |
|---|---|
| Main Authors | Kumar, Yaman; Parekh, Swapnil; Singh, Somesh; Li, Junyi Jessy; Shah, Rajiv Ratn; Chen, Changyou |
| Format | Journal Article |
| Language | English |
| Published | Chatham: Dialogue & Discourse, 2023 |
| Subjects | Algorithms, Deep learning, Morphology, Tests |
| ISSN | 2152-9620 |
| DOI | 10.5210/dad.2023.101 |
| Abstract | Deep-learning-based Automatic Essay Scoring (AES) systems are being actively used in various high-stakes applications in education and testing. However, little research has gone into understanding and interpreting the black-box nature of deep-learning-based scoring algorithms. While previous studies indicate that scoring models can be easily fooled, in this paper, we explore the reason behind their surprising adversarial brittleness. We utilize recent advances in interpretability to find the extent to which features such as coherence, content, vocabulary, and relevance are important for automated scoring mechanisms. We use this to investigate the oversensitivity (i.e., a large change in output score from a small change in input essay content) and overstability (i.e., little change in output score despite large changes in input essay content) of AES. Our results indicate that autoscoring models, despite being trained as “end-to-end” models with rich contextual embeddings such as BERT, behave like bag-of-words models. A few words determine the essay score without requiring any context, making the model largely overstable. This is in stark contrast to recent probing studies on pre-trained representation learning models, which show that rich linguistic features such as parts-of-speech and morphology are encoded by them. Further, we also find that the models have learnt dataset biases, making them oversensitive. The presence of a few words with high co-occurrence with a certain score class makes the model associate the essay sample with that score. This causes score changes in ∼95% of samples with the addition of only a few words. To deal with these issues, we propose detection-based protection models that can detect oversensitivity and samples causing overstability with high accuracy. We find that our proposed models are able to detect unusual attribution patterns and flag adversarial samples successfully. |
| Author | Kumar, Yaman; Parekh, Swapnil; Li, Junyi Jessy; Singh, Somesh; Shah, Rajiv Ratn; Chen, Changyou |
| Copyright | Copyright Dialogue & Discourse 2023 |
| Discipline | Languages & Literatures |
| IsOpenAccess | true |
| IsPeerReviewed | true |
| IsScholarly | true |
| License | cc-by |
| SubjectTerms | Algorithms, Deep learning, Morphology, Tests |
| URI | https://www.proquest.com/docview/2805527006 https://journals.uic.edu/ojs/index.php/dad/article/download/12448/11012 |
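
The two failure modes defined in the abstract can be illustrated with a small sketch. Everything below is an assumption made for illustration: `toy_score` and `KEYWORD_WEIGHTS` are hypothetical, and the scorer is a deliberately simple bag-of-words stand-in, not one of the BERT-based models the paper studies.

```python
# Hypothetical toy scorer for illustration only; NOT the models from the paper.
# It sums fixed weights for known keywords and ignores word order entirely,
# which is what "behaving like a bag-of-words model" means here.
KEYWORD_WEIGHTS = {"evidence": 2, "therefore": 1, "analysis": 2}


def toy_score(essay: str) -> int:
    """Return the sum of keyword weights found in the essay."""
    return sum(KEYWORD_WEIGHTS.get(word, 0) for word in essay.lower().split())


original = "the analysis shows evidence and therefore the claim holds"

# Overstability: a large change in the input (words sorted into an
# incoherent order) produces no change in the score.
shuffled = " ".join(sorted(original.split()))

# Oversensitivity: a small change in the input (two keywords appended)
# produces a large change in the score.
padded = original + " evidence analysis"

print(toy_score(original), toy_score(shuffled), toy_score(padded))
```

In this sketch the sorted essay keeps the original score even though it is no longer coherent, while appending two high-weight words nearly doubles the score.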