Comparing human coding to two natural language processing algorithms in aspirations of people affected by Duchenne Muscular Dystrophy

Bibliographic Details
Published in Journal of methods and measurement in the social sciences Vol. 13; no. 1
Main Authors Schwartz, Carolyn E., Stark, Roland B., Biletch, Elijah, Stuart, Richard B.B.
Format Journal Article
Language English
Published University of Arizona Libraries 01.10.2022
ISSN 2159-7855
DOI 10.2458/jmmss.5397

Abstract Qualitative methods can enhance our understanding of constructs that have not been well portrayed and enable nuanced depiction of experience from study participants who have not been broadly studied. However, qualitative data require time and effort to train raters to achieve validity and reliability. This study compares recent advances in Natural Language Processing (NLP) models with human coding. This web-based study (N=1,253; 3,046 free-text entries, averaging 64 characters per entry) included people with Duchenne Muscular Dystrophy (DMD), their siblings, and a representative comparison group. Human raters (n=6) were trained over multiple sessions in content analysis following a comprehensive codebook. Three prompts addressed distinct aspects of participants’ aspirations. Unsupervised NLP was implemented using Latent Dirichlet Allocation (LDA), which extracts latent topics across all the free-text entries. Supervised NLP was implemented using a Bidirectional Encoder Representations from Transformers (BERT) model, which requires training the algorithm to recognize relevant human-coded themes across free-text entries. We compared the human-, LDA-, and BERT-coded themes. The study sample contained 286 people with DMD, 355 DMD siblings, and 997 comparison participants, aged 8-69. Human coders generated 95 codes across the three prompts and had an average inter-rater reliability (Fleiss’s kappa) of 0.77, with a minimal rater effect (pseudo-R² = 4%). Unlike human coding, LDA did not yield easily interpretable themes. BERT correctly classified only 61-70% of the validation set. LDA and BERT required technical expertise to program and took approximately 1.15 minutes per open-text entry, compared to 1.18 minutes for human raters including training time. LDA and BERT provide potentially viable approaches to analyzing large-scale qualitative data, but both have limitations. When text entries are short, LDA yields latent topics that are hard to interpret. BERT accurately identified only about two-thirds of new statements. Humans provided reliable and cost-effective coding in the web-based context. The upfront training enables BERT to process enormous quantities of text data in future work, which should examine NLP’s predictive accuracy given different quantities of training data.
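
As a rough illustration of the inter-rater reliability statistic reported in the abstract, the sketch below computes Fleiss's kappa from a table of rating counts. It is not the authors' code: the function name `fleiss_kappa` and the input layout (one row per text entry, one column per code, each cell holding the number of raters who chose that code) are assumptions made for this example, and it presumes every entry was rated by the same number of raters.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss's kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j (hypothetical layout)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("at least two raters per item are required")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions.
    total = n_items * n_raters
    n_categories = len(counts[0])
    proportions = [
        sum(row[j] for row in counts) / total for j in range(n_categories)
    ]
    expected = sum(p * p for p in proportions)

    if expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (observed - expected) / (1.0 - expected)
```

A value near 1 indicates agreement well above chance, and a value near 0 indicates agreement no better than chance. The reported 0.77 is an average across prompts, so a comparable figure would presumably come from computing kappa per prompt and averaging.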
Author_xml – sequence: 1
  givenname: Carolyn E.
  surname: Schwartz
  fullname: Schwartz, Carolyn E.
  organization: DeltaQuest Foundation and Tufts University School of Medicine
– sequence: 2
  givenname: Roland B.
  surname: Stark
  fullname: Stark, Roland B.
  organization: DeltaQuest Foundation
– sequence: 3
  givenname: Elijah
  surname: Biletch
  fullname: Biletch, Elijah
  organization: DeltaQuest Foundation
– sequence: 4
  givenname: Richard B.B.
  surname: Stuart
  fullname: Stuart, Richard B.B.
  organization: DeltaQuest Foundation
ContentType Journal Article
CorporateAuthor Memorial Sloan Kettering Cancer Center
CorporateAuthor_xml – name: Memorial Sloan Kettering Cancer Center
Discipline Social Sciences (General)
EISSN 2159-7855
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Issue 1
Language English
License https://creativecommons.org/licenses/by-nc-nd/4.0
cc-by-nc-nd
OpenAccessLink https://doaj.org/article/e46021ad015f4dc595cd84514e2e654f
PublicationDate 2022-10-1
2022-10-01
PublicationDateYYYYMMDD 2022-10-01
PublicationDate_xml – month: 10
  year: 2022
  text: 2022-10-1
  day: 01
PublicationDecade 2020
PublicationTitle Journal of methods and measurement in the social sciences
PublicationYear 2022
Publisher University of Arizona Libraries
Publisher_xml – name: University of Arizona Libraries
SubjectTerms efficiency
human
natural language processing
qualitative data
Title Comparing human coding to two natural language processing algorithms in aspirations of people affected by Duchenne Muscular Dystrophy
URI http://journals.librarypublishing.arizona.edu/jmmss/article/id/5397/download/pdf/
https://doaj.org/article/e46021ad015f4dc595cd84514e2e654f
UnpaywallVersion publishedVersion
Volume 13