Comparing human coding to two natural language processing algorithms in aspirations of people affected by Duchenne Muscular Dystrophy

Bibliographic Details
Published in	Journal of Methods and Measurement in the Social Sciences, Vol. 13, no. 1
Main Authors	Schwartz, Carolyn E.; Stark, Roland B.; Biletch, Elijah; Stuart, Richard B.B.
Format	Journal Article
Language	English
Published	University of Arizona Libraries, 2022-10-01
Subjects	efficiency; human; natural language processing; qualitative data
ISSN	2159-7855
DOI	10.2458/jmmss.5397

Abstract	Qualitative methods can enhance our understanding of constructs that have not been well portrayed and enable nuanced depiction of experience from study participants who have not been broadly studied. However, qualitative data require time and effort to train raters to achieve validity and reliability. This study compares recent advances in Natural Language Processing (NLP) models with human coding. This web-based study (N=1,253; 3,046 free-text entries, averaging 64 characters per entry) included people with Duchenne Muscular Dystrophy (DMD), their siblings, and a representative comparison group. Human raters (n=6) were trained over multiple sessions in content analysis following a comprehensive codebook. Three prompts addressed distinct aspects of participants’ aspirations. Unsupervised NLP was implemented using Latent Dirichlet Allocation (LDA), which extracts latent topics across all the free-text entries. Supervised NLP was implemented using a Bidirectional Encoder Representations from Transformers (BERT) model, which requires training the algorithm to recognize relevant human-coded themes across free-text entries. We compared the human-, LDA-, and BERT-coded themes. The study sample comprised 286 people with DMD, 355 DMD siblings, and 997 comparison participants, aged 8-69. Human coders generated 95 codes across the three prompts and had an average inter-rater reliability (Fleiss’s kappa) of 0.77, with a minimal rater effect (pseudo-R² = 4%). Compared to human coding, LDA did not yield easily interpretable themes. BERT correctly classified only 61-70% of the validation set. LDA and BERT required technical expertise to program and took approximately 1.15 minutes per open-text entry, compared to 1.18 minutes for human raters including training time. LDA and BERT provide potentially viable approaches to analyzing large-scale qualitative data, but both have limitations. When text entries are short, LDA yields latent topics that are hard to interpret. BERT accurately identified only about two-thirds of new statements. Humans provided reliable and cost-effective coding in the web-based context. The upfront training enables BERT to process enormous quantities of text data in future work, which should examine NLP’s predictive accuracy given different quantities of training data.
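The inter-rater reliability statistic named in the abstract, Fleiss's kappa, can be illustrated with a short sketch. This is not the authors' code; the record does not say what software they used. The function name `fleiss_kappa` and the example codes ("family", "career", "health") are hypothetical, and the sketch assumes every entry was coded by the same number of raters, each assigning exactly one code.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss's kappa for a list of items, each a list of one label per rater.

    Assumes every item has the same number of raters (at least two).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    counts = [Counter(item) for item in ratings]
    categories = {label for item in ratings for label in item}

    # Observed agreement: share of rater pairs that agree, averaged over items.
    pairs = n_raters * (n_raters - 1)
    p_bar = sum(
        (sum(c * c for c in cnt.values()) - n_raters) / pairs for cnt in counts
    ) / n_items

    # Chance agreement: sum of squared overall category proportions.
    total = n_items * n_raters
    p_e = sum((sum(cnt[cat] for cnt in counts) / total) ** 2 for cat in categories)

    if p_e == 1.0:
        return float("nan")  # only one category was ever used; kappa is undefined
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical example: three raters coding four free-text entries.
example = [
    ["family", "family", "family"],
    ["career", "career", "health"],
    ["health", "health", "health"],
    ["career", "family", "career"],
]
print(round(fleiss_kappa(example), 3))
```

Values near 1 indicate agreement well beyond chance, and values at or below 0 indicate agreement no better than chance; the 0.77 reported in the abstract is an average across prompts and codes.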
Author Affiliations	Schwartz, Carolyn E. (DeltaQuest Foundation and Tufts University School of Medicine); Stark, Roland B. (DeltaQuest Foundation); Biletch, Elijah (DeltaQuest Foundation); Stuart, Richard B.B. (DeltaQuest Foundation)
CorporateAuthor	Memorial Sloan Kettering Cancer Center
Discipline	Social Sciences (General)
EISSN	2159-7855
Open Access	Yes
Peer Reviewed	Yes
License	CC BY-NC-ND 4.0 (https://creativecommons.org/licenses/by-nc-nd/4.0)
OpenAccessLink	https://doaj.org/article/e46021ad015f4dc595cd84514e2e654f
URI	http://journals.librarypublishing.arizona.edu/jmmss/article/id/5397/download/pdf/ https://doaj.org/article/e46021ad015f4dc595cd84514e2e654f