De Novo Natural Language Processing Algorithm Accurately Identifies Myxofibrosarcoma From Pathology Reports

Bibliographic Details
Published in: Clinical Orthopaedics and Related Research, Vol. 483, no. 1, pp. 80-87
Main Authors: Lindsay, Sarah E.; Madison, Cecelia J.; Ramsey, Duncan C.; Doung, Yee-Cheen; Gundle, Kenneth R.
Format: Journal Article
Language: English
Published: Philadelphia, PA, Wolters Kluwer, 01.01.2025
Lippincott Williams & Wilkins Ovid Technologies
ISSN: 0009-921X, 1528-1132
DOI: 10.1097/CORR.0000000000003270

Summary:
Background: Available codes in the ICD-10 do not accurately reflect soft tissue sarcoma diagnoses, and this can result in an underrepresentation of soft tissue sarcoma in databases. The National VA Database provides a unique opportunity for soft tissue sarcoma investigation because of the availability of all clinical results and pathology reports. In the setting of soft tissue sarcoma, natural language processing (NLP) has the potential to be applied to clinical documents such as pathology reports to identify soft tissue sarcoma independent of ICD codes, allowing sarcoma researchers to build more comprehensive databases capable of answering a myriad of research questions.
Questions/purposes: (1) What proportion of patients with myxofibrosarcoma within the National VA Database would be missed by searching only by soft tissue sarcoma ICD codes? (2) Is a de novo NLP algorithm capable of analyzing pathology reports to accurately identify patients with myxofibrosarcoma?
Methods: All pathology reports (10.7 million) in the national VA corporate data warehouse were identified from 2003 to 2022. Using the word-search functionality, reports from 403 veterans were found to contain the term "myxofibrosarcoma." The resulting pathology reports were manually reviewed to develop a gold-standard cohort that contained only those veterans with pathologist-confirmed myxofibrosarcoma diagnoses. The cohort had a mean ± SD age of 70 ± 12 years, and 96% (287 of 300) were men. Diagnosis codes were abstracted, and differences in appropriate ICD coding were compared. An NLP algorithm was iteratively refined and tested using confounders, negation, and emphasis terms for myxofibrosarcoma. Sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and accuracy were calculated for the NLP-generated cohorts through comparison with the manually reviewed gold-standard cohorts.
Results: The records of 27% (81 of 300) of myxofibrosarcoma patients within the VA database were missing a sarcoma ICD code. A de novo NLP algorithm more accurately (92% [276 of 300]) identified patients with myxofibrosarcoma compared with ICD codes (73% [219 of 300]) or basic word searches (74% [300 of 403]) (p < 0.001). Three final algorithm models were generated with accuracies ranging from 92% to 100%.
Conclusion: An NLP algorithm can identify patients with myxofibrosarcoma from pathology reports with high accuracy, which is an improvement over ICD-based cohort creation and simple word search. This algorithm is freely available on GitHub (https://github.com/sarcoma-shark/myxofibrosarcoma-shark) and is available to facilitate external validation and improvement through testing in other cohorts.
Level of Evidence: Level II, diagnostic study.
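Illustrative sketch (not the published algorithm): the Methods describe flagging reports by term while accounting for negation, then scoring the flagged cohort against a manually reviewed gold standard. The Python below shows one minimal way such a pipeline could look. The negation cues, the 40-character look-back window, the function names, and the six toy reports are all assumptions made up for illustration; the authors' actual rules are in the GitHub repository cited in the Conclusion.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns for illustration; not taken from the published algorithm.
TARGET = re.compile(r"\bmyxofibrosarcoma\b", re.IGNORECASE)
NEGATION = re.compile(r"\b(?:no evidence of|negative for|not)\b", re.IGNORECASE)


def flag_report(text: str, window: int = 40) -> bool:
    """Flag a report if the term occurs with no negation cue in the preceding window."""
    for match in TARGET.finditer(text):
        preceding = text[max(0, match.start() - window):match.start()]
        if not NEGATION.search(preceding):
            return True
    return False


@dataclass
class Metrics:
    sensitivity: float
    specificity: float
    ppv: float
    npv: float
    accuracy: float


def _ratio(numerator: int, denominator: int) -> float:
    # An empty denominator means the metric is undefined for this sample.
    return numerator / denominator if denominator else float("nan")


def evaluate(predicted: list[bool], gold: list[bool]) -> Metrics:
    """Compare predicted flags with gold-standard labels via a 2x2 confusion matrix."""
    tp = sum(p and g for p, g in zip(predicted, gold))
    tn = sum(not p and not g for p, g in zip(predicted, gold))
    fp = sum(p and not g for p, g in zip(predicted, gold))
    fn = sum(not p and g for p, g in zip(predicted, gold))
    return Metrics(
        sensitivity=_ratio(tp, tp + fn),
        specificity=_ratio(tn, tn + fp),
        ppv=_ratio(tp, tp + fp),
        npv=_ratio(tn, tn + fn),
        accuracy=_ratio(tp + tn, tp + tn + fp + fn),
    )


# Toy reports with made-up gold labels (True = confirmed myxofibrosarcoma).
reports = [
    ("Final diagnosis: high-grade myxofibrosarcoma of the left thigh.", True),
    ("No evidence of myxofibrosarcoma in the resection margins.", False),
    ("Differential includes myxoma; not myxofibrosarcoma.", False),
    ("Benign lipoma.", False),
    ("Myxofibrosarcoma, grade 2, confirmed on review.", True),
    ("Prior outside report suggested myxofibrosarcoma; current specimen shows myxoma.", False),
]

predicted = [flag_report(text) for text, _ in reports]
gold = [label for _, label in reports]
m = evaluate(predicted, gold)
print(m)
```

The last toy report is flagged even though its gold label is negative, which is the kind of confounder the Methods say the algorithm was iteratively refined against. A plain word search corresponds to `flag_report` with the negation check removed.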
Bibliography:
K. R. Gundle ✉, Department of Orthopaedics & Rehabilitation, Oregon Health & Science University, 3181 Sam Jackson Park Road L597, Portland, OR 97239, USA, Email: gundle@ohsu.edu
The institution of one or more of the authors (KRG) has received, during the study period, funding from the Northwest Sarcoma Foundation.
Each author certifies that there are no funding or commercial associations (consultancies, stock ownership, equity interest, patent/licensing arrangements, etc.) that might pose a conflict of interest in connection with the submitted article related to the author or any immediate family members.
All ICMJE Conflict of Interest Forms for authors and Clinical Orthopaedics and Related Research® editors and board members are on file with the publication and can be viewed on request.
Ethical approval for this study was obtained from the institutional review board of the Portland VA Medical Center for chart review (IRB approval number: 4499).
This work was performed at the Portland VA Medical Center, Portland, OR, USA.
The contents do not represent the views of the U.S. Department of Veterans Affairs or the United States Government.