Extraction of radiographic findings from unstructured thoracoabdominal computed tomography reports using convolutional neural network based natural language processing


Bibliographic Details
Published in PloS one Vol. 15; no. 7; p. e0236827
Main Authors Pandey, Mohit, Xu, Zhuoran, Sholle, Evan, Maliakal, Gabriel, Singh, Gurpreet, Fatima, Zahra, Larine, Daria, Lee, Benjamin C., Wang, Jing, van Rosendael, Alexander R., Baskaran, Lohendran, Shaw, Leslee J., Min, James K., Al’Aref, Subhi J.
Format Journal Article
Language English
Published United States Public Library of Science 30.07.2020
Public Library of Science (PLoS)
ISSN 1932-6203
DOI 10.1371/journal.pone.0236827

Summary: Heart failure (HF) is a major cause of morbidity and mortality. However, much of the clinical data is unstructured in the form of radiology reports, while the process of data collection and curation is arduous and time-consuming. We utilized a machine learning (ML)-based natural language processing (NLP) approach to extract clinical terms from unstructured radiology reports. Additionally, we investigate the prognostic value of the extracted data in predicting all-cause mortality (ACM) in HF patients. This observational cohort study utilized 122,025 thoracoabdominal computed tomography (CT) reports from 11,808 HF patients obtained between 2008 and 2018. 1,560 CT reports were manually annotated for the presence or absence of 14 radiographic findings, in addition to age and gender. Thereafter, a Convolutional Neural Network (CNN) was trained, validated and tested to determine the presence or absence of these features. Further, the ability of CNN to predict ACM was evaluated using Cox regression analysis on the extracted features. 11,808 CT reports were analyzed from 11,808 patients (mean age 72.8 ± 14.8 years; 52.7% (6,217/11,808) male) from whom 3,107 died during the 10.6-year follow-up. The CNN demonstrated excellent accuracy for retrieval of the 14 radiographic findings with area-under-the-curve (AUC) ranging between 0.83-1.00 (F1 score 0.84-0.97). Cox model showed the time-dependent AUC for predicting ACM was 0.747 (95% confidence interval [CI] of 0.704-0.790) at 30 days. An ML-based NLP approach to unstructured CT reports demonstrates excellent accuracy for the extraction of predetermined radiographic findings, and provides prognostic value in HF patients.
Bibliography:
Current address: Department of Cardiology, Leiden University Medical Center, Leiden, The Netherlands
Current address: Cleerly, Inc, New York, New York, United States of America
Current address: Department of Cardiovascular Medicine, National Heart Centre Singapore, Singapore, Singapore
Competing Interests: The authors have declared that no competing interests exist. Gurpreet Singh is currently employed at GlaxoSmithKline but was not a part of GlaxoSmithKline during the conduct of this study. Gabriel Maliakal and James K. Min are currently employed at Cleerly Inc. but were not a part of Cleerly Inc. during the conduct of this study. Mohit Pandey is currently employed at Ipsos but was not a part of Ipsos during the conduct of this study. These commercial affiliations do not alter our adherence to PLOS ONE policies on sharing data and materials.
Current address: Ipsos US Public Affairs, New York, New York, United States of America
Current address: GlaxoSmithKline, Pennsylvania, Pennsylvania, United States of America