A text-mining approach to obtain detailed treatment information from free-text fields in population-based cancer registries: A study of non-small cell lung cancer in California

Bibliographic Details
Published in	PLoS ONE, Vol. 14, no. 2, p. e0212454
Main Authors	Maguire, Frances B.; Morris, Cyllene R.; Parikh-Patel, Arti; Cress, Rosemary D.; Keegan, Theresa H. M.; Li, Chin-Shang; Lin, Patrick S.; Kizer, Kenneth W.
Format	Journal Article
Language	English
Published	United States: Public Library of Science (PLoS), 22.02.2019
Subjects	Algorithms; Analysis; Antineoplastic Agents - therapeutic use; Artificial intelligence; Bevacizumab; California; Cancer; Cancer research; Cancer therapies; Cancer treatment; Carcinoma, Non-Small-Cell Lung - drug therapy; Care and treatment; Chemotherapy; Chronic illnesses; Computer and Information Sciences; Data Collection - statistics & numerical data; Data mining; Data Mining - methods; Data Mining - statistics & numerical data; Data processing; Dosage and administration; Electronic health records; Electronic Health Records - statistics & numerical data; Epidemiology; Female; Health sciences; Hematology; Humans; Identification; Informatics; Lung cancer; Lung diseases; Lung Neoplasms - drug therapy; Male; Medical diagnosis; Medical records; Medicine; Medicine and Health Sciences; Mining industry; Natural language processing; Non-small cell lung cancer; Non-small cell lung carcinoma; Nursing schools; Oncology; Patients; Pemetrexed; Perl; Physical Sciences; Population; Population-based studies; Public health; Registries - statistics & numerical data; Research and Analysis Methods; Respiratory system agents; Small cell lung cancer; Software; Studies; California; Sacramento California; United States > US
ISSN	1932-6203
DOI	10.1371/journal.pone.0212454

Summary:	Population-based cancer registries have treatment information for all patients, making them an excellent resource for population-level monitoring. However, specific treatment details, such as drug names, are contained in a free-text format that is difficult to process and summarize. We assessed the accuracy and efficiency of a text-mining algorithm to identify systemic treatments for lung cancer from free-text fields in the California Cancer Registry. The algorithm used Perl regular expressions in SAS 9.4 to search for treatments in 24,845 free-text records associated with 17,310 patients in California diagnosed with stage IV non-small cell lung cancer between 2012 and 2014. Our algorithm categorized treatments into six groups that align with National Comprehensive Cancer Network guidelines. We compared results to a manual review (gold standard) of the same records. Percent agreement ranged from 91.1% to 99.4%. Ranges for other measures were 0.71-0.92 (Kappa), 74.3%-97.3% (sensitivity), 92.4%-99.8% (specificity), 60.4%-96.4% (positive predictive value), and 92.9%-99.9% (negative predictive value). The text-mining algorithm used one-sixth of the time required for manual review. SAS-based text mining of free-text data can accurately detect systemic treatments administered to patients and save considerable time compared to manual review, maximizing the utility of the extant information in population-based cancer registries for comparative effectiveness research.
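The approach summarized above (regular-expression matching on free text, scored against manual review) can be illustrated with a minimal sketch. This is not the authors' SAS code: it uses Python's `re` module in place of Perl regular expressions in SAS, and the pattern table, the function names `detect`, `ratio`, and `agreement`, and the sample records are all assumptions made up for illustration. The study's actual patterns and its six treatment groups are not listed in this record.

```python
import re

# Hypothetical patterns: generic and brand names for two drugs that appear in
# this record's subject terms. The study's real patterns are not given here.
PATTERNS = {
    "bevacizumab": re.compile(r"\b(?:bevacizumab|avastin)\b", re.IGNORECASE),
    "pemetrexed": re.compile(r"\b(?:pemetrexed|alimta)\b", re.IGNORECASE),
}


def detect(text):
    """Return the labels whose pattern matches one free-text record."""
    return {label for label, rx in PATTERNS.items() if rx.search(text)}


def ratio(num, den):
    """Divide, returning NaN when the denominator is zero."""
    return num / den if den else float("nan")


def agreement(predicted, gold):
    """Score algorithm flags against manual-review flags (lists of bools)."""
    pairs = list(zip(predicted, gold))
    tp = sum(p and g for p, g in pairs)          # both say treated
    tn = sum(not p and not g for p, g in pairs)  # both say not treated
    fp = sum(p and not g for p, g in pairs)      # algorithm only
    fn = sum(not p and g for p, g in pairs)      # manual review only
    n = len(pairs)
    observed = ratio(tp + tn, n)
    # Agreement expected by chance, from the marginal totals (Cohen's kappa).
    expected = ratio((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn), n * n)
    return {
        "percent_agreement": observed,
        "kappa": ratio(observed - expected, 1 - expected),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


if __name__ == "__main__":
    # Invented records and invented manual-review labels, for illustration only.
    records = [
        "Carboplatin + Alimta x4 cycles",
        "pt received AVASTIN 15 mg/kg",
        "No chemo given",
        "pemetrexed maintenance started",
    ]
    manual_pemetrexed = [True, False, False, True]
    flagged = ["pemetrexed" in detect(r) for r in records]
    print(agreement(flagged, manual_pemetrexed))
```

Matching brand names alongside generic names and ignoring case are design choices of this sketch; a production pattern set would presumably also need to handle misspellings and abbreviations, which are common in free-text fields.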
Bibliography:	Competing Interests: The authors have declared that no competing interests exist. These authors also contributed equally to this work.