Identifying primary and recurrent cancers using a SAS-based natural language processing algorithm

Bibliographic Details
Published in: Journal of the American Medical Informatics Association : JAMIA, Vol. 20, no. 2, pp. 349-355
Main Authors: Strauss, Justin A; Chao, Chun R; Kwan, Marilyn L; Ahmed, Syed A; Schottinger, Joanne E; Quinn, Virginia P
Format: Journal Article
Language: English
Published: England, BMJ Group, 01.03.2013
ISSN: 1067-5027, 1527-974X
DOI: 10.1136/amiajnl-2012-000928

Summary: Significant limitations exist in the timely and complete identification of primary and recurrent cancers for clinical and epidemiologic research. A SAS-based coding, extraction, and nomenclature tool (SCENT) was developed to address this problem. SCENT employs hierarchical classification rules to identify and extract information from electronic pathology reports. Reports are analyzed and coded using a dictionary of clinical concepts and associated SNOMED codes. To assess the accuracy of SCENT, validation was conducted using manual review of pathology reports from a random sample of 400 breast and 400 prostate cancer patients diagnosed at Kaiser Permanente Southern California. Trained abstractors classified the malignancy status of each report. Classifications of SCENT were highly concordant with those of abstractors, achieving κ of 0.96 and 0.95 in the breast and prostate cancer groups, respectively. SCENT identified 51 of 54 new primary and 60 of 61 recurrent cancer cases across both groups, with only three false positives in 792 true benign cases. Measures of sensitivity, specificity, positive predictive value, and negative predictive value exceeded 94% in both cancer groups. Favorable validation results suggest that SCENT can be used to identify, extract, and code information from pathology report text. Consequently, SCENT has wide applicability in research and clinical care. Further assessment will be needed to validate performance with other clinical text sources, particularly those with greater linguistic variability. SCENT is proof of concept for SAS-based natural language processing applications that can be easily shared between institutions and used to support clinical and epidemiologic research.
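The pooled counts quoted in the summary can be turned into rates with simple arithmetic. The following is a minimal Python sketch, not the authors' SAS code or their per-group analysis: the function and key names are hypothetical, and it assumes the 792 "true benign" cases and the three false positives can be read as a single pooled benign class.

```python
# Counts quoted in the summary, pooled across the breast and prostate groups.
NEW_PRIMARY_FOUND, NEW_PRIMARY_TOTAL = 51, 54
RECURRENT_FOUND, RECURRENT_TOTAL = 60, 61
FALSE_POSITIVES, BENIGN_TOTAL = 3, 792


def proportion(numerator, denominator):
    """Return numerator / denominator, rejecting non-positive denominators."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return numerator / denominator


def pooled_rates():
    """Compute pooled detection rates and specificity (hypothetical helper)."""
    return {
        # Share of abstractor-confirmed new primaries that SCENT identified.
        "new_primary_detection": proportion(NEW_PRIMARY_FOUND, NEW_PRIMARY_TOTAL),
        # Share of abstractor-confirmed recurrences that SCENT identified.
        "recurrent_detection": proportion(RECURRENT_FOUND, RECURRENT_TOTAL),
        # Share of true benign cases that SCENT did not flag as malignant.
        "benign_specificity": proportion(BENIGN_TOTAL - FALSE_POSITIVES, BENIGN_TOTAL),
    }


if __name__ == "__main__":
    for name, value in pooled_rates().items():
        print(f"{name}: {value:.1%}")
```

These pooled figures are consistent with the statement that the reported measures exceeded 94%, but they are not a substitute for the per-group sensitivity, specificity, and predictive values in the article itself.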
Bibliography: The HMO Cancer Research Network (CRN) consists of the research programs, enrollee populations and databases of 14 HMO members of the HMO Research Network. The overall goal of the CRN is to conduct collaborative research to determine the effectiveness of preventive, curative, and supportive interventions for major cancers that span the natural history of those cancers among diverse populations and health systems. The 14 health plans, with nearly 11 million enrollees, are distinguished by their longstanding commitment to prevention and research, and collaboration among themselves and with affiliated academic institutions.