Construction of Cohorts of Similar Patients From Automatic Extraction of Medical Concepts: Phenotype Extraction Study

Bibliographic Details
Published in	JMIR medical informatics Vol. 10; no. 12; p. e42379
Main Authors	Gérardin, Christel; Mageau, Arthur; Mékinian, Arsène; Tannier, Xavier; Carrat, Fabrice
Format	Journal Article
Language	English
Published	Canada: JMIR Publications, 19.12.2022
Subjects	Algorithms; Annotations; Classification; Codes; Computer Science; Data Structures and Algorithms; Datasets; Electronic health records; Genotype & phenotype; Hospitalization; Lupus; Natural language processing; Original Paper; Patients; Physiology; Rare diseases; Scleroderma; Womens health; France; Phenotype; Text extraction; MeSH; Similar patient cohort; Systemic disease; Data extraction; NLP; Named entity; Medical subject heading; Automated extraction; Algorithm; Automatic extraction
ISSN	2291-9694
DOI	10.2196/42379

Summary:	Reliable and interpretable automatic extraction of clinical phenotypes from large electronic medical record databases remains a challenge, especially in a language other than English. We aimed to provide an automated end-to-end extraction of cohorts of similar patients from electronic health records for systemic diseases. Our multistep algorithm includes a named-entity recognition step, a multilabel classification step using the Medical Subject Headings (MeSH) ontology, and the computation of patient similarity. We then selected cohorts of similar patients for phenotypes annotated a priori. Six phenotypes were selected for their clinical significance: P1, osteoporosis; P2, nephritis in systemic lupus erythematosus; P3, interstitial lung disease in systemic sclerosis; P4, lung infection; P5, obstetric antiphospholipid syndrome; and P6, Takayasu arteritis. We used a training set of 151 clinical notes and an independent validation set of 256 clinical notes, both with annotated phenotypes and both extracted from the Assistance Publique-Hôpitaux de Paris data warehouse. For each phenotype, we evaluated the 3 patients closest to the index patient with precision-at-3, and we also computed recall and average precision. For P1-P4, the precision-at-3 ranged from 0.85 (95% CI 0.75-0.95) to 0.99 (95% CI 0.98-1), the recall ranged from 0.53 (95% CI 0.50-0.55) to 0.83 (95% CI 0.81-0.84), and the average precision ranged from 0.58 (95% CI 0.54-0.62) to 0.88 (95% CI 0.85-0.90). Phenotypes P5 and P6 could not be analyzed because too few instances of them were available. Using a method close to clinical reasoning, we built a scalable and interpretable end-to-end algorithm for extracting cohorts of similar patients.
Bibliography:	PMCID: PMC9808583
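
Illustration:	The summary reports precision-at-3 and average precision for the patients ranked closest to an index patient. The sketch below shows how such metrics could be computed on a ranked list. It is not the authors' implementation: the summary does not state which similarity measure was used, so cosine similarity is an assumption here, and the patient vectors, patient ids, and function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_neighbours(index_vec, others):
    """Return patient ids sorted from most to least similar to the index patient."""
    return sorted(others, key=lambda pid: cosine(index_vec, others[pid]), reverse=True)


def precision_at_k(ranked, relevant, k=3):
    """Fraction of the top-k ranked patients that carry the phenotype."""
    return sum(1 for pid in ranked[:k] if pid in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values taken at the rank of each relevant patient."""
    hits, total = 0, 0.0
    for rank, pid in enumerate(ranked, start=1):
        if pid in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


# Hypothetical toy data: each vector holds one score per concept label.
index_patient = [1, 1, 0, 0]
candidates = {
    "A": [1, 1, 0, 0],
    "B": [1, 0, 0, 0],
    "C": [0, 0, 1, 1],
    "D": [1, 1, 1, 0],
}
has_phenotype = {"A", "B"}  # hypothetical annotations

ranked = rank_neighbours(index_patient, candidates)
p_at_3 = precision_at_k(ranked, has_phenotype, k=3)
avg_prec = average_precision(ranked, has_phenotype)
print(ranked, round(p_at_3, 3), round(avg_prec, 3))
```

In this toy case two of the three nearest candidates carry the phenotype, so precision-at-3 is 2/3. Average precision additionally rewards placing the annotated patients earlier in the ranking.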