Reporting of demographic data and representativeness in machine learning models using electronic health records

Bibliographic Details
Published in	Journal of the American Medical Informatics Association: JAMIA, Vol. 27, no. 12, pp. 1878-1884
Main Authors	Bozkurt, Selen; Cahan, Eli M; Seneviratne, Martin G; Sun, Ran; Lossio-Ventura, Juan A; Ioannidis, John P A; Hernandez-Boussard, Tina
Format	Journal Article
Language	English
Published	England: Oxford University Press, 09.12.2020
Subjects	Demography; Electronic Health Records; Ethnicity; Female; Humans; Machine Learning; Male; Nutrition Surveys; Research and Applications; Socioeconomic Factors; clinical decision support; bias; transparency; demographic data; machine learning; electronic health record
ISSN	1527-974X, 1067-5027
DOI	10.1093/jamia/ocaa164

Summary:	Objective: The development of machine learning (ML) algorithms to address a variety of issues faced in clinical practice has increased rapidly. However, questions have arisen regarding biases in their development that can affect their applicability in specific populations. We sought to evaluate whether studies developing ML models from electronic health record (EHR) data report sufficient demographic data on the study populations to demonstrate representativeness and reproducibility. Materials and Methods: We searched PubMed for articles applying ML models to improve clinical decision-making using EHR data. We limited our search to papers published between 2015 and 2019. Results: Across the 164 studies reviewed, demographic variables were inconsistently reported and/or included as model inputs. Race/ethnicity was not reported in 64% of studies; gender and age were not reported in 24% and 21% of studies, respectively. Socioeconomic status of the population was not reported in 92% of studies. Studies that mentioned these variables often did not report whether they were included as model inputs. Few models (12%) were validated using external populations. Few studies (17%) open-sourced their code. Populations in the ML studies include higher proportions of White and Black subjects, yet fewer Hispanic subjects, than the general US population. Discussion: The demographic characteristics of study populations are poorly reported in the ML literature based on EHR data. Demographic representativeness in training data and model transparency are necessary to ensure that ML models are deployed in an equitable and reproducible manner. Wider adoption of reporting guidelines is warranted to improve representativeness and reproducibility.