A comparison of various imputation algorithms for missing data

Bibliographic Details
Published in	PloS one Vol. 20; no. 5; p. e0319784
Main Authors	Kampf, Jürgen, Dykun, Iryna, Rassaf, Tienush, Mahabadi, Amir Abbas
Format	Journal Article
Language	English
Published	United States: Public Library of Science (PLoS), 12.05.2025
Subjects	Algorithms; Biology and Life Sciences; Blood pressure; Branches; Cardiovascular disease; Cardiovascular diseases; Comparative analysis; Computation; Computer and Information Sciences; Confidence intervals; Coronary artery disease; Data Interpretation, Statistical; Datasets; Decision trees; Diabetes mellitus; Engineering and Technology; Genetics; Heart diseases; Humans; Logistic Models; Low density lipoprotein; Matching; Medicine and Health Sciences; Methods; Missing data; Missing observations (Statistics); Multiple imputation (Statistics); Physical Sciences; Regression analysis; Regression models; Research and Analysis Methods; Simulation; Software packages; Statistical analysis; Statistics; Subroutines; Variables; Germany
ISSN	1932-6203
DOI	10.1371/journal.pone.0319784

Summary:	Many datasets in medicine and other branches of science are incomplete. In this article we compare various imputation algorithms for missing data. We take the point of view that it has already been decided that the imputation should be carried out using multiple imputation by chained equations, and that the only decision left is the choice of subroutine for the one-dimensional imputations. The subroutines compared are predictive mean matching, weighted predictive mean matching, sampling, classification and regression trees, and random forests. We compare these subroutines on real data and on simulated data. We consider the estimation of expected values, variances, and coefficients of linear regression models, logistic regression models and Cox regression models. As real data we use survival times after the diagnosis of an obstructive coronary artery disease, with systolic blood pressure, LDL, diabetes, smoking behavior and family history of premature heart disease as the variables for which values have to be imputed. While we are mainly interested in statistical properties such as biases, mean squared errors and coverage probabilities of confidence intervals, we also consider computation time. Weighted predictive mean matching had to be excluded from the statistical comparison due to its enormous computation time. Among the remaining algorithms, predictive mean matching performed best in most of the situations we tested. This is by far the largest comparison study for subroutines of multiple imputation by chained equations that has been performed up to now.
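To make the term "subroutine for the one-dimensional imputations" concrete, here is a minimal Python sketch of predictive mean matching for a single incomplete variable. It is not the authors' code and not any package's implementation. The function name `pmm_impute`, its parameters, and the default of `k=5` donors are assumptions chosen for illustration. The sketch also assumes a single fully observed predictor and skips the random draw of regression parameters that full implementations typically perform.

```python
import numpy as np


def pmm_impute(x, y, k=5, rng=None):
    """Fill NaNs in y using a simplified predictive mean matching step.

    x   : fully observed numeric predictor (assumption: one predictor only)
    y   : numeric variable with NaN marking missing values
    k   : number of candidate donors per missing case (illustrative default)
    rng : seed or numpy Generator for the random donor choice
    """
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    missing = np.isnan(y)
    observed = ~missing

    # Least-squares fit of y on (1, x) using the observed cases only.
    design = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(design[observed], y[observed], rcond=None)
    predicted = design @ beta

    donor_predictions = predicted[observed]
    donor_values = y[observed]
    k = min(k, donor_values.size)

    imputed = y.copy()
    for i in np.flatnonzero(missing):
        # Donors are the k observed cases whose predicted means lie
        # closest to the predicted mean of the incomplete case.
        distances = np.abs(donor_predictions - predicted[i])
        nearest = np.argsort(distances)[:k]
        # Impute the observed value of one randomly chosen donor.
        imputed[i] = donor_values[rng.choice(nearest)]
    return imputed


# Toy usage with made-up numbers.
x = np.arange(10.0)
y = 2.0 * x + 1.0
y[[2, 7]] = np.nan
completed = pmm_impute(x, y, k=3, rng=0)
```

Because every imputed value is copied from an observed case, the completed variable never leaves the range of the observed data. In a chained-equations setting, a step like this would be applied to each incomplete variable in turn, and the whole cycle repeated to produce several completed datasets.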
Bibliography:	Competing Interests: Jürgen Kampf and Iryna Dykun declare no conflict of interest. Tienush Rassaf received honoraria, lecture fees, and grant support from Edwards Lifesciences, AstraZeneca, Bayer, Novartis, Berlin Chemie, Daiichi-Sankyo, Boehringer Ingelheim, Novo Nordisk, Cardiac Dimensions, and Pfizer, all unrelated to this work. Amir Mahabadi received honoraria, lecture fees, and/or grant support from Amgen, Daiichi-Sankyo, Edwards Lifesciences, Novartis, and Sanofi, all unrelated to this work. Tienush Rassaf and Amir Mahabadi are co-founders of Mycor GmbH, a company focusing on the development of AI-based ECG algorithms. This does not alter our adherence to PLOS ONE policies on sharing data and materials. There are no patents, products in development or marketed products associated with this research to declare.