Collateral missing value imputation: a new robust missing value estimation algorithm for microarray data

Bibliographic Details
Published in	Bioinformatics Vol. 21; no. 10; pp. 2417 - 2423
Main Authors	Sehgal, Muhammad Shoaib B., Gondal, Iqbal, Dooley, Laurence S.
Format	Journal Article
Language	English
Published	Oxford: Oxford University Press; Oxford Publishing Limited (England), 15.05.2005
Subjects	Algorithms; Biological analysis; Biological and medical sciences; BRCA1 Protein - genetics; BRCA1 Protein - metabolism; BRCA2 Protein - genetics; BRCA2 Protein - metabolism; Data Interpretation, Statistical; Female; Fundamental and applied biological sciences. Psychology; Gene Expression Profiling - methods; General aspects; Humans; Likelihood Functions; Mathematics in biology. Statistical analysis. Models. Metrology. Data processing in biology (general aspects); Models, Biological; Models, Statistical; Oligonucleotide Array Sequence Analysis - methods; Ovarian cancer; Ovarian Neoplasms - genetics; Ovarian Neoplasms - metabolism; Principal components analysis; Saccharomyces cerevisiae Proteins - genetics; Saccharomyces cerevisiae Proteins - metabolism; Sample Size; Statistical analysis; Time series; Yeasts; Data analysis; Yeast; Multiple; Number; Estimation; Use; Biology; Malignant tumor; Microarray; Learning algorithm
ISSN	1367-4803, 0266-7061, 1460-2059, 1367-4811
DOI	10.1093/bioinformatics/bti345

Summary:	Motivation: Microarray data are used in a range of application areas in biology, but they often contain considerable numbers of missing values. These missing values can significantly affect subsequent statistical analysis and machine learning algorithms, so there is a strong motivation to estimate them as accurately as possible before applying these algorithms. While many imputation algorithms have been proposed, more robust techniques need to be developed so that further analysis of biological data can be undertaken accurately. This paper presents a missing value imputation algorithm called collateral missing value estimation (CMVE), which uses multiple covariance-based imputation matrices for the final prediction of missing values. The matrices are computed and optimized using least squares regression and linear programming methods.
Results: The new CMVE algorithm was compared with existing estimation techniques, including Bayesian principal component analysis imputation (BPCA), least square impute (LSImpute) and K-nearest neighbour (KNN). All methods were tested on estimating missing values in three separate non-time series (ovarian cancer based) datasets and one time series (yeast sporulation) dataset. Each method was evaluated quantitatively using the normalized root mean square (NRMS) error measure, across randomly introduced missing value probabilities from 0.01 to 0.2. Experiments were also undertaken on the yeast dataset, which contained 1.7% actual missing values, to test the hypothesis that CMVE performs better not only for randomly occurring missing values but also for a real distribution of missing values. The results confirmed that CMVE consistently demonstrated superior and robust estimation of missing values compared with the other methods for both time series and non-time series data, at the same order of computational complexity. A concise theoretical framework has also been formulated to validate the improved performance of the CMVE algorithm.
Availability: The CMVE software is available upon request from the authors.
Contact: Shoaib.Sehgal@infotech.monash.edu.au
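
The evaluation protocol described in the summary (hide known values at random with a given probability, impute them, then score the estimates with an NRMS error) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' CMVE software: the imputer is a plain row-mean baseline standing in for CMVE, the data are synthetic, and the NRMS normalisation used here (RMS of the true values at the hidden positions) is an assumption, since this record does not give the formula. All function names are hypothetical.

```python
import numpy as np


def mask_at_random(data, rate, rng):
    """Hide each entry independently with probability `rate` (set it to NaN)."""
    mask = rng.random(data.shape) < rate
    masked = data.copy()
    masked[mask] = np.nan
    return masked, mask


def row_mean_impute(masked):
    """Baseline stand-in (NOT CMVE): fill gaps with the mean of the observed
    values in the same row; fall back to the global mean for empty rows."""
    observed = ~np.isnan(masked)
    counts = observed.sum(axis=1)
    sums = np.nansum(masked, axis=1)
    global_mean = np.nansum(masked) / max(observed.sum(), 1)
    row_means = np.where(counts > 0, sums / np.maximum(counts, 1), global_mean)
    filled = masked.copy()
    rows, _ = np.where(~observed)
    filled[~observed] = row_means[rows]
    return filled


def nrms_error(truth, estimate, mask):
    """Assumed NRMS definition: RMS of the estimation error at the hidden
    positions divided by the RMS of the true values at those positions."""
    if not mask.any():
        return 0.0
    diff = truth[mask] - estimate[mask]
    return float(np.sqrt(np.mean(diff ** 2)) / np.sqrt(np.mean(truth[mask] ** 2)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in for an expression matrix (rows: genes, columns: samples).
    data = rng.normal(loc=5.0, scale=1.0, size=(200, 20))
    for rate in (0.01, 0.05, 0.1, 0.2):
        masked, mask = mask_at_random(data, rate, rng)
        estimate = row_mean_impute(masked)
        print(f"rate={rate:.2f}  NRMS={nrms_error(data, estimate, mask):.4f}")
```

Swapping `row_mean_impute` for another imputer with the same signature would allow several methods to be compared on identical masks, which is the kind of comparison the summary reports.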