Is cross-validation valid for small-sample microarray classification?


Bibliographic Details
Published in	Bioinformatics Vol. 20; no. 3; pp. 374 - 380
Main Authors	Braga-Neto, Ulisses M., Dougherty, Edward R.
Format	Journal Article
Language	English
Published	Oxford: Oxford University Press, 12.02.2004; Oxford Publishing Limited (England)
Subjects	Algorithms; Benchmarking - methods; Biological and medical sciences; Breast Neoplasms - diagnosis; Breast Neoplasms - genetics; Classification; Computer Simulation; Fundamental and applied biological sciences. Psychology; Gene Expression Profiling - methods; General aspects; Genetic Predisposition to Disease - genetics; Genetic Testing - methods; Humans; Mathematics in biology. Statistical analysis. Models. Metrology. Data processing in biology (general aspects); Models, Genetic; Models, Statistical; Oligonucleotide Array Sequence Analysis - methods; Pattern Recognition, Automated; Reproducibility of Results; Sample Size; Sensitivity and Specificity; Design; Human; Estimation error; Error estimation; Simulation; Bootstrap; Small sample; Malignant tumor; Result
ISSN	1367-4803, 1367-4811, 1460-2059
DOI	10.1093/bioinformatics/btg419

Summary:	Motivation: Microarray classification typically has two striking attributes: (1) classifier design and error estimation are based on remarkably small samples and (2) cross-validation error estimation is employed in the majority of the papers. It is therefore necessary to have a quantifiable understanding of the behavior of cross-validation in the context of very small samples. Results: An extensive simulation study was performed comparing cross-validation, resubstitution and bootstrap estimation for three popular classification rules, namely linear discriminant analysis, 3-nearest-neighbor and decision trees (CART), using both synthetic and real breast-cancer patient data. The comparison is made via the distribution of differences between the estimated and true errors. Several statistics of this deviation distribution were computed: mean (for estimator bias), variance (for estimator precision), root-mean-square error (combining bias and variance) and quartile ranges, including outlier behavior. In general, while cross-validation error estimation is much less biased than resubstitution, it displays excessive variance, which makes individual estimates unreliable for small samples. Bootstrap methods reduce the variance, but at a high computational cost and often with increased bias (albeit much less than with resubstitution). Availability and Supplementary information: A companion web site can be accessed at the URL http://ee.tamu.edu/~edward/cv_paper. The companion web site contains: (1) the complete set of tables and plots regarding the simulation study; (2) additional figures; (3) a compilation of references for microarray classification studies and (4) the source code used, with full documentation and examples.
Bibliography:	ark:/67375/HXZ-7QTWG52M-X; local:btg419; istex:01F7546A1EF906920B59ED0F2D835FC1ABE778A5; Contact: edward@ee.tamu.edu
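
Illustration (not part of the catalog record): the summary compares error estimators through the distribution of deviations between estimated and true error. The sketch below shows one way such a comparison could be set up. It is not the authors' code. Everything in it is an assumption made for illustration: the two-class Gaussian model, the nearest-mean rule standing in for linear discriminant analysis, the out-of-bag bootstrap variant, the large held-out set used to approximate the true error, and all function names and parameter values.

```python
import numpy as np


def sample(n_per_class, rng, delta=1.0, dim=2):
    """Draw n_per_class points from each of two unit-variance Gaussians (assumed model)."""
    x0 = rng.normal(0.0, 1.0, size=(n_per_class, dim))
    x1 = rng.normal(delta, 1.0, size=(n_per_class, dim))
    return np.vstack([x0, x1]), np.repeat([0, 1], n_per_class)


def fit(X, y):
    """Nearest-mean rule (a simple stand-in for LDA): keep the two class centroids."""
    return X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)


def error(model, X, y):
    """Fraction of points in (X, y) misclassified by the nearest-mean rule."""
    m0, m1 = model
    pred = (np.linalg.norm(X - m1, axis=1) < np.linalg.norm(X - m0, axis=1)).astype(int)
    return float(np.mean(pred != y))


def resubstitution(X, y):
    """Error measured on the same sample the classifier was designed on."""
    return error(fit(X, y), X, y)


def leave_one_out(X, y):
    """Leave-one-out cross-validation error."""
    n = len(y)
    wrong = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        wrong += error(fit(X[keep], y[keep]), X[i:i + 1], y[i:i + 1])
    return wrong / n


def bootstrap_oob(X, y, rng, n_boot=25):
    """Average out-of-bag error over n_boot valid bootstrap resamples."""
    n = len(y)
    errs = []
    while len(errs) < n_boot:
        idx = rng.integers(0, n, size=n)
        oob = np.setdiff1d(np.arange(n), idx)
        # Skip degenerate resamples: nothing left out, or one class missing.
        if len(oob) == 0 or len(np.unique(y[idx])) < 2:
            continue
        errs.append(error(fit(X[idx], y[idx]), X[oob], y[oob]))
    return float(np.mean(errs))


def deviation_stats(n_per_class=10, n_trials=200, seed=0):
    """Bias, variance and RMSE of (estimated error - true error) for each estimator."""
    rng = np.random.default_rng(seed)
    X_test, y_test = sample(5000, rng)  # large held-out set approximates the true error
    dev = {"resubstitution": [], "leave-one-out CV": [], "bootstrap (out-of-bag)": []}
    for _ in range(n_trials):
        X, y = sample(n_per_class, rng)
        true_err = error(fit(X, y), X_test, y_test)
        dev["resubstitution"].append(resubstitution(X, y) - true_err)
        dev["leave-one-out CV"].append(leave_one_out(X, y) - true_err)
        dev["bootstrap (out-of-bag)"].append(bootstrap_oob(X, y, rng) - true_err)
    stats = {}
    for name, d in dev.items():
        d = np.asarray(d)
        stats[name] = {
            "bias": float(d.mean()),
            "variance": float(d.var()),
            "rmse": float(np.sqrt(np.mean(d ** 2))),
        }
    return stats


if __name__ == "__main__":
    for name, s in deviation_stats().items():
        print(name, s)
```

In this sketch the RMSE combines the other two statistics, since the mean squared deviation equals the squared bias plus the variance. Whether a toy setup like this shows the pattern the summary describes depends on the assumed model and sample size.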