Is cross-validation valid for small-sample microarray classification?

Bibliographic Details
Published in: Bioinformatics Vol. 20; no. 3; pp. 374-380
Main Authors: Braga-Neto, Ulisses M.; Dougherty, Edward R.
Format: Journal Article
Language: English
Published: Oxford, Oxford University Press, 12.02.2004
Oxford Publishing Limited (England)
ISSN: 1367-4803, 1367-4811, 1460-2059
DOI: 10.1093/bioinformatics/btg419

Summary:

Motivation: Microarray classification typically possesses two striking attributes: (1) classifier design and error estimation are based on remarkably small samples and (2) cross-validation error estimation is employed in the majority of the papers. Thus, it is necessary to have a quantifiable understanding of the behavior of cross-validation in the context of very small samples.

Results: An extensive simulation study has been performed comparing cross-validation, resubstitution and bootstrap estimation for three popular classification rules, namely linear discriminant analysis, 3-nearest-neighbor and decision trees (CART), using both synthetic and real breast-cancer patient data. Comparison is via the distribution of differences between the estimated and true errors. Various statistics for the deviation distribution have been computed: mean (for estimator bias), variance (for estimator precision), root-mean-square error (for composition of bias and variance) and quartile ranges, including outlier behavior. In general, while cross-validation error estimation is much less biased than resubstitution, it displays excessive variance, which makes individual estimates unreliable for small samples. Bootstrap methods provide improved performance relative to variance, but at a high computational cost and often with increased bias (albeit much less than with resubstitution).

Availability and Supplementary information: A companion web site can be accessed at the URL http://ee.tamu.edu/~edward/cv_paper. The companion web site contains: (1) the complete set of tables and plots regarding the simulation study; (2) additional figures; (3) a compilation of references for microarray classification studies and (4) the source code used, with full documentation and examples.
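
The deviation-distribution comparison described in the summary can be illustrated with a small sketch. This is not the authors' source code: the two-Gaussian model, the sample size, the dimension, the number of repetitions, the choice of leave-one-out as the cross-validation variant, the .632 bootstrap weighting, and every function name (`fit_lda`, `true_error`, `simulate`, and so on) are assumptions made for illustration. Only linear discriminant analysis is covered, and the true error is computed in closed form, which is possible only because the synthetic model is known.

```python
import math
import numpy as np

def fit_lda(X, y):
    """Two-class LDA with pooled covariance and equal priors; returns (w, b)."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    S = ((X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)) / (len(X) - 2)
    S += 1e-6 * np.eye(X.shape[1])  # small ridge keeps the solve stable
    w = np.linalg.solve(S, m1 - m0)
    return w, -0.5 * w @ (m0 + m1)

def error_rate(model, X, y):
    """Fraction of points misclassified by the rule 'label 1 if w.x + b > 0'."""
    w, b = model
    return float(np.mean((X @ w + b > 0).astype(int) != y))

def true_error(model, mu):
    """Exact error under the assumed model: classes N(-mu, I) and N(+mu, I)."""
    w, b = model
    phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    s = np.linalg.norm(w)
    e0 = phi((-(w @ mu) + b) / s)   # class 0 points labeled 1
    e1 = phi(-((w @ mu) + b) / s)   # class 1 points labeled 0
    return 0.5 * (e0 + e1)

def loo_error(X, y):
    """Leave-one-out cross-validation estimate."""
    n, wrong = len(y), 0.0
    for i in range(n):
        keep = np.arange(n) != i
        wrong += error_rate(fit_lda(X[keep], y[keep]), X[i:i + 1], y[i:i + 1])
    return wrong / n

def b632_error(X, y, rng, n_boot=25):
    """.632 bootstrap: 0.368 * resubstitution + 0.632 * out-of-bag error."""
    n, wrong, total = len(y), 0.0, 0
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        oob = np.setdiff1d(np.arange(n), idx)
        if oob.size == 0 or np.bincount(y[idx], minlength=2).min() == 0:
            continue  # skip degenerate resamples (no test points or a missing class)
        wrong += error_rate(fit_lda(X[idx], y[idx]), X[oob], y[oob]) * oob.size
        total += oob.size
    resub = error_rate(fit_lda(X, y), X, y)
    return 0.368 * resub + 0.632 * wrong / total

def simulate(n=20, d=5, delta=0.5, reps=100, seed=0):
    """Collect (estimate - true error) per estimator; n is assumed even."""
    rng = np.random.default_rng(seed)
    mu = np.full(d, delta)
    y = np.repeat([0, 1], n // 2)
    dev = {"resub": [], "loo": [], "b632": []}
    for _ in range(reps):
        X = rng.standard_normal((n, d)) + np.where(y[:, None] == 1, mu, -mu)
        model = fit_lda(X, y)
        truth = true_error(model, mu)
        dev["resub"].append(error_rate(model, X, y) - truth)
        dev["loo"].append(loo_error(X, y) - truth)
        dev["b632"].append(b632_error(X, y, rng) - truth)
    out = {}
    for name, v in dev.items():
        v = np.asarray(v)
        out[name] = {"bias": float(v.mean()), "var": float(v.var()),
                     "rms": math.sqrt(float(np.mean(v ** 2)))}
    return out

for name, s in simulate().items():
    print(f"{name:6s} bias={s['bias']:+.3f} var={s['var']:.4f} rms={s['rms']:.3f}")
```

Each printed row summarizes one deviation distribution by its mean (bias), variance and root-mean-square error, the statistics named in the summary. Quartile ranges could be added with `np.percentile` on the same deviation arrays.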
Contact: edward@ee.tamu.edu