A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification

Bibliographic Details
Published in	BMC bioinformatics Vol. 9; no. 1; p. 319
Main Authors	Statnikov, Alexander; Wang, Lily; Aliferis, Constantin F
Format	Journal Article
Language	English
Published	London; BioMed Central; 22.07.2008; BioMed Central Ltd; BMC
Subjects	Algorithms; Artificial Intelligence; Bioinformatics; Biomarkers, Tumor - analysis; Biomedical and Life Sciences; Cancer; Computational Biology - methods; Computational Biology/Bioinformatics; Computer Appl. in Life Sciences; Databases, Genetic; Decision Trees; Diagnosis; DNA microarrays; Gene expression; Gene Expression Profiling - methods; Humans; Life Sciences; Methods; Microarrays; Neoplasms - genetics; Oligonucleotide Array Sequence Analysis - methods; Pattern Recognition, Automated - methods; Physiological aspects; Random Allocation; Research Article; Validation Studies as Topic; United States; Microarray Gene Expression Data; Classification Performance; Microarray Dataset; Support Vector Machine; Random Forest
ISSN	1471-2105
DOI	10.1186/1471-2105-9-319

More Information
Summary:	Background: Cancer diagnosis and clinical outcome prediction are among the most important emerging applications of gene expression microarray technology, with several molecular signatures on their way toward clinical deployment. Using the most accurate classification algorithms available for microarray gene expression data is critical to developing the best possible molecular signatures for patient care. As suggested by a large body of literature to date, support vector machines can be considered "best of class" algorithms for classification of such data. Recent work, however, suggests that random forest classifiers may outperform support vector machines in this domain. Results: In the present paper we identify methodological biases of prior work comparing random forests and support vector machines and conduct a new rigorous evaluation of the two algorithms that corrects these limitations. Our experiments use 22 diagnostic and prognostic datasets and show that support vector machines outperform random forests, often by a large margin. Our data also underline the importance of sound research design in benchmarking and comparison of bioinformatics algorithms. Conclusion: We found that, both on average and in the majority of microarray datasets, random forests are outperformed by support vector machines, both when no gene selection is performed and when several popular gene selection methods are used.