Estimating prediction error in microarray classification: Modifications of the 0.632+ bootstrap when $n < p$

Bibliographic Details
Published in	Canadian Journal of Statistics, Vol. 41, no. 1, pp. 133–150
Main Authors	Jiang, Wenyu; Chen, Bingshu E.
Format	Journal Article
Language	English
Published	Hoboken, USA: John Wiley & Sons, Inc., 01.03.2013
Subjects	0.632+ bootstrap; Bootstrap; class prediction; cross-validation; feature selection; learning curve; microarray data; prediction error; MSC 2010: Primary 62G09, Secondary 62P10
ISSN	0319-5724 1708-945X
DOI	10.1002/cjs.11158

Summary:	We are interested in estimating prediction error for a classification model built on high-dimensional genomic data when the number of genes (p) greatly exceeds the number of subjects (n). We examine a distance argument supporting the conventional 0.632+ bootstrap proposed for the $n > p$ scenario, modify it for the $n < p$ situation, and develop learning curves to describe how the true prediction error varies with the number of subjects in the training set. The curves are then applied to define adjusted resampling estimates of the prediction error that balance bias and variability. The adjusted resampling methods are proposed as counterparts of the 0.632+ bootstrap when $n < p$, and are found to improve on the 0.632+ bootstrap and other existing methods in the microarray study scenario when the sample size is small and there is some level of differential expression. The Canadian Journal of Statistics 41: 133–150; 2013 © 2012 Statistical Society of Canada
Bibliography:	istex:A62BF7D8FD25D73AFE1C0FC5D8B12FFE96AE5076; ark:/67375/WNG-9ZXL16VK-N; ArticleID:CJS11158