Exploiting the ensemble paradigm for stable feature selection: A case study on high-dimensional genomic data

Bibliographic Details
Published in: Information Fusion, Vol. 35, pp. 132-147
Main Authors: Pes, Barbara; Dessì, Nicoletta; Angioni, Marta
Format: Journal Article
Language: English
Published: Elsevier B.V., 01.05.2017
ISSN: 1566-2535, 1872-6305
DOI: 10.1016/j.inffus.2016.10.001

Summary:
• We discuss the rationale of ensemble feature selection.
• We empirically evaluate the effectiveness of a data perturbation ensemble approach.
• Our study involves both univariate and multivariate selection algorithms.
• Special emphasis is given to the stability of the selected feature subsets.
• Useful insight is gained from the analysis of high-dimensional genomic datasets.

Ensemble classification is a well-established approach that involves fusing the decisions of multiple predictive models. A similar “ensemble logic” has recently been applied to challenging feature selection tasks aimed at identifying the most informative variables (or features) for a given domain of interest. In this work, we discuss the rationale of ensemble feature selection and evaluate the effects and implications of a specific ensemble approach, namely the data perturbation strategy. This strategy combines multiple selectors that use the same core algorithm but are trained on different perturbed versions of the original data. The real potential of this approach, still a matter of debate in the feature selection literature, is investigated here in conjunction with different kinds of core selection algorithms (both univariate and multivariate). In particular, we evaluate the extent to which the ensemble implementation improves the overall performance of the selection process in terms of predictive accuracy and stability (i.e., robustness with respect to changes in the training data). Furthermore, we measure the impact of the ensemble approach on the final selection outcome, i.e., on the composition of the selected feature subsets. The results obtained on ten public genomic benchmarks provide useful insight into both the benefits and the limitations of such an ensemble approach, paving the way for the exploration of new and wider ensemble schemes.
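The data perturbation strategy described in the summary can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record does not say how the data are perturbed, which core selectors are used, how their outputs are combined, or how stability is measured. The sketch assumes bootstrap resampling as the perturbation, a simple univariate score (absolute difference of class means) as the core algorithm, rank summation as the aggregation rule, and the Jaccard index as the subset similarity behind a stability estimate. All function names and the toy data generator are hypothetical.

```python
import random
from statistics import mean


def make_toy_data(n_samples=40, n_features=30, n_informative=3, shift=4.0, seed=0):
    """Hypothetical toy data: the first n_informative features are shifted for class 1."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2  # alternate labels so both classes are equally represented
        row = [rng.gauss(0.0, 1.0) + (shift * label if j < n_informative else 0.0)
               for j in range(n_features)]
        X.append(row)
        y.append(label)
    return X, y


def score_features(X, y):
    """Assumed univariate core selector: absolute difference of class means."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(mean(pos) - mean(neg)))
    return scores


def ranks_from_scores(scores):
    """Convert scores to ranks, where rank 0 is the highest-scoring feature."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0] * len(scores)
    for rank, j in enumerate(order):
        ranks[j] = rank
    return ranks


def bootstrap_sample(X, y, rng):
    """Draw a bootstrap replicate (assumed perturbation) containing both classes."""
    n = len(X)
    while True:
        idx = [rng.randrange(n) for _ in range(n)]
        if {y[i] for i in idx} == {0, 1}:
            return [X[i] for i in idx], [y[i] for i in idx]


def ensemble_select(X, y, k, n_selectors=20, seed=0):
    """Run the same core selector on perturbed data and aggregate by rank sum."""
    rng = random.Random(seed)
    n_features = len(X[0])
    rank_sums = [0] * n_features
    for _ in range(n_selectors):
        Xb, yb = bootstrap_sample(X, y, rng)
        for j, rank in enumerate(ranks_from_scores(score_features(Xb, yb))):
            rank_sums[j] += rank
    # A lower rank sum means the feature was ranked consistently well.
    order = sorted(range(n_features), key=lambda j: rank_sums[j])
    return sorted(order[:k])


def jaccard(a, b):
    """Jaccard index between two feature subsets (assumed similarity measure)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    X, y = make_toy_data()
    print("Selected features:", ensemble_select(X, y, k=3))
```

Under these assumptions, stability could be estimated by running `ensemble_select` on several different subsamples of the training data and averaging the pairwise `jaccard` values of the resulting subsets. A value near 1 would indicate that the selection changes little when the data change.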