Combating the Small Sample Class Imbalance Problem Using Feature Selection


Bibliographic Details
Published in: IEEE Transactions on Knowledge and Data Engineering, Vol. 22, No. 10, pp. 1388-1400
Main Authors: Wasikowski, M.; Xue-wen Chen
Format: Journal Article
Language: English
Published: New York, NY: IEEE, 01.10.2010
IEEE Computer Society
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
ISSN: 1041-4347; 1558-2191
DOI: 10.1109/TKDE.2009.187


Summary: The class imbalance problem is encountered in real-world applications of machine learning and results in a classifier's suboptimal performance. Researchers have rigorously studied the resampling, algorithms, and feature selection approaches to this problem. No systematic studies have been conducted to understand how well these methods combat the class imbalance problem and which of these methods best manage the different challenges posed by imbalanced data sets. In particular, feature selection has rarely been studied outside of text classification problems. Additionally, no studies have looked at the additional problem of learning from small samples. This paper presents a first systematic comparison of the three types of methods developed for imbalanced data classification problems and of seven feature selection metrics evaluated on small sample data sets from different applications. We evaluated the performance of these metrics using area under the receiver operating characteristic (AUC) and area under the precision-recall curve (PRC). We compared each metric on the average performance across all problems and on the likelihood of a metric yielding the best performance on a specific problem. We examined the performance of these metrics inside each problem domain. Finally, we evaluated the efficacy of these metrics to see which perform best across algorithms. Our results showed that signal-to-noise correlation coefficient (S2N) and Feature Assessment by Sliding Thresholds (FAST) are great candidates for feature selection in most applications, especially when selecting very small numbers of features.
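The S2N metric highlighted in the summary can be sketched as follows. This uses the standard signal-to-noise definition for binary classes (difference of per-class feature means divided by the sum of per-class standard deviations); the paper's exact variant may differ, and all data and names here are illustrative.

```python
import numpy as np

def s2n_scores(X, y):
    """Signal-to-noise correlation coefficient for each feature:
    (mean over positives - mean over negatives) /
    (std over positives + std over negatives).
    Standard S2N definition; the paper's exact variant may differ."""
    pos, neg = X[y == 1], X[y == 0]
    num = pos.mean(axis=0) - neg.mean(axis=0)
    den = pos.std(axis=0) + neg.std(axis=0) + 1e-12  # guard against zero variance
    return num / den

# Toy imbalanced, small-sample data: 10 minority vs. 90 majority examples.
# Feature 0 carries signal; feature 1 is pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.array([1] * 10 + [0] * 90)
X[y == 1, 0] += 3.0  # shift the informative feature for the minority class

scores = np.abs(s2n_scores(X, y))
selected = np.argsort(scores)[::-1][:1]  # index of the top-ranked feature
```

Ranking features by |S2N| and keeping only the top few mirrors the small-feature-subset setting the paper evaluates; the informative feature is ranked first here because its class means are well separated relative to the within-class spread.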