A mixture model and EM-based algorithm for class discovery, robust classification, and outlier rejection in mixed labeled/unlabeled data sets

Bibliographic Details
Published in	IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 25, no. 11, pp. 1468-1483
Main Authors	Miller, D.J., Browning, J.
Format	Journal Article
Language	English
Published	New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.11.2003
Subjects	Algorithms; Classification; Classifiers; Labeling; Labels; Learning; Machine learning; Mathematical analysis; Rejection; Robustness; Stochastic processes; Studies; Text categorization; Vectors (mathematics)
ISSN	0162-8828, 1939-3539
DOI	10.1109/TPAMI.2003.1240120

Summary:	Several authors have shown that, when labeled data are scarce, improved classifiers can be built by augmenting the training set with a large set of unlabeled examples and then performing suitable learning. These works assume each unlabeled sample originates from one of the (known) classes. Here, we assume each unlabeled sample comes either from a known class or from a heretofore undiscovered class. We propose a novel mixture model which treats as observed data not only the feature vector and the class label, but also the fact of label presence/absence for each sample. Two types of mixture components are posited. "Predefined" components generate data from known classes and assume class labels are missing at random. "Nonpredefined" components only generate unlabeled data, i.e., they capture exclusively unlabeled subsets, consistent with an outlier distribution or new classes. The predefined/nonpredefined nature of each component is data-driven, learned along with the other parameters via an extension of the EM algorithm. Our modeling framework addresses problems involving both the known and unknown classes: (1) robust classifier design, (2) classification with rejections, and (3) identification of the unlabeled samples (and their components) from unknown classes. The third is a step toward new class discovery. Experiments are reported for each application, including topic discovery for the Reuters domain. Experiments also demonstrate the value of label presence/absence data in learning accurate mixtures.
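
The idea of treating label presence/absence as observed data can be illustrated with a small EM sketch. This is a simplified illustration under stated assumptions, not the authors' algorithm: components are 1-D Gaussians, and each component carries a free probability `miss` that a sample it generates is unlabeled, instead of the learned predefined/nonpredefined type described in the summary. A component whose `miss` estimate approaches 1 then plays the role of a nonpredefined component. The function name `fit_label_aware_mixture`, the parameter names, and the toy data are all hypothetical.

```python
import numpy as np

def fit_label_aware_mixture(x, y, n_components, n_classes, n_iter=200):
    """EM for a 1-D Gaussian mixture that also models label presence/absence.

    x: (N,) float features. y: (N,) int class labels, -1 where the label is missing.
    Assumption: each component k has its own P(label missing | k) = miss[k].
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=int)
    n, eps = x.shape[0], 1e-12
    labeled = y >= 0
    y_safe = np.where(labeled, y, 0)          # dummy class index for unlabeled rows
    y_onehot = np.zeros((n, n_classes))
    y_onehot[labeled, y[labeled]] = 1.0       # unlabeled rows stay all-zero

    # Deterministic initialization: means at evenly spaced quantiles of x.
    mu = np.quantile(x, (np.arange(n_components) + 0.5) / n_components)
    var = np.full(n_components, x.var() + 1e-6)
    alpha = np.full(n_components, 1.0 / n_components)            # mixing weights
    miss = np.full(n_components, 0.5)                            # P(missing | k)
    beta = np.full((n_components, n_classes), 1.0 / n_classes)   # P(class | k, labeled)

    for _ in range(n_iter):
        # E-step: responsibilities use the feature, the label (if present),
        # and the fact that the label is present or absent.
        log_px = -0.5 * (np.log(2 * np.pi * var) + (x[:, None] - mu) ** 2 / var)
        log_lab = np.log(1.0 - miss + eps) + np.log(beta[:, y_safe].T + eps)
        log_unl = np.log(miss + eps)
        log_joint = np.log(alpha + eps) + log_px + np.where(labeled[:, None], log_lab, log_unl)
        log_joint -= log_joint.max(axis=1, keepdims=True)
        resp = np.exp(log_joint)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: weighted maximum-likelihood updates.
        nk = resp.sum(axis=0) + eps
        alpha = nk / n
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        miss = np.clip(resp[~labeled].sum(axis=0) / nk, 0.0, 1.0)
        counts = resp.T @ y_onehot + 1e-3     # small smoothing avoids empty rows
        beta = counts / counts.sum(axis=1, keepdims=True)

    return {"alpha": alpha, "mu": mu, "var": var, "miss": miss, "beta": beta, "resp": resp}

# Toy data: two known classes (about 30% labeled) plus one unknown, fully unlabeled class.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-4, 1, 200), rng.normal(0, 1, 200), rng.normal(7, 1, 200)])
y = np.full(600, -1)
keep_label = rng.random(400) < 0.3
y[:400] = np.where(keep_label, np.repeat([0, 1], 200), -1)

fit = fit_label_aware_mixture(x, y, n_components=3, n_classes=2)
novel = int(np.argmax(fit["miss"]))           # component that captures only unlabeled data
print("P(missing | component):", np.round(fit["miss"], 3))
print("candidate new-class component mean:", round(float(fit["mu"][novel]), 2))
```

In this toy setting, the component fitted to the fully unlabeled cluster is the one whose `miss` estimate is driven toward 1, which is how label absence alone can flag a candidate new class or outlier group. Samples with high responsibility for that component could then be rejected or examined as candidates for a new class.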