The Role of Balanced Training and Testing Data Sets for Binary Classifiers in Bioinformatics

Bibliographic Details
Published in	PloS one Vol. 8; no. 7; p. e67863
Main Authors	Wei, Qiong, Dunbrack, Roland L.
Format	Journal Article
Language	English
Published	United States: Public Library of Science (PLoS), 09.07.2013
Subjects	Accuracy; Algorithms; Analysis; Animals; Artificial Intelligence; Bioinformatics; Biology; Cancer; Classifiers; Computational Biology - methods; Computer Science; Correlation coefficient; Correlation coefficients; Data points; Databases, Genetic; Datasets; Genetic Association Studies; Genomes; Genomics; Genotype & phenotype; Humans; Learning algorithms; Machine learning; Mathematical models; Medical research; Missense mutation; Models, Biological; Mutation; Mutation, Missense; Oversampling; Phenotype; Polymorphism, Genetic; Proteins; Reproducibility of Results; Social and Behavioral Sciences; Teaching methods; Training; United States > US Pennsylvania
ISSN	1932-6203
DOI	10.1371/journal.pone.0067863

More Information
Summary:	Training and testing of conventional machine learning models on binary classification problems depend on the proportions of the two outcomes in the relevant data sets. This may be especially important in practical terms when real-world applications of the classifier are either highly imbalanced or occur in unknown proportions. Intuitively, it may seem sensible to train machine learning models on data similar to the target data in terms of proportions of the two binary outcomes. However, we show that this is not the case using the example of prediction of deleterious and neutral phenotypes of human missense mutations in human genome data, for which the proportion of the binary outcome is unknown. Our results indicate that using balanced training data (50% neutral and 50% deleterious) results in the highest balanced accuracy (the average of True Positive Rate and True Negative Rate), Matthews correlation coefficient, and area under ROC curves, no matter what the proportions of the two phenotypes are in the testing data. Besides balancing the data by undersampling the majority class, other techniques in machine learning include oversampling the minority class, interpolating minority-class data points and various penalties for misclassifying the minority class. However, these techniques are not commonly used in either the missense phenotype prediction problem or in the prediction of disordered residues in proteins, where the imbalance problem is substantial. The appropriate approach depends on the amount of available data and the specific problem at hand.
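
As an illustration of the terms used in the summary, the following is a minimal Python sketch of balancing a data set by randomly undersampling the majority class, together with two of the reported metrics. It is not the authors' code. The function names (`undersample`, `balanced_accuracy`, `mcc`), the 0/1 label encoding, and the confusion-matrix counts in the usage note are assumptions made for this example. Balanced accuracy follows the definition given in the summary (the average of True Positive Rate and True Negative Rate); the Matthews correlation coefficient uses its standard confusion-matrix formula, which the summary does not spell out.

```python
import math
import random


def undersample(labels, seed=0):
    """Return sorted indices of a 50/50 subset of a 0/1 label list.

    Items of the larger class are dropped at random until both
    classes have as many items as the smaller class.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n = min(len(pos), len(neg))
    return sorted(rng.sample(pos, n) + rng.sample(neg, n))


def balanced_accuracy(tp, fn, tn, fp):
    """Average of True Positive Rate and True Negative Rate."""
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return (tpr + tnr) / 2


def mcc(tp, fn, tn, fp):
    """Matthews correlation coefficient; 0.0 if the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

For example, with hypothetical labels `[1] * 10 + [0] * 90`, `undersample` returns 20 indices, 10 from each class. With hypothetical counts of 80 true positives, 20 false negatives, 900 true negatives, and 100 false positives, `balanced_accuracy(80, 20, 900, 100)` averages a True Positive Rate of 0.8 and a True Negative Rate of 0.9.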
Bibliography:	Competing Interests: I have read the journal’s policy and have the following conflicts. I (Roland Dunbrack) have previously served as a guest editor for PLOS ONE. This does not alter our adherence to all the PLOS ONE policies on sharing data and materials. Conceived and designed the experiments: QW RLD. Performed the experiments: QW. Analyzed the data: QW RLD. Contributed reagents/materials/analysis tools: QW. Wrote the paper: QW RLD.