Large-Scale Learning of Structure-Activity Relationships Using a Linear Support Vector Machine and Problem-Specific Metrics

Bibliographic Details
Published in	Journal of Chemical Information and Modeling, Vol. 51, no. 2, pp. 203-213
Main Authors	Hinselmann, Georg; Rosenbaum, Lars; Jahn, Andreas; Fechner, Nikolas; Ostermann, Claude; Zell, Andreas
Format	Journal Article
Language	English
Published	Washington, DC: American Chemical Society, 28.02.2011
Subjects	Algorithmics. Computability. Computer arithmetics | Analytical chemistry | Applied sciences | Artificial Intelligence | Benchmarks | Biological and medical sciences | Chemical Information | Chemistry | Cluster analysis | Computational Biology - methods | Computer science; control theory; systems | Data processing. List processing. Character string processing | Databases, Factual | Drug Evaluation, Preclinical - methods | Exact sciences and technology | General and physical chemistry | General pharmacology | General. Nomenclature, chemical documentation, computer chemistry | Medical sciences | Memory organisation. Data processing | Models, Molecular | Molecular Conformation | Pharmaceutical technology. Pharmaceutical industry | Pharmacology. Drug treatments | Reproducibility of Results | Software | Structure-Activity Relationship | Studies | Theoretical computing | Theory of reactions, general kinetics. Catalysis. Nomenclature, chemical documentation, computer chemistry | Time Factors | User-Computer Interface | Virtualization | High throughput screening | Bayes estimation | Performance evaluation | High performance | Statistical analysis | Tree(graph) | Linear machine | Large scale structure | Virtual screening | Very large databases | Random decision forests | System with n degrees of freedom | Computational chemistry | Structure activity relation | Vector support machine | Linear complexity | Metric | Large scale | Non linear effect | Bayes decision | Artificial intelligence | Comparative study | Turbulence structure | Binary classification
ISSN	1549-9596, 1549-960X
DOI	10.1021/ci100073w

More Information
Summary:	The goal of this study was to adapt a recently proposed large-scale linear support vector machine to large-scale binary cheminformatics classification problems and to assess its performance on various benchmarks using virtual screening performance measures. We extended the large-scale linear support vector machine library LIBLINEAR with state-of-the-art virtual high-throughput screening metrics to train classifiers on whole large and unbalanced data sets. This linear support vector machine formulation performs very well when applied to high-dimensional sparse feature vectors. An additional advantage is that the cost of a prediction is, on average, linear in the number of non-zero features. Nevertheless, the approach assumes that a problem is linearly separable. Therefore, we conducted an extensive benchmark to evaluate the performance on large-scale problems of up to 175,000 samples. To examine the virtual screening performance, we determined the chemotype clusters using Feature Trees and integrated this information to compute weighted AUC-based performance measures and a leave-cluster-out cross-validation. We also considered the BEDROC score, a metric that was suggested to tackle the early enrichment problem. The performance on each problem was evaluated by a nested cross-validation and a nested leave-cluster-out cross-validation. We compared LIBLINEAR against a Naïve Bayes classifier, a random decision forest classifier, and a maximum similarity ranking approach. LIBLINEAR outperformed these reference approaches in a direct comparison. A comparison to literature results showed that LIBLINEAR is competitive but does not match the top-ranked nonlinear machines on these benchmarks. However, considering the overall convincing performance and computation time of the large-scale support vector machine, the approach provides an excellent alternative to established large-scale classification approaches.
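Two of the evaluation ingredients named in the summary, the BEDROC score and leave-cluster-out cross-validation, can be illustrated with a minimal Python sketch. This is not the authors' LIBLINEAR extension. The function names `bedroc` and `leave_cluster_out_splits` are hypothetical, the default `alpha=20.0` is an assumed value that the record does not state, and holding out exactly one cluster per split is one plausible reading of "leave-cluster-out". The BEDROC formula used here is the commonly published normalized form of the robust initial enhancement (RIE).

```python
import math
from collections import defaultdict


def bedroc(labels_ranked, alpha=20.0):
    """BEDROC for 0/1 labels ordered from best- to worst-scored compound.

    alpha controls how strongly early ranks are emphasized
    (20.0 is an assumed default, not taken from the record).
    """
    n_total = len(labels_ranked)
    n_actives = sum(labels_ranked)
    if n_actives == 0 or n_actives == n_total:
        raise ValueError("need at least one active and one inactive")
    ratio = n_actives / n_total

    # Exponentially weighted sum over the 1-based ranks of the actives.
    weighted = sum(
        math.exp(-alpha * rank / n_total)
        for rank, label in enumerate(labels_ranked, start=1)
        if label
    )
    # Average of the same weight over all ranks (uniform random ranking).
    random_mean = (1 - math.exp(-alpha)) / (
        n_total * (math.exp(alpha / n_total) - 1)
    )
    rie = weighted / (n_actives * random_mean)

    # Rescale RIE to [0, 1] using its best-case and worst-case values.
    rie_max = (1 - math.exp(-alpha * ratio)) / (ratio * (1 - math.exp(-alpha)))
    rie_min = (1 - math.exp(alpha * ratio)) / (ratio * (1 - math.exp(alpha)))
    return (rie - rie_min) / (rie_max - rie_min)


def leave_cluster_out_splits(cluster_ids):
    """Yield (held_out_cluster, train_indices, test_indices) per cluster.

    Assumption: each split holds out exactly one chemotype cluster.
    """
    members = defaultdict(list)
    for index, cluster in enumerate(cluster_ids):
        members[cluster].append(index)
    for cluster, test in members.items():
        held_out = set(test)
        train = [i for i in range(len(cluster_ids)) if i not in held_out]
        yield cluster, train, test
```

With this sketch, a ranking that places all actives first scores 1 and a ranking that places them last scores 0. Because every compound of a held-out cluster is excluded from training, each test fold measures performance on a chemotype the classifier has not seen.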