In Silico Target Predictions: Defining a Benchmarking Data Set and Comparison of Performance of the Multiclass Naïve Bayes and Parzen-Rosenblatt Window

Bibliographic Details
Published in	Journal of Chemical Information and Modeling, Vol. 53, no. 8, pp. 1957-1966
Main Authors	Koutsoukas, Alexios; Lowe, Robert; KalantarMotamedi, Yasaman; Mussa, Hamse Y.; Klaffke, Werner; Mitchell, John B. O.; Glen, Robert C.; Bender, Andreas
Format	Journal Article
Language	English
Published	Washington, DC: American Chemical Society, 26.08.2013
Subjects	Algorithms | Analytical chemistry | Applied sciences | Artificial intelligence | Bayes Theorem | Benchmarking | Benchmarks | Biological and medical sciences | Chemical compounds | Chemistry | Computational Biology - methods | Computer science; control theory; systems | Data processing. List processing. Character string processing | Drug Discovery | Exact sciences and technology | General and physical chemistry | General pharmacology | General. Nomenclature, chemical documentation, computer chemistry | Humans | Ligands | Medical sciences | Memory organisation. Data processing | Molecules | Pharmacokinetics. Pharmacogenetics. Drug-receptor interactions | Pharmacology. Drug treatments | Protein Binding | Proteins | Proteins - metabolism | Reproducibility of Results | Software | Theory of reactions, general kinetics. Catalysis. Nomenclature, chemical documentation, computer chemistry | Bayes estimation | Performance evaluation | Ligand | Diversity | Virtual screening | Very large databases | Parzen classification | Molecular connectivity | Experimental study | Modeling | Biological activity | Laplacian | Protein | Computational chemistry | Structure activity relation | Supervised learning | Biopolymer | Set covering | Probability learning | Resistive transition | Bioinformatics | Comparative study | Binary classification
ISSN	1549-9596, 1549-960X
DOI	10.1021/ci300435j

Summary:	In this study, two probabilistic machine-learning algorithms were compared for in silico target prediction of bioactive molecules, namely the well-established Laplacian-modified Naïve Bayes classifier (NB) and the more recently introduced (to Cheminformatics) Parzen-Rosenblatt Window. Both classifiers were trained in conjunction with circular fingerprints on a large data set of bioactive compounds extracted from ChEMBL, covering 894 human protein targets with more than 155,000 ligand-protein pairs. This data set is also provided as a benchmark data set for future target prediction methods due to its size as well as the number of bioactivity classes it contains. In addition to evaluating the methods, different performance measures were explored. This is not as straightforward as in binary classification settings, due to the number of classes, the possibility of multiple class memberships, and the need to translate model scores into "yes/no" predictions for assessing model performance. Both algorithms achieved a recall of correct targets that exceeds 80% in the top 1% of predictions. Performance depends significantly on the underlying diversity and size of a given class of bioactive compounds, with small classes and low structural similarity affecting both algorithms to different degrees. When tested on an external test set extracted from WOMBAT covering more than 500 targets by excluding all compounds with Tanimoto similarity above 0.8 to compounds from the ChEMBL data set, the current methodologies achieved a recall of 63.3% and 66.6% among the top 1% for Naïve Bayes and Parzen-Rosenblatt Window, respectively. While those numbers seem to indicate lower performance, they are also more realistic for settings where protein targets need to be established for novel chemical substances.
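
The "recall in the top 1% of predictions" measure mentioned in the summary can be illustrated with a minimal Python sketch. It assumes that each compound receives one score per target, that targets are ranked by descending score, and that recall is pooled over all compound-target pairs; the article's exact definition may differ. The names `cutoff_size` and `top_fraction_recall` and the toy scores are hypothetical and used for illustration only.

```python
import math


def cutoff_size(n_targets, fraction=0.01):
    """Number of top-ranked targets kept for a given fraction (at least 1)."""
    return max(1, math.ceil(fraction * n_targets))


def top_fraction_recall(scores, true_targets, fraction=0.01):
    """Pooled recall of known targets within the top fraction of ranked targets.

    scores: list of dicts mapping target name -> model score, one per compound.
    true_targets: list of sets holding the known targets of each compound.
    Assumption: recall is pooled over all compound-target pairs.
    """
    hits = 0
    total = 0
    for compound_scores, truths in zip(scores, true_targets):
        k = cutoff_size(len(compound_scores), fraction)
        # Rank targets from highest to lowest score and keep the first k.
        ranked = sorted(compound_scores, key=compound_scores.get, reverse=True)
        top_k = set(ranked[:k])
        hits += len(top_k & truths)
        total += len(truths)
    return hits / total if total else 0.0


# With 894 target classes, the top 1% corresponds to the 9 best-ranked targets.
print(cutoff_size(894))

# Toy example with 4 targets and a 50% cutoff (made-up scores).
toy_scores = [
    {"A": 0.9, "B": 0.1, "C": 0.5, "D": 0.2},
    {"A": 0.1, "B": 0.8, "C": 0.7, "D": 0.3},
]
toy_truths = [{"A", "D"}, {"B"}]
print(top_fraction_recall(toy_scores, toy_truths, fraction=0.5))
```

Because a compound can belong to several target classes, the sketch counts every known compound-target pair separately rather than scoring one label per compound, which is one way to handle the multiple class memberships noted in the summary.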