Convolutional neural network based on SMILES representation of compounds for detecting chemical motif

Bibliographic Details
Published in	BMC bioinformatics Vol. 19; no. Suppl 19; pp. 526 - 94
Main Authors	Hirohara, Maya; Saito, Yutaka; Koda, Yuki; Sato, Kengo; Sakakibara, Yasubumi
Format	Journal Article
Language	English
Published	London, BioMed Central, 31.12.2018 (BioMed Central Ltd; Springer Nature B.V; BMC)
Subjects	Algorithms; Analytical chemistry; Artificial intelligence; Artificial neural networks; Binding sites; Bioinformatics; Biomedical and Life Sciences; Chemical compound; Chemical compounds; Chirality; Computational Biology/Bioinformatics; Computer Appl. in Life Sciences; Convolution; Convolutional neural network; Deoxyribonucleic acid; DNA; Drug discovery; Fingerprints; Functional groups; Genomes; Information processing; Lead compounds; Life Sciences; Machine learning; Methods; Microarrays; Multivariate analysis; Neural networks; Organic chemistry; Performance evaluation; Proteins; Representations; SMILES; Software; Source code; TOX 21 Challenge; Virtual screening; Japan
ISSN	1471-2105
DOI	10.1186/s12859-018-2523-5
Summary:	Background: Previous studies have suggested that deep learning is a highly effective approach for screening lead compounds for new drugs. Several deep learning models have been developed using various kinds of fingerprints and graph convolution architectures. However, each of these methods has advantages and disadvantages depending on whether it (1) can distinguish structural differences, including the chirality of compounds, and (2) can automatically discover effective features. Results: We developed another deep learning model for compound classification. In this method, we constructed a distributed representation of compounds based on the SMILES notation, which represents a compound structure as a linear string, and applied the SMILES-based representation to a convolutional neural network (CNN). The use of SMILES allows us to process all types of compounds while incorporating a broad range of structure information, and representation learning by the CNN automatically acquires a low-dimensional representation of the input features. In a benchmark experiment using the TOX 21 dataset, our method outperformed conventional fingerprint methods and performed comparably to the winning model of the TOX 21 Challenge. Multivariate analysis confirmed that the chemical space formed by the features learned through SMILES-based representation learning was a richer feature space that enabled the accurate discrimination of compounds. Motif detection with the learned filters identified not only important known structures (motifs) such as protein-binding sites but also structures of unknown functional groups. Conclusions: The source code of our SMILES-based convolutional neural network software, implemented in the deep learning framework Chainer, is available at http://www.dna.bio.keio.ac.jp/smiles/, and the dataset used for performance evaluation in this work is available at the same URL.
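Illustration:	The pipeline described in the summary (SMILES string, then a per-symbol feature matrix, then convolution filters whose strongest responses point to motifs) can be sketched roughly as follows. This is a minimal sketch and not the authors' Chainer implementation: the character vocabulary `VOCAB`, the one-hot encoding, the padding length, the filter shapes, and the function names `encode_smiles` and `conv1d_global_max` are all assumptions made for illustration, and the filters here are random rather than learned.

```python
import numpy as np

# Hypothetical character-level vocabulary (an assumption; this record does not
# describe the per-symbol features used in the paper). "@" is kept so that
# chirality marks in a SMILES string are not lost.
VOCAB = list("CNOSFclnos()[]=#@+-123456789HBrI")


def encode_smiles(smiles, max_len=40):
    """One-hot encode a SMILES string, zero-padded to max_len positions."""
    index = {ch: i for i, ch in enumerate(VOCAB)}
    matrix = np.zeros((max_len, len(VOCAB)), dtype=np.float32)
    for pos, ch in enumerate(smiles[:max_len]):
        if ch in index:  # symbols outside VOCAB stay all-zero
            matrix[pos, index[ch]] = 1.0
    return matrix


def conv1d_global_max(matrix, filters):
    """Slide each filter along the string, apply ReLU, then global max pooling.

    filters has shape (n_filters, width, n_symbols). Returns the pooled
    feature per filter and the window start position that produced it.
    """
    n_filters, width, _ = filters.shape
    n_windows = matrix.shape[0] - width + 1
    activations = np.empty((n_filters, n_windows), dtype=np.float32)
    for start in range(n_windows):
        window = matrix[start:start + width]
        # Dot product of every filter with the current window.
        activations[:, start] = np.tensordot(filters, window, axes=([1, 2], [0, 1]))
    activations = np.maximum(activations, 0.0)
    return activations.max(axis=1), activations.argmax(axis=1)


# Random filters stand in for learned ones (assumption for illustration).
rng = np.random.default_rng(0)
filters = rng.normal(size=(4, 3, len(VOCAB))).astype(np.float32)

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin
x = encode_smiles(smiles)
features, positions = conv1d_global_max(x, filters)

# The substring under each filter's strongest response is a motif candidate.
candidates = [smiles[p:p + 3] for p in positions]
```

With trained filters, the pooled `features` would feed a classifier, and substrings such as those in `candidates` are the kind of fragment a motif-detection step could inspect. Because "@" is part of the encoding, two SMILES strings that differ only in chirality marks produce different input matrices.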