Molecular contrastive learning of representations via graph neural networks
Published in | Nature Machine Intelligence, Vol. 4, No. 3, pp. 279–287
Main Authors | Yuyang Wang, Jianren Wang, Zhonglin Cao, Amir Barati Farimani
Format | Journal Article
Language | English
Published | London: Nature Publishing Group UK (Springer Nature), 01.03.2022
ISSN | 2522-5839
DOI | 10.1038/s42256-022-00447-x
Summary: Molecular machine learning bears promise for efficient molecular property prediction and drug discovery. However, labelled molecule data can be expensive and time-consuming to acquire. With limited labelled data, it is a great challenge for supervised machine-learning models to generalize across the vast chemical space. Here we present MolCLR (Molecular Contrastive Learning of Representations via Graph Neural Networks), a self-supervised learning framework that leverages large unlabelled data (~10 million unique molecules). In MolCLR pre-training, we build molecule graphs and develop graph-neural-network encoders to learn differentiable representations. Three molecule graph augmentations are proposed: atom masking, bond deletion and subgraph removal. A contrastive estimator maximizes the agreement of augmentations from the same molecule while minimizing the agreement of different molecules. Experiments show that our contrastive learning framework significantly improves the performance of graph-neural-network encoders on various molecular property benchmarks including both classification and regression tasks. Benefiting from pre-training on the large unlabelled database, MolCLR achieves state-of-the-art performance on several challenging benchmarks after fine-tuning. In addition, further investigations demonstrate that MolCLR learns to embed molecules into representations that can distinguish chemically reasonable molecular similarities.
Molecular representations are hard to design due to the large size of the chemical space, the amount of potentially important information in a molecular structure and the relatively low number of annotated molecules. Still, the quality of these representations is vital for computational models trying to predict molecular properties. Wang et al. present a contrastive learning approach to provide differentiable representations from unlabelled data.
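To make the pre-training objective described in the summary concrete, the sketch below shows, in plain PyTorch, one of the three proposed augmentations (atom masking on a node-feature matrix) and an NT-Xent-style contrastive loss that maximizes agreement between two augmented views of the same molecule while treating the other molecules in a batch as negatives. This is a minimal illustration under assumed function names and a toy graph representation, not the authors' implementation; MolCLR itself applies such an objective to graph-neural-network embeddings of molecule graphs.

```python
# Minimal sketch (not the authors' code) of two ingredients described in the
# abstract: an atom-masking augmentation and an NT-Xent-style contrastive loss
# that pulls two augmentations of the same molecule together while pushing
# different molecules apart. Names and the toy representation are illustrative.
import torch
import torch.nn.functional as F


def atom_mask(node_feats: torch.Tensor, mask_rate: float = 0.25) -> torch.Tensor:
    """Randomly zero out (mask) a fraction of atom feature vectors."""
    n_atoms = node_feats.size(0)
    n_mask = max(1, int(mask_rate * n_atoms))
    idx = torch.randperm(n_atoms)[:n_mask]
    out = node_feats.clone()
    out[idx] = 0.0  # masked atoms carry no features
    return out


def nt_xent_loss(z_i: torch.Tensor, z_j: torch.Tensor,
                 temperature: float = 0.1) -> torch.Tensor:
    """NT-Xent contrastive loss over a batch of paired embeddings.

    z_i[k] and z_j[k] are embeddings of two augmented views of molecule k;
    every other embedding in the batch is treated as a negative.
    """
    batch = z_i.size(0)
    z = F.normalize(torch.cat([z_i, z_j], dim=0), dim=1)  # (2B, d), unit norm
    sim = z @ z.t() / temperature                          # cosine similarities
    sim.fill_diagonal_(float("-inf"))                      # ignore self-similarity
    # The positive for row k is row k + B (and vice versa).
    targets = torch.cat([torch.arange(batch) + batch, torch.arange(batch)])
    return F.cross_entropy(sim, targets)


if __name__ == "__main__":
    # Toy usage: pretend a GNN encoder already produced embeddings of two
    # augmented views for a batch of 8 molecules.
    torch.manual_seed(0)
    z_view1 = torch.randn(8, 128)
    z_view2 = z_view1 + 0.05 * torch.randn(8, 128)  # correlated second view
    print(float(nt_xent_loss(z_view1, z_view2)))
```

In the full framework, the two views would come from applying augmentations such as atom masking, bond deletion or subgraph removal to the same molecule graph before encoding; the loss above is the generic SimCLR-style estimator that the abstract's "maximize agreement" description corresponds to.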
Funding | USDOE Advanced Research Projects Agency - Energy (ARPA-E) (AR0001221)