Molecular contrastive learning of representations via graph neural networks
Published in | Nature Machine Intelligence, Vol. 4, No. 3, pp. 279–287
Main Authors | Yuyang Wang, Jianren Wang, Zhonglin Cao, Amir Barati Farimani
Format | Journal Article
Language | English
Published | London: Nature Publishing Group UK (Springer Nature), 01.03.2022
ISSN | 2522-5839
DOI | 10.1038/s42256-022-00447-x
Summary: Molecular machine learning bears promise for efficient molecular property prediction and drug discovery. However, labelled molecule data can be expensive and time-consuming to acquire. With limited labelled data, it is a great challenge for supervised machine-learning models to generalize across the vast chemical space. Here we present MolCLR (Molecular Contrastive Learning of Representations via Graph Neural Networks), a self-supervised learning framework that leverages large unlabelled data (~10 million unique molecules). In MolCLR pre-training, we build molecule graphs and develop graph-neural-network encoders to learn differentiable representations. Three molecule graph augmentations are proposed: atom masking, bond deletion and subgraph removal. A contrastive estimator maximizes the agreement of augmentations from the same molecule while minimizing the agreement of different molecules. Experiments show that our contrastive learning framework significantly improves the performance of graph-neural-network encoders on various molecular property benchmarks including both classification and regression tasks. Benefiting from pre-training on the large unlabelled database, MolCLR achieves state-of-the-art performance on several challenging benchmarks after fine-tuning. In addition, further investigations demonstrate that MolCLR learns to embed molecules into representations that can distinguish chemically reasonable molecular similarities.
Molecular representations are hard to design due to the large size of the chemical space, the amount of potentially important information in a molecular structure and the relatively low number of annotated molecules. Still, the quality of these representations is vital for computational models trying to predict molecular properties. Wang et al. present a contrastive learning approach to provide differentiable representations from unlabelled data.
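To make the pre-training objective described in the summary concrete, the sketch below shows, in plain PyTorch, one of the three proposed augmentations (atom masking on a node-feature matrix) and an NT-Xent-style contrastive loss that maximizes agreement between two augmented views of the same molecule while treating the other molecules in a batch as negatives. This is a minimal illustration under assumed function names and a toy graph representation, not the authors' implementation; MolCLR itself applies such an objective to graph-neural-network embeddings of molecule graphs.

```python
# Minimal sketch (not the authors' code) of two ingredients described in the
# abstract: an atom-masking augmentation and an NT-Xent-style contrastive loss
# that pulls two augmentations of the same molecule together while pushing
# different molecules apart. Names and the toy representation are illustrative.
import torch
import torch.nn.functional as F


def atom_mask(node_feats: torch.Tensor, mask_rate: float = 0.25) -> torch.Tensor:
    """Randomly zero out (mask) a fraction of atom feature vectors."""
    n_atoms = node_feats.size(0)
    n_mask = max(1, int(mask_rate * n_atoms))
    idx = torch.randperm(n_atoms)[:n_mask]
    out = node_feats.clone()
    out[idx] = 0.0  # masked atoms carry no features
    return out


def nt_xent_loss(z_i: torch.Tensor, z_j: torch.Tensor,
                 temperature: float = 0.1) -> torch.Tensor:
    """NT-Xent contrastive loss over a batch of paired embeddings.

    z_i[k] and z_j[k] are embeddings of two augmented views of molecule k;
    every other embedding in the batch is treated as a negative.
    """
    batch = z_i.size(0)
    z = F.normalize(torch.cat([z_i, z_j], dim=0), dim=1)  # (2B, d), unit norm
    sim = z @ z.t() / temperature                          # cosine similarities
    sim.fill_diagonal_(float("-inf"))                      # ignore self-similarity
    # The positive for row k is row k + B (and vice versa).
    targets = torch.cat([torch.arange(batch) + batch, torch.arange(batch)])
    return F.cross_entropy(sim, targets)


if __name__ == "__main__":
    # Toy usage: pretend a GNN encoder already produced embeddings of two
    # augmented views for a batch of 8 molecules.
    torch.manual_seed(0)
    z_view1 = torch.randn(8, 128)
    z_view2 = z_view1 + 0.05 * torch.randn(8, 128)  # correlated second view
    print(float(nt_xent_loss(z_view1, z_view2)))
```

In the full framework, the two views would come from applying augmentations such as atom masking, bond deletion or subgraph removal to the same molecule graph before encoding; the loss above is the generic SimCLR-style estimator that the abstract's "maximize agreement" description corresponds to.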
Funding | USDOE Advanced Research Projects Agency - Energy (ARPA-E) (AR0001221)