COLLAPSE: A representation learning framework for identification and characterization of protein structural sites


Bibliographic Details
Published in	Protein Science, Vol. 32, no. 2, article e4541
Main Authors	Derry, Alexander; Altman, Russ B.
Format	Journal Article
Language	English
Published	Hoboken, USA: John Wiley & Sons, Inc; Wiley Subscription Services, Inc; 01.02.2023
Subjects	Benchmarks; Collapse; Computer applications; Datasets; deep learning; functional site annotation; Health risks; Learning; Mutation; Protein Conformation; Protein interaction; protein structure analysis; Proteins; Proteins - chemistry; representation learning; Representations; Software; Structural analysis; structural informatics; Structure-function relationships; Tools for Protein Science; Transfer learning
ISSN	0961-8368; 1469-896X
DOI	10.1002/pro.4541

Summary:	The identification and characterization of the structural sites which contribute to protein function are crucial for understanding biological mechanisms, evaluating disease risk, and developing targeted therapies. However, the quantity of known protein structures is rapidly outpacing our ability to functionally annotate them. Existing methods for function prediction either do not operate on local sites, suffer from high false positive or false negative rates, or require large site‐specific training datasets, necessitating the development of new computational methods for annotating functional sites at scale. We present COLLAPSE (Compressed Latents Learned from Aligned Protein Structural Environments), a framework for learning deep representations of protein sites. COLLAPSE operates directly on the 3D positions of atoms surrounding a site and uses evolutionary relationships between homologous proteins as a self‐supervision signal, enabling learned embeddings to implicitly capture structure–function relationships within each site. Our representations generalize across disparate tasks in a transfer learning context, achieving state‐of‐the‐art performance on standardized benchmarks (protein–protein interactions and mutation stability) and on the prediction of functional sites from the Prosite database. We use COLLAPSE to search for similar sites across large protein datasets and to annotate proteins based on a database of known functional sites. These methods demonstrate that COLLAPSE is computationally efficient, tunable, and interpretable, providing a general‐purpose platform for computational protein analysis.
Bibliography:	Funding information: Chan Zuckerberg Initiative; National Institutes of Health, Grant/Award Number: GM102365; U.S. National Library of Medicine, Grant/Award Number: LM012409. Review Editor: Nir Ben‐Tal.