Prop3D: A flexible, Python-based platform for machine learning with protein structural properties and biophysical data

Bibliographic Details
Published in	BMC bioinformatics Vol. 25; no. 1; pp. 11 - 24
Main Authors	Draizen, Eli J., Readey, John, Mura, Cameron, Bourne, Philip E.
Format	Journal Article
Language	English
Published	London: BioMed Central Ltd (BMC; Springer Nature B.V.), 04.01.2024
Subjects	Accessibility; Algorithms; Analysis; Bioinformatics; Biomedical and Life Sciences; Cloud computing; Computational Biology/Bioinformatics; Computer Appl. in Life Sciences; Data integrity; Datasets; Deep learning; Electrostatic properties; Electrostatics; Evaluation; Evolution; Kinases; Learning algorithms; Life Sciences; Machine learning; Massively parallel workflows; Microarrays; Phylogeny; Pipelines; Prediction models; Protein structure; Proteins; Software; Structural bioinformatics; Structure-function relationships; Three-dimensional display systems; Workflow; United States
ISSN	1471-2105
DOI	10.1186/s12859-023-05586-5

Summary:	Background: Machine learning (ML) has a rich history in structural bioinformatics, and modern approaches, such as deep learning, are revolutionizing our knowledge of the subtle relationships between biomolecular sequence, structure, function, dynamics and evolution. As with any advance that rests upon statistical learning approaches, the recent progress in biomolecular sciences is enabled by the availability of vast volumes of sufficiently-variable data. To be useful, such data must be well-structured, machine-readable, intelligible and manipulable. These and related requirements pose challenges that become especially acute at the computational scales typical in ML. Furthermore, in structural bioinformatics such data generally relate to protein three-dimensional (3D) structures, which are inherently more complex than sequence-based data. A significant and recurring challenge concerns the creation of large, high-quality, openly-accessible datasets that can be used for specific training and benchmarking tasks in ML pipelines for predictive modeling projects, along with reproducible splits for training and testing.
Results: Here, we report 'Prop3D', a platform that allows for the creation, sharing and extensible reuse of libraries of protein domains, featurized with biophysical and evolutionary properties that can range from detailed, atomically-resolved physicochemical quantities (e.g., electrostatics) to coarser, residue-level features (e.g., phylogenetic conservation). As a community resource, we also supply a 'Prop3D-20sf' protein dataset, obtained by applying our approach to CATH. We have developed and deployed the Prop3D framework, both in the cloud and on local HPC resources, to systematically and reproducibly create comprehensive datasets via the Highly Scalable Data Service (HSDS). Our datasets are freely accessible via a public HSDS instance, or they can be used with accompanying Python wrappers for popular ML frameworks.
Conclusion: Prop3D and its associated Prop3D-20sf dataset can be of broad utility in at least three ways. Firstly, the Prop3D workflow code can be customized and deployed on various cloud-based compute platforms, with scalability achieved largely by saving the results to distributed HDF5 files via HSDS. Secondly, the linked Prop3D-20sf dataset provides a hand-crafted, already-featurized dataset of protein domains for 20 highly-populated CATH families; importantly, provision of this pre-computed resource can aid the more efficient development (and reproducible deployment) of ML pipelines. Thirdly, Prop3D-20sf's construction explicitly takes into account (in creating datasets and data-splits) the enigma of 'data leakage', stemming from the evolutionary relationships between proteins.
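The 'data leakage' concern in the summary can be illustrated with a generic group-aware split: related proteins are kept together on one side of the train/test boundary, so that no test example has a close relative in the training set. The sketch below is not the Prop3D procedure, which this record does not describe; the function name `split_by_group`, the domain identifiers and the cluster labels are all hypothetical, and the grouping is assumed to come from some prior clustering of evolutionarily related domains.

```python
import random
from collections import defaultdict


def split_by_group(domain_to_group, test_fraction=0.2, seed=0):
    """Assign whole groups to train or test so that no group spans both.

    domain_to_group maps a (hypothetical) domain ID to the label of the
    cluster of related domains it belongs to.
    """
    # Collect the members of each group.
    groups = defaultdict(list)
    for domain, group in domain_to_group.items():
        groups[group].append(domain)

    # Shuffle group labels reproducibly, then hold out a fraction of groups.
    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)
    n_test = max(1, round(len(group_ids) * test_fraction))
    test_groups = set(group_ids[:n_test])

    train = sorted(d for g in group_ids if g not in test_groups for d in groups[g])
    test = sorted(d for g in test_groups for d in groups[g])
    return train, test


# Hypothetical identifiers and cluster labels, for illustration only.
example = {
    "domA": "cluster1", "domB": "cluster1",
    "domC": "cluster2",
    "domD": "cluster3", "domE": "cluster3",
    "domF": "cluster4",
}
train, test = split_by_group(example, test_fraction=0.25, seed=42)
```

Splitting at the level of groups rather than individual domains is the design choice that matters here: a random per-domain split could place `domA` in training and its relative `domB` in testing, inflating benchmark scores. Fixing the seed keeps the split reproducible.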