MolPipeline: A python package for processing molecules with RDKit in scikit-learn

Bibliographic Details
Published in	ChemRxiv
Main Authors	Sieg, Jochen; Feldmann, Christian Wolfgang; Hemmerich, Jennifer; Stork, Conrad; Sandfort, Frederik; Eiden, Philipp; Mathea, Miriam
Format	Paper
Language	English
Published	Washington: American Chemical Society, 19.04.2024
Edition	1
Subjects	Algorithms; Chemistry; Chemoinformatics - Computational Chemistry; Data processing; Machine Learning; Pipelines; Theoretical and Computational Chemistry; scikit-learn; RDKit; Python
ISSN	2573-2293
DOI	10.26434/chemrxiv-2024-kd11b

More Information
Summary:	The open-source package scikit-learn provides various machine learning algorithms and data processing tools, including the Pipeline class, which allows users to prepend custom data transformation steps to the machine learning model. We introduce the MolPipeline package, which extends this concept to cheminformatics by wrapping default functionalities of RDKit, such as reading and writing SMILES strings or calculating molecular descriptors from a molecule object. We aimed to build an easy-to-use Python package for creating fully automated end-to-end pipelines that scale to large data sets. Particular emphasis was put on handling erroneous instances, whose resolution would require manual intervention in default pipelines. In addition, we included common cheminformatics tasks, like scaffold splits and molecular standardization, natively in the pipeline framework and made them adaptable to the needs of various projects.
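
The summary's point about erroneous instances can be illustrated with a small, dependency-free sketch. It is not MolPipeline's API and uses neither RDKit nor scikit-learn: every name below (`InvalidInstance`, `run_pipeline`, `toy_parse`, `toy_featurize`, `fill_invalid`) is hypothetical, and `toy_parse` is only a stand-in for real SMILES parsing that checks whether parentheses balance. The idea shown is that a failing input is replaced by a placeholder instead of aborting the run, so the output stays aligned with the input.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class InvalidInstance:
    """Hypothetical placeholder recording which step rejected an input and why."""
    step: str
    message: str


Step = Tuple[str, Callable[[Any], Any]]


def run_pipeline(steps: Sequence[Step], inputs: Sequence[Any]) -> List[Any]:
    """Apply each step in order; a failing input becomes an InvalidInstance."""
    results = []
    for value in inputs:
        for name, func in steps:
            try:
                value = func(value)
            except Exception as exc:
                # Record the failure and skip the remaining steps for this input.
                value = InvalidInstance(name, str(exc))
                break
        results.append(value)
    return results


def toy_parse(text: str) -> str:
    """Stand-in for SMILES parsing: only checks that parentheses balance."""
    if not text:
        raise ValueError("empty input")
    depth = 0
    for char in text:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
        if depth < 0:
            raise ValueError("closing parenthesis without an opening one")
    if depth != 0:
        raise ValueError("unclosed parenthesis")
    return text


def toy_featurize(text: str) -> List[int]:
    """Toy features (not real descriptors): string length and branch count."""
    return [len(text), text.count("(")]


def fill_invalid(results: Sequence[Any], fill: Any = None) -> List[Any]:
    """Replace placeholders with a fill value so output rows match input rows."""
    return [fill if isinstance(r, InvalidInstance) else r for r in results]


steps = [("parse", toy_parse), ("featurize", toy_featurize)]
raw = run_pipeline(steps, ["CCO", "CC(C", "CC(=O)O"])
features = fill_invalid(raw)
```

Here the second input fails in the `parse` step, so `raw[1]` holds an `InvalidInstance` and `features[1]` is the fill value, while the other two rows carry their toy features. A real pipeline would presumably do the parsing with RDKit and pass the features to a scikit-learn estimator.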