RPSubAlign: a novel sequence-based molecular representation method for retrosynthesis prediction with improved validity and robustness

Bibliographic Details
Published in Briefings in Bioinformatics Vol. 26; no. 3
Main Authors Hu, Yuting; Hu, Feng; Zhang, Hongwen; Xu, Hongling; Gao, Jixiang; Deng, Wenshuai; Tian, Zijing; Hu, Qiaoyu; Li, Honglin; Diao, Yanyan
Format Journal Article
Language English
Published England Oxford University Press 01.05.2025
Oxford Publishing Limited (England)
ISSN 1467-5463
1477-4054
DOI 10.1093/bib/bbaf257

Summary: Retrosynthetic route planning is essential for designing efficient pathways to synthesize complex molecules, serving as a cornerstone in drug discovery and organic synthesis. Sequence-based models have become a predominant approach in retrosynthetic route planning, yet their validity and robustness remain limited by challenges in molecular representation. Current methods typically treat reactants and products as independent molecules, overlooking structural relationships that are crucial for accurate synthesis predictions. Herein, we introduce RPSubAlign, a molecular sequence representation method specifically tailored for retrosynthetic tasks, which aligns common substructures between reactants and products to enhance the validity and robustness of sequence-based models. Compared with conventional random and root-alignment representations, RPSubAlign achieves better performance on the USPTO-50K and USPTO-MIT datasets, with up to a 34.8% increase in Top-N accuracy (with the Self-Referencing Embedded Strings, SELFIES, representation) and enhanced stability across various data augmentation scenarios. RPSubAlign also significantly improves syntactic validity, reaching 86.64% on USPTO-50K and 96.45% on USPTO-MIT (with the Simplified Molecular Input Line Entry System, SMILES, representation), outperforming baseline methods. These results highlight RPSubAlign as a robust, effective molecular characterization method for retrosynthesis prediction. Code for RPSubAlign is available at https://github.com/Aminoacid1226/RPSubAlign.
Bibliography: Yuting Hu, Feng Hu and Hongwen Zhang contribute equally to this work.