SORFPP: Enhancing rich sequence-driven information to identify SEPs based on fused framework on validation datasets

Bibliographic Details
Published in	PloS one Vol. 20; no. 4; p. e0320314
Main Authors	Feng, Hongqi; Nie, Qi; Yang, Sen
Format	Journal Article
Language	English
Published	United States: Public Library of Science (PLoS), 28.04.2025
Subjects	Algorithms; Amino acid sequence; Amino acids; Analysis; Biology and Life Sciences; Computational Biology - methods; Computer and Information Sciences; Computer applications; Correlation coefficient; Correlation coefficients; Datasets; Deep learning; DNA sequencing; Ensemble learning; Gene expression; Gene sequencing; Genomes; Genomics; Humans; Identification; Identification and classification; Machine learning; Methods; MicroRNAs; Nucleotide sequencing; Open reading frames; Open Reading Frames - genetics; Peptides; Peptides - genetics; Proteins; Regression models; Research and Analysis Methods; RNA, Long Noncoding - genetics; Social Sciences; Software; Whole genome sequencing; Taiwan China
ISSN	1932-6203
DOI	10.1371/journal.pone.0320314

Summary:	Genome sequencing has made it possible to find functional peptides encoded by short open reading frames (sORFs) in long non-coding RNAs (lncRNAs). Unlike common peptides, sORF-encoded peptides (SEPs) play significant roles in regulating gene expression, signaling, and other processes. Various computational methods have been proposed, but they lack contributive features and effective models, so a high-throughput computational method to predict SEPs is still needed. We propose SORFPP, a computational method that predicts SEPs by mining feature information from multiple perspectives in an experimentally validated dataset from TranLnc. SORFPP extracts SEP sequence information using the protein language model ESM-2 together with curated traditional encodings such as QSOrder and k-mer. It uses CatBoost to handle the sparsity of the traditional encodings and a self-attention model to analyze the ESM-2 pre-trained representations. Finally, an ensemble learning framework combines the two models, and their outputs are fed into a logistic regression model for accurate and robust predictions. In comparison, SORFPP outperforms other state-of-the-art models in Matthews correlation coefficient by 12.2%-24.2% on three benchmark datasets. Integrating the ensemble learning strategy with contributive traditional features and protein language model encodings yields better performance. Datasets and code are accessible at https://doi.org/10.6084/m9.figshare.28079897 and http://111.229.198.94:5000/.
Bibliography:	Competing Interests: The authors have declared that no competing interests exist.
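
Illustrative note: the Summary mentions k-mer encoding of peptide sequences and evaluation by the Matthews correlation coefficient. The sketch below shows, under stated assumptions, what those two pieces could look like in plain Python. It is not the authors' code: the function names `kmer_frequencies` and `matthews_corrcoef`, the 20-letter amino acid alphabet, and the choice of k = 2 are assumptions made for illustration only.

```python
from collections import Counter
from itertools import product
from math import sqrt

# Assumption: the 20 standard amino acids form the k-mer alphabet.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_frequencies(sequence, k=2):
    """Return normalized k-mer frequencies as a fixed-length vector.

    The vector has 20**k entries, one per possible k-mer, in
    lexicographic order of AMINO_ACIDS. Windows that contain a
    non-standard residue are skipped.
    """
    sequence = sequence.upper()
    windows = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(w for w in windows if all(c in AMINO_ACIDS for c in w))
    total = sum(counts.values())
    vocabulary = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    return [counts[v] / total if total else 0.0 for v in vocabulary]


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1).

    Returns 0.0 when the denominator is zero, a common convention.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


if __name__ == "__main__":
    features = kmer_frequencies("MKKLA", k=2)
    nonzero = sum(1 for value in features if value > 0)
    print(len(features), nonzero)
    print(matthews_corrcoef([1, 1, 0, 0], [1, 0, 0, 0]))
```

For a short peptide, almost all of the 400 dipeptide entries are zero, which gives a sense of the sparsity that the Summary says is handled with CatBoost.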