SORFPP: Enhancing rich sequence-driven information to identify SEPs based on fused framework on validation datasets
| Published in | PLoS ONE, Vol. 20, no. 4, p. e0320314 |
|---|---|
| Main Authors | , , |
| Format | Journal Article |
| Language | English |
| Published | United States: Public Library of Science (PLoS), 28.04.2025 |
| ISSN | 1932-6203 |
| DOI | 10.1371/journal.pone.0320314 |
| Summary: | Genome sequencing has enabled the discovery of functional peptides encoded by short open reading frames (sORFs) in long non-coding RNAs (lncRNAs). sORF-encoded peptides (SEPs) regulate gene expression, signaling, and other cellular processes, and play significant roles that set them apart from common peptides. Various computational methods have been proposed, but they lack contributive features and effective models. A high-throughput computational method to predict SEPs is therefore needed.
We propose a computational method, SORFPP, that predicts SEPs by mining feature information from multiple perspectives in an experimentally validated dataset from TranLnc. SORFPP extracts SEP sequence information using the protein language model ESM-2 and curated traditional encodings, including QSOrder and k-mer, among others. It uses CatBoost to handle the sparsity of the traditional encodings and a self-attention model to analyze the ESM-2 pre-trained representations. Finally, an ensemble learning framework combines the two models, and their outputs are fed into a logistic regression model for accurate and robust predictions. In comparison, SORFPP outperforms other state-of-the-art models in Matthews correlation coefficient by 12.2%-24.2% on three benchmark datasets.
Integrating the ensemble learning strategy with contributive traditional features and protein language model encodings improves performance. Datasets and code are accessible at https://doi.org/10.6084/m9.figshare.28079897 and http://111.229.198.94:5000/. |
|---|---|
| Bibliography: | Competing Interests: The authors have declared that no competing interests exist. |
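
The summary names k-mer encoding as one of the traditional sequence features and reports results as the Matthews correlation coefficient (MCC). The sketch below illustrates both ideas in plain Python. It is not the authors' implementation: the function names `kmer_frequencies` and `matthews_corrcoef`, the choice of `k = 2`, and the toy peptide and labels are assumptions made for illustration only.

```python
from collections import Counter
from itertools import product
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_frequencies(peptide: str, k: int = 2) -> list[float]:
    """Normalized k-mer frequencies in a fixed lexicographic order.

    Hypothetical helper: windows containing non-standard residues
    are skipped, so the vector sums to 1 whenever at least one
    valid window exists.
    """
    vocabulary = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    allowed = set(AMINO_ACIDS)
    windows = [
        peptide[i:i + k]
        for i in range(len(peptide) - k + 1)
        if set(peptide[i:i + k]) <= allowed
    ]
    if not windows:
        return [0.0] * len(vocabulary)
    counts = Counter(windows)
    return [counts[kmer] / len(windows) for kmer in vocabulary]


def matthews_corrcoef(y_true: list[int], y_pred: list[int]) -> float:
    """MCC for binary labels (1 = SEP, 0 = non-SEP in this toy setup)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


features = kmer_frequencies("MKKL", k=2)  # toy peptide, 400-dim vector
score = matthews_corrcoef([1, 1, 0, 0], [1, 0, 0, 0])
```

With `k = 2` the feature vector has 400 entries, most of them zero for a short peptide, which is the kind of sparsity the summary says CatBoost is used to handle. MCC ranges from -1 to 1 and uses all four confusion-matrix counts, so it stays informative when the classes are imbalanced.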