BERT-5mC: an interpretable model for predicting 5-methylcytosine sites of DNA based on BERT

Bibliographic Details
Published in PeerJ (San Francisco, CA) Vol. 11; p. e16600
Main Authors Wang, Shuyu; Liu, Yinbo; Liu, Yufeng; Zhang, Yong; Zhu, Xiaolei
Format Journal Article
Language English
Published United States PeerJ. Ltd 08.12.2023
PeerJ, Inc
ISSN 2167-8359
DOI 10.7717/peerj.16600

Summary: DNA 5-methylcytosine (5mC) is widely present in multicellular eukaryotes and plays important roles in various developmental and physiological processes and in a wide range of human diseases. Accurate detection of 5mC sites is therefore essential. Although current sequencing technologies can map genome-wide 5mC sites, these experimental methods are both costly and time-consuming. To achieve fast and accurate prediction of 5mC sites, we propose a new computational approach, BERT-5mC. First, we pre-trained a domain-specific BERT (bidirectional encoder representations from transformers) model, using human promoter sequences as the language corpus. BERT is a deep bidirectional language representation model based on the Transformer architecture. Second, we fine-tuned the domain-specific BERT model on the 5mC training dataset to build the predictor. Cross-validation results show that our model achieves an AUROC of 0.966, higher than other state-of-the-art methods such as iPromoter-5mC, 5mC_Pred, and BiLSTM-5mC. On the independent test set, our model also achieves an AUROC of 0.966, again higher than the other state-of-the-art methods. Moreover, we analyzed the attention weights generated by BERT to identify a number of nucleotide distributions that are closely associated with 5mC modifications. To facilitate the use of our model, we built a web server that can be freely accessed at http://5mc-pred.zhulab.org.cn.
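The summary reports performance as AUROC (area under the receiver operating characteristic curve). As a minimal sketch of what that metric measures, and not the authors' evaluation code, the function below computes AUROC from binary labels and predicted scores using the rank-sum (Mann-Whitney) formulation. The function name `auroc` and the toy labels and scores are hypothetical and chosen only for illustration; they are not data from the paper.

```python
def auroc(labels, scores):
    """Return the AUROC: the probability that a randomly chosen positive
    example receives a higher score than a randomly chosen negative one
    (tied scores count as half)."""
    pairs = sorted(zip(scores, labels))  # ascending by score
    n = len(pairs)

    # Assign 1-based ranks, giving tied scores their average rank.
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative example")

    # Mann-Whitney U statistic for the positive class, normalised to [0, 1].
    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    u_stat = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u_stat / (n_pos * n_neg)


if __name__ == "__main__":
    # Hypothetical toy data: 1 = methylated site, 0 = unmethylated site.
    toy_labels = [0, 0, 1, 1]
    toy_scores = [0.10, 0.40, 0.35, 0.80]
    print(auroc(toy_labels, toy_scores))
```

An AUROC of 0.5 corresponds to random ranking and 1.0 to perfect separation of positive and negative sites, which gives context for the reported value of 0.966.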