Protein Sequence Classification Using Bidirectional Encoder Representations from Transformers (BERT) Approach

Bibliographic Details
Published in: SN Computer Science, Vol. 4, No. 5, p. 481
Main Authors: Balamurugan, R.; Mohite, Saurabh; Raja, S. P.
Format: Journal Article
Language: English
Published: Singapore: Springer Nature Singapore (Springer Nature B.V.), 01.09.2023
ISSN: 2661-8907; 2662-995X
DOI: 10.1007/s42979-023-01980-1

Summary: Proteins play a vital role by carrying out a number of activities within an organism to sustain its life. The field of Natural Language Processing has successfully adapted deep learning to gain a better insight into the semantic nature of languages. In this paper, we propose semantic approaches based on deep learning to work with protein amino acid sequences and compare the performance of these approaches with traditional classifiers in predicting their respective families. The Bidirectional Encoder Representations from Transformers (BERT) approach was tested on 103 protein families from the UniProt consortium database. The results show an average prediction accuracy of 99.02%, a testing accuracy of 97.70%, and a validation accuracy of 97.69%; Normalized Mutual Information (NMI) scores of 98.45 on the overall data, 96.99 on the test data, and 96.93 on the validation data; high weighted average F1 scores of 99.02 on the overall data, 97.72 on the test data, and 97.70 on the validation data; and high macro average F1 scores of 99.00 on the overall data and 98.00 on both the test and validation data. These results indicate that the proposed approach outperforms the existing approaches.
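
As an illustration of the pipeline summarized above, the following Python sketch shows how a BERT-style protein language model with a classification head could be applied to family prediction. Everything here is an assumption for illustration: the paper does not publish its code, the checkpoint name Rostlab/prot_bert and the toy sequence are placeholders, and the classification head below is randomly initialized and would need fine-tuning on labeled family data before its predictions are meaningful.

    import torch
    from transformers import BertTokenizer, BertForSequenceClassification

    NUM_FAMILIES = 103  # number of UniProt families used in the paper

    # ProtBert-style tokenizers expect space-separated amino-acid letters.
    tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
    model = BertForSequenceClassification.from_pretrained(
        "Rostlab/prot_bert", num_labels=NUM_FAMILIES  # classifier head starts untrained
    )

    sequence = "M K T A Y I A K Q R Q I S F V K S"  # toy amino-acid sequence
    inputs = tokenizer(sequence, return_tensors="pt", truncation=True, max_length=512)

    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, NUM_FAMILIES)
    predicted_family = logits.argmax(dim=-1).item()
    print(f"Predicted family index: {predicted_family}")

The accuracy, NMI, and F1 figures quoted in the summary can be computed with standard scikit-learn functions; the label arrays below are placeholders rather than the paper's data.

    from sklearn.metrics import accuracy_score, f1_score, normalized_mutual_info_score

    y_true = [0, 1, 2, 2, 1, 0]  # hypothetical true family labels
    y_pred = [0, 1, 2, 1, 1, 0]  # hypothetical model predictions

    print("Accuracy:", accuracy_score(y_true, y_pred))
    print("NMI:", normalized_mutual_info_score(y_true, y_pred))
    print("Weighted F1:", f1_score(y_true, y_pred, average="weighted"))
    print("Macro F1:", f1_score(y_true, y_pred, average="macro"))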