BERT-Promoter: An improved sequence-based predictor of DNA promoter using BERT pre-trained model and SHAP feature selection

Bibliographic Details
Published in Computational biology and chemistry Vol. 99; p. 107732
Main Authors Le, Nguyen Quoc Khanh, Ho, Quang-Thai, Nguyen, Van-Nui, Chang, Jung-Su
Format Journal Article
Language English
Published Elsevier Ltd 01.08.2022
Subjects
ISSN 1476-9271
1476-928X
DOI 10.1016/j.compbiolchem.2022.107732

Abstract A promoter is a DNA sequence that initiates transcription and regulates when and where genes are expressed in the organism. Because of this importance in molecular biology, identifying DNA promoters is a challenging task that can provide useful information about their functions and related diseases. Over the past decade, several computational models have been developed to predict promoters from high-throughput sequencing data. Although some useful predictors have been proposed, these models still have shortfalls, and there is an urgent need to improve predictive performance to meet practical requirements. In this study, we propose a novel architecture that incorporates transformer-based natural language processing (NLP) and explainable machine learning to address this problem. More specifically, a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model was employed to encode DNA sequences, and SHapley Additive exPlanations (SHAP) analysis served as a feature selection step to identify the top-ranked BERT encodings. In the final stage, different machine learning classifiers were trained on the top features to produce the prediction outcomes. This study predicted not only DNA promoters but also their activities (strong or weak promoters). Overall, experiments showed accuracies of 85.5% and 76.9% for these two levels, respectively. Our model outperformed previously published predictors on the same dataset in most evaluation metrics. We named our predictor BERT-Promoter, and it is freely available at https://github.com/khanhlee/bert-promoter.
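The feature selection and two-level prediction steps described in the abstract can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: it assumes per-sample attribution scores (for example, SHAP values) have already been computed for each embedding dimension, ranks dimensions by mean absolute attribution (a common convention, assumed here), and uses placeholder classifiers. All function names and the toy numbers are hypothetical.

```python
from statistics import fmean


def rank_features_by_mean_abs(attributions):
    """Return feature indices ordered by mean |attribution| over samples, highest first."""
    n_features = len(attributions[0])
    importance = [
        fmean(abs(row[j]) for row in attributions)
        for j in range(n_features)
    ]
    return sorted(range(n_features), key=lambda j: importance[j], reverse=True)


def select_top_k(embeddings, attributions, k):
    """Keep only the k highest-ranked embedding dimensions (original column order preserved)."""
    keep = sorted(rank_features_by_mean_abs(attributions)[:k])
    reduced = [[row[j] for j in keep] for row in embeddings]
    return reduced, keep


def two_level_predict(features, is_promoter, is_strong):
    """Level 1: promoter vs. non-promoter. Level 2: strong vs. weak, only for promoters."""
    if not is_promoter(features):
        return "non-promoter"
    return "strong promoter" if is_strong(features) else "weak promoter"


# Toy data: 2 sequences, 4 embedding dimensions (hypothetical values).
embeddings = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
attributions = [[0.1, -0.9, 0.0, 0.4], [-0.2, 0.7, 0.1, -0.5]]

reduced, kept = select_top_k(embeddings, attributions, k=2)

# Placeholder threshold rules standing in for trained classifiers.
labels = [
    two_level_predict(row, lambda f: f[0] > 1.5, lambda f: f[1] > 5.0)
    for row in reduced
]
print(kept, reduced, labels)
```

In a real pipeline the embeddings would come from a pre-trained BERT model, the attributions from a SHAP explainer fitted on a trained classifier, and the two placeholder rules would be replaced by trained models, one per prediction level.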
ArticleNumber 107732
Author Ho, Quang-Thai
Chang, Jung-Su
Nguyen, Van-Nui
Le, Nguyen Quoc Khanh
Author_xml – sequence: 1
  givenname: Nguyen Quoc Khanh
  surname: Le
  fullname: Le, Nguyen Quoc Khanh
  email: khanhlee@tmu.edu.tw
  organization: Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei City 106, Taiwan
– sequence: 2
  givenname: Quang-Thai
  surname: Ho
  fullname: Ho, Quang-Thai
  organization: College of Information & Communication Technology, Can Tho University, Viet Nam
– sequence: 3
  givenname: Van-Nui
  surname: Nguyen
  fullname: Nguyen, Van-Nui
  organization: University of Information and Communication Technology, Thai Nguyen University, Thai Nguyen, Viet Nam
– sequence: 4
  givenname: Jung-Su
  surname: Chang
  fullname: Chang, Jung-Su
  organization: School of Nutrition and Health Sciences, College of Nutrition, Taipei Medical University, Taipei 110, Taiwan
ContentType Journal Article
Copyright 2022 Elsevier Ltd
Copyright © 2022 Elsevier Ltd. All rights reserved.
Discipline Chemistry
Biology
EISSN 1476-928X
ExternalDocumentID 10_1016_j_compbiolchem_2022_107732
S1476927122001128
IsPeerReviewed true
IsScholarly true
Keywords Explainable artificial intelligence
SHAP
Promoter region
BERT multilingual cases
EXtreme Gradient Boosting
Contextualized word embedding
Language English
PublicationDate 2022-08-01
PublicationTitle Computational biology and chemistry
PublicationYear 2022
Publisher Elsevier Ltd
URI https://dx.doi.org/10.1016/j.compbiolchem.2022.107732
https://www.proquest.com/docview/2693779432
Volume 99