QuoteTarget: A sequence‐based transformer protein language model to identify potentially druggable protein targets

Bibliographic Details
Published in	Protein Science, Vol. 32, no. 2, e4555
Main Authors	Chen, Jiaxiao; Gu, Zhonghui; Xu, Youjun; Deng, Minghua; Lai, Luhua; Pei, Jianfeng
Format	Journal Article
Language	English
Published	Hoboken, USA: John Wiley & Sons, Inc; Wiley Subscription Services, Inc, 01.02.2023
Subjects	Accuracy; Algorithms; Alliances; Amino Acid Sequence; Artificial neural networks; binding site inference; Binding Sites; Computer applications; deep learning; Drug development; druggable protein; graph convolutional network; Humans; Identification methods; Information processing; Language; Neural networks; Neural Networks, Computer; Protein structure; Proteins; Proteins - chemistry; Residues; sequence‐based; Target recognition; Therapeutic targets; Tools for Protein Science; transformer
ISSN	0961-8368, 1469-896X
DOI	10.1002/pro.4555

Summary:	The development of efficient computational methods for drug target protein identification can compensate for the high cost of experiments and is therefore of great significance for drug development. However, existing structure‐based drug target protein‐identification algorithms are limited by the insufficient number of proteins with experimentally resolved structures. Moreover, sequence‐based algorithms cannot effectively extract information from protein sequences and thus display insufficient accuracy. Here, we combined the sequence‐based self‐supervised pretraining protein language model ESM1b with a graph convolutional neural network classifier to develop an improved, sequence‐based drug target protein identification method. This complete model, named QuoteTarget, efficiently encodes proteins based on sequence information alone and achieves an accuracy of 95% on the nonredundant drug target and nondrug target datasets constructed for this study. When applied to all proteins from Homo sapiens, QuoteTarget identified 1213 potential undeveloped drug target proteins. We further inferred residue‐binding weights from the well‐trained network using the gradient‐weighted class activation mapping (Grad-CAM) algorithm. Notably, we found that without any binding site information as input, significant residues inferred by the model closely match the experimentally confirmed drug molecule‐binding sites. Thus, our work provides a highly effective sequence‐based identifier for drug target proteins, as well as new insights into recognizing drug molecule‐binding sites. The entire model is available at https://github.com/Chenjxjx/drug-target-prediction.
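The Grad-CAM step mentioned in the summary can be illustrated with a toy sketch. It is not the QuoteTarget implementation: the four-residue contact graph, the two-dimensional residue features (standing in for ESM1b embeddings), the single graph convolution layer, the mean-pooling linear head, and every function name below are assumptions made for illustration only. With mean pooling followed by a linear head, the gradient of the class score with respect to each hidden feature has a closed form, so no autograd library is needed here.

```python
# Toy sketch of Grad-CAM-style residue weighting on a one-layer graph network.
# All inputs, weights, and names are hypothetical; none come from QuoteTarget.

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def normalize_adjacency(adj):
    """Add self-loops and row-normalize (a simple stand-in for GCN normalization)."""
    n = len(adj)
    looped = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    return [[v / sum(row) for v in row] for row in looped]

def gcn_layer(adj_norm, feats, weight):
    """One graph convolution: H = ReLU(A_norm @ X @ W)."""
    pre = matmul(matmul(adj_norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in pre]

def class_score(hidden, head):
    """Mean-pool hidden features over residues, then apply a linear head."""
    n = len(hidden)
    pooled = [sum(col) / n for col in zip(*hidden)]
    return sum(p * w for p, w in zip(pooled, head))

def grad_cam_residue_weights(hidden, head):
    """Grad-CAM: weight each channel by its mean gradient, sum, then ReLU.

    For mean pooling plus a linear head, d(score)/d(hidden[i][k]) = head[k] / n
    for every residue i, so the channel weight alpha_k is also head[k] / n.
    """
    n = len(hidden)
    alphas = [w / n for w in head]
    return [max(0.0, sum(a * h for a, h in zip(alphas, row))) for row in hidden]

# Hypothetical 4-residue chain: contacts 0-1, 1-2, 2-3.
adj = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # stand-in embeddings
weight = [[1.0, 0.0], [0.0, 1.0]]                          # arbitrary layer weights
head = [1.0, -0.5]                                         # arbitrary class head

hidden = gcn_layer(normalize_adjacency(adj), feats, weight)
score = class_score(hidden, head)
weights = grad_cam_residue_weights(hidden, head)
print("class score:", score)
print("residue weights:", weights)
```

Residues with larger weights contribute more to the class score in this toy model. In a trained network the gradients would normally come from automatic differentiation rather than a closed form.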
Bibliography:	Funding information: Chinese Academy of Medical Sciences, Grant/Award Number: 2021‐I2M‐5‐014; National Natural Science Foundation of China, Grant/Award Number: 22033001. Review Editor: Nir Ben‐Tal