AbLang: an antibody language model for completing antibody sequences

Bibliographic Details
Published in Bioinformatics Advances Vol. 2; no. 1; p. vbac046
Main Authors Olsen, Tobias H; Moal, Iain H; Deane, Charlotte M
Format Journal Article
Language English
Published England Oxford University Press 2022
ISSN 2635-0041
DOI 10.1093/bioadv/vbac046

Summary:
Motivation: General protein language models have been shown to summarize the semantics of protein sequences into representations that are useful for state-of-the-art predictive methods. However, for antibody-specific problems, such as restoring residues lost due to sequencing errors, a model trained solely on antibodies may be more powerful. Antibodies are one of the few protein types for which the volume of sequence data needed for such language models is available, e.g. in the Observed Antibody Space (OAS) database.
Results: Here, we introduce AbLang, a language model trained on the antibody sequences in the OAS database. We demonstrate the power of AbLang by using it to restore missing residues in antibody sequence data, a key issue in B-cell receptor repertoire sequencing: for example, over 40% of OAS sequences are missing the first 15 amino acids. AbLang restores the missing residues of antibody sequences better than using IMGT germlines or the general protein language model ESM-1b. Further, AbLang does not require knowledge of the germline of the antibody and is seven times faster than ESM-1b.
Availability and implementation: AbLang is a Python package available at https://github.com/oxpig/AbLang.
Supplementary information: Supplementary data are available at Bioinformatics Advances online.