A k-mer Based Approach for SARS-CoV-2 Variant Identification

Bibliographic Details
Published in	Bioinformatics Research and Applications Vol. 13064; pp. 153 - 164
Main Authors	Ali, Sarwan; Sahoo, Bikram; Ullah, Naimat; Zelikovskiy, Alexander; Patterson, Murray; Khan, Imdadullah
Format	Book Chapter
Language	English
Published	Switzerland: Springer International Publishing AG, 2021
Series	Lecture Notes in Computer Science
ISBN	9783030914141 3030914143
ISSN	0302-9743 1611-3349
DOI	10.1007/978-3-030-91415-8_14

Summary:	With the rapid spread of the novel coronavirus (COVID-19) across the globe and its continuous mutation, it is of pivotal importance to design a system to identify different known (and unknown) variants of SARS-CoV-2. Identifying particular variants helps to understand and model their spread patterns, design effective mitigation strategies, and prevent future outbreaks. It also plays a crucial role in studying the efficacy of known vaccines against each variant, and modeling the likelihood of breakthrough infections. It is well known that the spike protein contains most of the information/variation pertaining to coronavirus variants. In this paper, we use spike sequences to classify different variants of the human SARS-CoV-2. We show that preserving order information of the amino acids helps the underlying classifiers to achieve better performance. We also show that we can train our model to outperform the baseline algorithms using only a small number of training samples (1% of the data). Finally, we show the importance of the different amino acids which play a key role in identifying variants and how they coincide with those reported by the USA’s Centers for Disease Control and Prevention (CDC).
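
As a rough illustration of what "preserving order information of the amino acids" can mean for a k-mer approach, the sketch below turns a protein sequence into a vector of overlapping k-mer counts. It is a minimal sketch, not the authors' implementation: the summary does not state the value of k, the alphabet, or the exact feature representation, so `k=3`, the 20-letter `AMINO_ACIDS` alphabet, the function names `kmers` and `kmer_vector`, and the toy sequence are all assumptions made here for illustration.

```python
from collections import Counter
from itertools import product

# Assumption: the 20 standard amino-acid letters. The summary does not
# say which alphabet the chapter uses.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmers(sequence, k=3):
    """Return every overlapping substring of length k, left to right."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmer_vector(sequence, k=3, alphabet=AMINO_ACIDS):
    """Count each possible k-mer over `alphabet` in `sequence`.

    The result has len(alphabet) ** k entries in a fixed order, so
    sequences of different lengths map to vectors of the same size.
    k-mers containing letters outside `alphabet` are ignored.
    """
    counts = Counter(kmers(sequence, k))
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    return [counts.get(kmer, 0) for kmer in vocabulary]


# Toy input, not a real spike sequence.
toy_sequence = "MFVFL"
toy_kmers = kmers(toy_sequence, k=3)
toy_vector = kmer_vector(toy_sequence, k=3)
```

Each k-mer keeps the local order of k consecutive residues, which a plain per-residue count would discard. Fixed-length vectors of this kind could then be passed to any standard classifier; which classifiers the chapter evaluates is not stated in the summary.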