Alignment-free identification of COI DNA barcode data with the Python package Alfie


Bibliographic Details
Published in bioRxiv
Main Authors Nugent, Cameron M; Adamowicz, Sarah J
Format Paper
Language English
Published Cold Spring Harbor: Cold Spring Harbor Laboratory Press, 29.06.2020
Cold Spring Harbor Laboratory
Edition 1.1
ISSN 2692-8205
DOI 10.1101/2020.06.29.177634

Summary: Characterization of biodiversity from environmental DNA samples and bulk metabarcoding data is hampered by off-target sequences that can confound conclusions about a taxonomic group of interest. Existing methods for isolation of target sequences rely on alignment to existing reference barcodes, but this can bias results against novel genetic variants. Effectively parsing targeted DNA barcode data from off-target noise improves the quality of biodiversity estimates and biological conclusions by limiting subsequent analyses to a relevant subset of available data. Here, we present Alfie, a Python package for the alignment-free classification of cytochrome c oxidase subunit I (COI) DNA barcode sequences to taxonomic kingdoms. The package determines k-mer frequencies of DNA sequences, and the frequencies serve as input for a neural network classifier that was trained and tested using ~58,000 publicly available COI sequences. The classifier was designed and optimized through a series of tests that allowed for the optimal set of DNA k-mer features and optimal machine learning algorithm to be selected. The neural network classifier rapidly assigns COI sequences to kingdoms with greater than 99% accuracy and is shown to generalize effectively and make accurate predictions about data from previously unseen taxonomic classes. The package contains an application programming interface that allows the Alfie package's functionality to be extended to different DNA sequence classification tasks to suit a user's need, including classification of different genes and barcodes, and classification to different taxonomic levels. Alfie is free and publicly available through GitHub (https://github.com/CNuge/alfie) and the Python package index (https://pypi.org/project/alfie/).

Competing Interest Statement: The authors have declared no competing interest.

Footnotes:
* https://github.com/CNuge/alfie
* https://github.com/CNuge/data-alfie
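The k-mer frequency step mentioned in the summary can be illustrated with a minimal, generic sketch. This is not Alfie's own code or API: the function name `kmer_frequencies`, the default `k=4`, and the choice to skip windows containing ambiguous bases are all assumptions made for illustration. It shows only how a DNA sequence might be turned into a fixed-length numeric vector of the kind a classifier could take as input.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=4, alphabet="ACGT"):
    """Return relative frequencies of all possible k-mers in seq.

    Hypothetical helper for illustration; not part of the Alfie package.
    """
    seq = seq.upper()
    valid = set(alphabet)
    # Enumerate all len(alphabet)**k k-mers in a fixed order so that
    # vectors from different sequences are directly comparable.
    all_kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    # Count overlapping windows; skip windows with ambiguous bases (e.g. N).
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= valid
    )
    total = sum(counts.values())
    # Normalise counts to frequencies; an all-zero vector if nothing was counted.
    return [counts[kmer] / total if total else 0.0 for kmer in all_kmers]


# Example: 2-mer frequencies of a short toy sequence (not real COI data).
freqs = kmer_frequencies("ACGTACGTNACGT", k=2)
```

Because the vector length depends only on `k` and the alphabet, sequences of different lengths map to features of the same size, which is what makes an alignment-free comparison possible.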