FuzzyID2: A software package for large data set species identification via barcoding and metabarcoding using hidden Markov models and fuzzy set methods

Bibliographic Details
Published in	Molecular ecology resources Vol. 18; no. 3; pp. 666 - 675
Main Authors	Shi, Zhi‐yong, Yang, Cai‐qing, Hao, Meng‐di, Wang, Xiao‐yang, Ward, Robert D., Zhang, Ai‐bing
Format	Journal Article
Language	English
Published	England Wiley Subscription Services, Inc 01.05.2018
Subjects	algorithms; animals; Bar codes; barcoding; Biodiversity; Computer programs; Computer simulation; computer software; data collection; Datasets; Deoxyribonucleic acid; DNA; DNA barcoding; Ecological monitoring; Ecological studies; eDNA; Environmental DNA; fuzzy membership function; Fuzzy sets; Gene sequencing; Genetic distance; hidden Markov models; high‐throughput sequencing (HTS); Identification; Identification methods; libraries; Markov chain; Markov chains; metabarcoding; Nucleotide sequence; nucleotide sequences; plant barcodes; Searching; Software; Species; species identification
ISSN	1755-098X, 1755-0998
DOI	10.1111/1755-0998.12738

Summary:	Species identification through DNA barcoding or metabarcoding has become a key approach for biodiversity evaluation and ecological studies. However, the rapid accumulation of barcoding data has created some difficulties: for instance, global enquiries to a large reference library can take a very long time. We here devise a two‐step searching strategy to speed identification procedures of such queries. This firstly uses a Hidden Markov Model (HMM) algorithm to narrow the searching scope to genus level and then determines the corresponding species using minimum genetic distance. Moreover, using a fuzzy membership function, our approach also estimates the credibility of assignment results for each query. To perform this task, we developed a new software pipeline, FuzzyID2, using Python and C++. Performance of the new method was assessed using eight empirical data sets ranging from 70 to 234,535 barcodes. Five data sets (four animal, one plant) deployed the conventional barcode approach, one used metabarcodes, and two were eDNA‐based. The results showed mean accuracies of generic and species identification of 98.60% (with a minimum of 95.00% and a maximum of 100.00%) and 94.17% (with a range of 84.40%–100.00%), respectively. Tests with simulated NGS sequences based on realistic eDNA and metabarcode data demonstrated that FuzzyID2 achieved a significantly higher identification success rate than the commonly used Blast method, and the TIPP method tends to find many fewer species than either FuzzyID2 or Blast. Furthermore, data sets with tens of thousands of barcodes need only a few seconds for each query assignment using FuzzyID2. Our approach provides an efficient and accurate species identification protocol for biodiversity‐related projects with large DNA sequence data sets.
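The second step of the strategy described in the summary (species assignment by minimum genetic distance, followed by a fuzzy credibility score) can be illustrated with a minimal Python sketch. This is not FuzzyID2's code or API. It assumes the HMM step has already narrowed the search to one genus, that all sequences are pre-aligned to equal length, and that genetic distance is a simple uncorrected p-distance. The linear membership function and its `within` and `between` thresholds are hypothetical placeholders, as are all function names.

```python
from math import inf
from typing import Dict, List, Tuple


def p_distance(a: str, b: str) -> float:
    """Uncorrected p-distance: share of differing sites among aligned A/C/G/T sites."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # Skip gaps and ambiguity codes so only unambiguous sites are compared.
    sites = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x in "ACGT" and y in "ACGT"]
    if not sites:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in sites) / len(sites)


def membership(distance: float, within: float, between: float) -> float:
    """Hypothetical linear fuzzy membership (assumption, not the published function).

    Returns 1.0 at or below `within`, 0.0 at or above `between`,
    and interpolates linearly in between.
    """
    if distance <= within:
        return 1.0
    if distance >= between:
        return 0.0
    return (between - distance) / (between - within)


def assign_species(query: str,
                   genus_refs: Dict[str, List[str]],
                   within: float = 0.02,
                   between: float = 0.10) -> Tuple[str, float, float]:
    """Pick the species whose reference barcode is closest to the query.

    `genus_refs` maps species name -> reference sequences for the single
    genus selected by an earlier step (assumed here, not implemented).
    Returns (species, minimum distance, membership score).
    """
    best_species, best_distance = "", inf
    for species, sequences in genus_refs.items():
        for reference in sequences:
            d = p_distance(query, reference)
            if d < best_distance:
                best_species, best_distance = species, d
    if not best_species:
        raise ValueError("no reference sequences supplied")
    return best_species, best_distance, membership(best_distance, within, between)


if __name__ == "__main__":
    # Toy references with made-up species names and sequences.
    refs = {
        "Genus alpha": ["ACGTACGTACGTACGTACGT"],
        "Genus beta": ["ACGTTCGTACGAACGTACCA"],
    }
    print(assign_species("ACGTACGTACGTACGTACGA", refs))
```

Restricting the distance scan to one genus is what keeps the per-query cost low in this kind of two-step design: the loop runs over the references of a single genus rather than over the whole library.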