Analyzing taxonomic classification using extensible Markov models
| Published in | Bioinformatics Vol. 26; no. 18; pp. 2235 - 2241 |
|---|---|
| Format | Journal Article |
| Language | English |
| Published | Oxford: Oxford University Press, 15.09.2010 |
| ISSN | 1367-4803, 1367-4811, 1460-2059 |
| DOI | 10.1093/bioinformatics/btq349 |
| Summary: | Motivation: As next-generation sequencing is rapidly adding new genomes, their correct placement in the taxonomy needs verification. However, the current methods for confirming the classification of a taxon or suggesting a revision for a potential misplacement rely on computationally intense multi-sequence alignment followed by an iterative adjustment of the distance matrix. Due to intra-heterogeneity issues with the 16S rRNA marker, no classifier is available at the sub-genus level that could readily suggest a classification for a novel 16S rRNA sequence. Metagenomics further complicates the issue by generating fragmented 16S rRNA sequences. This article proposes a novel alignment-free method for representing microbial profiles using extensible Markov models (EMMs) with an extended Karlin–Altschul statistical framework similar to the classic alignment paradigm. We propose a log odds (LODs) score classifier based on the Gumbel difference distribution that confirms correct classifications with statistical significance qualifications and suggests revisions where necessary. Results: We tested our method by generating a sub-genus level classifier with which we re-evaluated the classifications of 676 microbial organisms using the NCBI FTP database for the 16S rRNA. The results confirm the current classification for all genera while ascertaining significance at 95%. Furthermore, this novel classifier isolates heterogeneity issues to a mere 12 strains while confirming classifications with significance qualification for the remaining 98%. The models require less memory than multi-sequence alignments and have better time complexity than the current methods. The classifier operates at the sub-genus level, and thus outperforms the naive Bayes classifier of the RNA Database Project, where much of the taxonomic analysis is available online. Finally, using information redundancy in model building, we show that the method applies to metagenomic fragment classification of 19 Escherichia coli strains. Availability and implementation: Source code and binaries are freely available for download at http://lyle.smu.edu/IDA/EMMSA/, implemented in Java and supported on MS Windows. Contact: mallik@kotamarti.com; mhd@lyle.smu.edu Supplementary information: Supplementary data are available at Bioinformatics online. |
|---|---|
| Bibliography: | ArticleID: btq349; istex: EBAE57D4525D412FFE083A49DA5C05D948C24DEC; ark:/67375/HXZ-XK4F5W16-6; Associate Editor: Burkhard Rost |
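
The alignment-free log-odds idea in the summary can be illustrated with a minimal sketch. This is not the authors' EMM implementation: it assumes a plain first-order Markov chain over overlapping k-mers with add-alpha smoothing, and it omits the Karlin–Altschul and Gumbel-based significance step entirely. All function names and the parameters `k` and `alpha` are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


def kmers(seq, k=3):
    """Return the overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def train_markov(seqs, k=3):
    """Count k-mer to next-k-mer transitions over the training sequences.

    Assumption: a simple transition-count table stands in for a taxon model.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        ks = kmers(s, k)
        for a, b in zip(ks, ks[1:]):
            counts[a][b] += 1
    return counts


def log_prob(counts, seq, k=3, alpha=1.0):
    """Log-probability of the transitions in seq under a trained model.

    With overlapping k-mers over a DNA alphabet, each k-mer has four
    possible successors, hence the 4 * alpha smoothing term.
    """
    ks = kmers(seq, k)
    total = 0.0
    for a, b in zip(ks, ks[1:]):
        row = counts.get(a, {})
        n = sum(row.values())
        total += math.log((row.get(b, 0) + alpha) / (n + 4 * alpha))
    return total


def lod_score(model_a, model_b, seq, k=3):
    """Log-odds of seq under model_a relative to model_b."""
    return log_prob(model_a, seq, k) - log_prob(model_b, seq, k)


if __name__ == "__main__":
    # Toy sequences, not real 16S rRNA data.
    taxon_a = train_markov(["ACGTACGTACGT"])
    taxon_b = train_markov(["AAAATTTTAAAATTTT"])
    print(lod_score(taxon_a, taxon_b, "ACGTACGT"))
```

A positive score means the query is better explained by the first model than by the second. Turning such a score into a statistically qualified decision, as the summary describes, would additionally require the significance framework from the article.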