Analyzing taxonomic classification using extensible Markov models

Bibliographic Details
Published in	Bioinformatics Vol. 26; no. 18; pp. 2235 - 2241
Main Authors	Kotamarti, Rao M., Hahsler, Michael, Raiford, Douglas, McGee, Monnie, Dunham, Margaret H.
Format	Journal Article
Language	English
Published	Oxford: Oxford University Press, 15.09.2010
Subjects	Algorithms; Biological and medical sciences; Escherichia coli; Fundamental and applied biological sciences. Psychology; General aspects; Markov Chains; Mathematics in biology. Statistical analysis. Models. Metrology. Data processing in biology (general aspects); Models, Biological; Phylogeny; Proteobacteria - classification; RNA, Ribosomal, 16S; Sequence Alignment; Classification; Markov model
ISSN	1367-4803, 1367-4811, 1460-2059
DOI	10.1093/bioinformatics/btq349

More Information
Summary:	Motivation: As next-generation sequencing rapidly adds new genomes, their correct placement in the taxonomy needs verification. However, current methods for confirming the classification of a taxon, or for suggesting a revision of a potential misplacement, rely on computationally intensive multi-sequence alignment followed by iterative adjustment of the distance matrix. Because of intra-heterogeneity issues with the 16S rRNA marker, no classifier is available at the sub-genus level that could readily suggest a classification for a novel 16S rRNA sequence. Metagenomics further complicates the issue by generating fragmented 16S rRNA sequences. This article proposes a novel alignment-free method for representing microbial profiles using extensible Markov models (EMMs), with an extended Karlin–Altschul statistical framework similar to that of the classic alignment paradigm. We propose a log-odds (LOD) score classifier, based on the Gumbel difference distribution, that confirms correct classifications with statistical significance qualifications and suggests revisions where necessary.
Results: We tested our method by generating a sub-genus level classifier, with which we re-evaluated the classifications of 676 microbial organisms using 16S rRNA sequences from the NCBI FTP database. The results confirm the current classification for all genera at the 95% significance level. Furthermore, this novel classifier isolates heterogeneity issues to a mere 12 strains while confirming classifications, with significance qualification, for the remaining 98%. The models require less memory than multi-sequence alignments and have better time complexity than the current methods. The classifier operates at the sub-genus level and thus outperforms the naive Bayes classifier of the Ribosomal Database Project, where much of the taxonomic analysis is available online. Finally, using information redundancy in model building, we show that the method applies to metagenomic fragment classification of 19 Escherichia coli strains.
Availability and implementation: Source code and binaries are freely available for download at http://lyle.smu.edu/IDA/EMMSA/, implemented in Java and supported on MS Windows.
Contact: mallik@kotamarti.com; mhd@lyle.smu.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
Bibliography:	ArticleID: btq349; istex: EBAE57D4525D412FFE083A49DA5C05D948C24DEC; ark:/67375/HXZ-XK4F5W16-6; Associate Editor: Burkhard Rost