Fast Statistical Alignment

Bibliographic Details
Published in	PLoS computational biology Vol. 5; no. 5; p. e1000392
Main Authors	Bradley, Robert K., Roberts, Adam, Smoot, Michael, Juvekar, Sudeep, Do, Jaeyoung, Dewey, Colin, Holmes, Ian, Pachter, Lior
Format	Journal Article
Language	English
Published	United States: Public Library of Science (PLoS), 01.05.2009
Subjects	Accuracy; Algorithms; Amino Acid Sequence; Animals; Annealing; Applications software; Artificial Intelligence; Base Sequence; Bioinformatics; Computational Biology/Comparative Sequence Analysis; Computer Science; Data Interpretation, Statistical; Databases, Genetic; Editing; Genetic algorithms; Genetics and Genomics/Bioinformatics; Genomes; Humans; Markov Chains; Mathematics; Models, Genetic; Molecular Sequence Data; Parameter estimation; Phylogenetics; Proteins; Sensitivity and Specificity; Sequence Alignment; Sequence Alignment - methods; Sequence Analysis; Software; Statistical models; Studies; Teaching methods; United States
ISSN	1553-734X, 1553-7358
DOI	10.1371/journal.pcbi.1000392

Summary:	We describe a new program for the alignment of multiple biological sequences that is both statistically motivated and fast enough for problem sizes that arise in practice. Our Fast Statistical Alignment (FSA) program is based on pair hidden Markov models, which approximate an insertion/deletion process on a tree, and uses a sequence annealing algorithm to combine the posterior probabilities estimated from these models into a multiple alignment. FSA uses its explicit statistical model to produce multiple alignments that are accompanied by estimates of the alignment accuracy and uncertainty for every column and character of the alignment (previously available only with alignment programs that use computationally expensive Markov chain Monte Carlo approaches), yet it can align thousands of long sequences. Moreover, FSA utilizes an unsupervised query-specific learning procedure for parameter estimation, which leads to improved accuracy on benchmark reference alignments in comparison to existing programs. The centroid alignment approach taken by FSA, in combination with its learning procedure, drastically reduces the amount of false-positive alignment on biological data in comparison to that given by other methods. The FSA program and a companion visualization tool for exploring uncertainty in alignments can be used via a web interface at http://orangutan.math.berkeley.edu/fsa/, and the source code is available at http://fsa.sourceforge.net/.
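The idea of combining posterior match probabilities into an alignment can be illustrated with a toy example. The sketch below is not FSA's implementation: it handles only two sequences, assumes the posterior probabilities are already given (in FSA they would come from a pair hidden Markov model), and uses a hypothetical function name `greedy_anneal` and a hypothetical acceptance `threshold`. It accepts character matches greedily in order of decreasing probability, skipping any match that would conflict with one already accepted.

```python
from typing import Dict, List, Tuple

Pair = Tuple[int, int]  # (position in sequence x, position in sequence y)


def greedy_anneal(
    posteriors: Dict[Pair, float], threshold: float = 0.5
) -> List[Tuple[Pair, float]]:
    """Toy two-sequence annealing (illustrative only, not FSA's algorithm).

    `posteriors` maps a pair of positions to an assumed posterior
    probability that the two characters are aligned.
    """
    accepted: List[Tuple[Pair, float]] = []
    # Visit candidate matches from most to least probable.
    for (i, j), p in sorted(posteriors.items(), key=lambda kv: -kv[1]):
        if p <= threshold:
            break  # remaining candidates are too uncertain to align
        # A new match is consistent only if it lies strictly before or
        # strictly after every accepted match in both sequences; this
        # rules out reusing a character and crossing matches.
        consistent = all(
            (i < a and j < b) or (i > a and j > b) for (a, b), _ in accepted
        )
        if consistent:
            accepted.append(((i, j), p))
    return sorted(accepted)


# Made-up probabilities for illustration.
example = {(0, 0): 0.9, (1, 1): 0.8, (2, 1): 0.7, (0, 1): 0.6, (2, 2): 0.4}
for pair, p in greedy_anneal(example):
    print(pair, p)
```

Each accepted match keeps its probability, which is a simple stand-in for the per-character confidence estimates the summary mentions. Leaving low-probability characters unaligned, rather than forcing a match, is one way to see why such an approach can reduce false-positive alignment.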
Bibliography:	Wrote the paper: RKB CD LP. Led the development of FSA, wrote most of the code base, and developed the query-specific learning method: RKB. Redesigned the sequence annealing algorithm, constituted the core development team, and managed the project: RKB CD LP. Developed the GUI: AR. Developed a preliminary version of the GUI: MS. Developed the iterative refinement technique: SJ. Developed the parallelization and database modes: JD CD. Provided advice on the dart library, including its algorithms, programming and software components: IH. Created the FSA webserver: LP.