A generic motif discovery algorithm for sequential data

Bibliographic Details
Published in Bioinformatics Vol. 22; no. 1; pp. 21-28
Main Authors Jensen, Kyle L., Styczynski, Mark P., Rigoutsos, Isidore, Stephanopoulos, Gregory N.
Format Journal Article
Language English
Published Oxford Oxford University Press 01.01.2006
Oxford Publishing Limited (England)
ISSN 1367-4803
1367-4811
1460-2059
DOI 10.1093/bioinformatics/bti745

Summary:
Motivation: Motif discovery in sequential data is a problem of great interest with many applications. However, previous methods have been unable to combine exhaustive search with complex motif representations, and each is typically applicable only to a certain class of problems.
Results: Here we present a generic motif discovery algorithm (Gemoda) for sequential data. Gemoda can be applied to any dataset with a sequential character, including both categorical and real-valued data. As we show, Gemoda deterministically discovers motifs that are maximal in composition and length. The algorithm also allows any choice of similarity metric for finding motifs. Finally, Gemoda's output motifs are representation-agnostic: they can be represented using regular expressions, position weight matrices or any number of other models for any type of sequential data. We demonstrate a number of applications of the algorithm, including the discovery of motifs in amino acid sequences, a new solution to the (l,d)-motif problem in DNA sequences and the discovery of conserved protein substructures.
Availability: Gemoda is freely available at
Contact: gregstep@mit.edu
Supplementary Information: Available at
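Note on the (l,d)-motif problem named in the summary: it is commonly stated as finding a length-l pattern that occurs in every input sequence with at most d mismatches. The brute-force sketch below illustrates that problem statement only. It is not Gemoda's algorithm, and the function names and toy sequences are hypothetical, chosen for illustration.

```python
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(seq, motif, d):
    """True if some window of seq differs from motif in at most d positions."""
    l = len(motif)
    return any(hamming(seq[i:i + l], motif) <= d
               for i in range(len(seq) - l + 1))


def ld_motifs(seqs, l, d, alphabet="ACGT"):
    """Brute force: all length-l strings found in every sequence with <= d mismatches.

    Enumerates len(alphabet)**l candidates, so it is only practical for small l.
    """
    candidates = ("".join(p) for p in product(alphabet, repeat=l))
    return [m for m in candidates if all(occurs_within(s, m, d) for s in seqs)]


# Toy input (hypothetical): each sequence holds a copy of ACGT with <= 1 change.
seqs = ["TTACGTAA", "GGACCTCC", "CAACGAGT"]
motifs = ld_motifs(seqs, l=4, d=1)
```

Exhaustive enumeration grows exponentially in l, which is one reason the problem is used as a benchmark for motif discovery methods.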
To whom correspondence should be addressed.
Associate Editor: Keith A Crandall