Encoded Expansion: An Efficient Algorithm to Discover Identical String Motifs

Bibliographic Details
Published in	PloS one Vol. 9; no. 5; p. e95148
Main Authors	Azmi, Aqil M., Al-Ssulami, Abdulrakeeb
Format	Journal Article
Language	English
Published	United States: Public Library of Science (PLoS), 28.05.2014
Subjects	Algorithms; Analysis; Base Sequence - genetics; Binding sites; Binding Sites - genetics; Bioinformatics; Biological effects; Biology; Biology and Life Sciences; Coding; Combinatorial analysis; Complexity; Computational Biology - methods; Computer and Information Sciences; Computer applications; Computer science; Data structures; Deoxyribonucleic acid; DNA; Expert systems; Information science; Iterative methods; Physical Sciences; Sequence Analysis, DNA - methods; Stochasticity; Time Factors; Transcription factors; Saudi Arabia; Riyadh Saudi Arabia
ISSN	1932-6203
DOI	10.1371/journal.pone.0095148

More Information
Summary:	A major task in computational biology is the discovery of short recurring string patterns known as motifs. Most schemes for discovering motifs are either stochastic or combinatorial in nature. Stochastic approaches do not guarantee finding the correct motifs, while combinatorial schemes tend to have exponential time complexity with respect to motif length. To alleviate the cost, the combinatorial approach exploits dynamic data structures such as trees or graphs. Recently, Karci (2009; Efficient automatic exact motif discovery algorithms for biological sequences, Expert Systems with Applications 36:7952-7963) devised a deterministic algorithm that finds all identical copies of string motifs of all sizes [Formula: see text] in theoretical time complexity of [Formula: see text] and space complexity of [Formula: see text], where [Formula: see text] is the length of the input sequence and [Formula: see text] is the length of the longest possible string motif. In this paper, we present a significant improvement on Karci's original algorithm. The algorithm that we propose reports all identical string motifs of sizes [Formula: see text] that occur at least [Formula: see text] times. Our algorithm starts with string motifs of size 2, and at each iteration it expands the candidate string motifs by one symbol, discarding those that occur fewer than [Formula: see text] times in the entire input sequence. We use a simple array and data encoding to achieve theoretical worst-case time complexity of [Formula: see text] and space complexity of [Formula: see text]. Encoding the substrings can speed up comparison between string motifs. Experimental results on random and real biological sequences confirm that our algorithm does indeed have linear time complexity and is more scalable in terms of sequence length than the existing algorithms.
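The expand-and-prune loop described in the summary can be illustrated with a minimal Python sketch. It is not the authors' implementation and makes no attempt to reproduce the complexity bounds stated above. The function name `find_repeated_motifs`, its parameters, the base-4 integer encoding of the DNA alphabet, and the use of dictionaries in place of the paper's simple array are all assumptions made for illustration. Occurrences are counted with overlaps allowed, which is also an assumption.

```python
from collections import defaultdict


def find_repeated_motifs(seq, min_count=2, max_len=None, alphabet="ACGT"):
    """Return {length: {motif: [start positions]}} for exact repeats.

    Hypothetical helper: starts from substrings of size 2 and grows each
    surviving candidate by one symbol per iteration.
    """
    rank = {ch: i for i, ch in enumerate(alphabet)}  # symbol -> digit
    base = len(alphabet)
    n = len(seq)
    if max_len is None:
        max_len = n

    # Seed: integer code of every length-2 substring, keyed by start position.
    # Python integers are unbounded, so long motifs do not overflow here.
    codes = {p: rank[seq[p]] * base + rank[seq[p + 1]] for p in range(n - 1)}

    result = {}
    k = 2  # current motif length
    while codes and k <= max_len:
        # Group start positions by encoded value; equal codes mean equal strings.
        groups = defaultdict(list)
        for p, code in codes.items():
            groups[code].append(p)

        # Prune candidates that occur fewer than min_count times.
        kept = {c: ps for c, ps in groups.items() if len(ps) >= min_count}
        if not kept:
            break
        result[k] = {seq[ps[0]:ps[0] + k]: ps for ps in kept.values()}

        # Expand each surviving occurrence by one symbol, if one is available.
        codes = {
            p: c * base + rank[seq[p + k]]
            for c, ps in kept.items()
            for p in ps
            if p + k < n
        }
        k += 1
    return result


if __name__ == "__main__":
    for length, motifs in find_repeated_motifs("ACGTACGTAC").items():
        print(length, motifs)
```

Comparing integer codes rather than substrings is the point of the encoding step: extending a candidate by one symbol costs one multiplication and one addition, whatever the current motif length.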
Bibliography:	Competing Interests: The authors have declared that no competing interests exist. Conceived and designed the experiments: AAS AMA. Performed the experiments: AAS. Analyzed the data: AAS AMA. Contributed reagents/materials/analysis tools: AAS. Wrote the paper: AMA.