Fast, Practical Algorithms for Computing All the Repeats in a String

Bibliographic Details
Published in: Mathematics in Computer Science, Vol. 3, no. 4, pp. 373–389
Main Authors: Puglisi, Simon J.; Smyth, W. F.; Yusufu, Munina
Format: Journal Article
Language: English
Published: Basel, Birkhäuser-Verlag, 01.06.2010
ISSN: 1661-8270, 1661-8289
DOI: 10.1007/s11786-010-0033-6

Summary: Given a string x = x[1..n] on an alphabet of size α, and a threshold p_min ≥ 1, we describe four variants of an algorithm PSY1 that, using a suffix array, computes all the complete nonextendible repeats in x of length p ≥ p_min. The basic algorithm PSY1–1 and its simple extension PSY1–2 are fast on strings that occur in biological, natural language and other applications (not highly periodic strings), while PSY1–3 guarantees Θ(n) worst-case execution time. The final variant, PSY1–4, also achieves Θ(n) processing time and, over the complete range of strings tested, is the fastest of the four. The space requirement of all four algorithms is about 5n bytes, but all make use of the "longest common prefix" (LCP) array, whose construction requires about 6n bytes. The four algorithms are faster in applications and use less space than a recently proposed algorithm (Narisawa in Proceedings of 18th Annual Symposium on Combinatorial Pattern Matching, pp. 340–351, 2007) that produces equivalent output. The suffix array is not explicitly used by algorithms PSY1, but may be required for postprocessing; in this case, storage requirements rise to 9n bytes. We also describe two variants of a fast Θ(n)-time algorithm PSY2 for computing all complete supernonextendible repeats in x.
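
To make the objects named in the summary concrete, the following is a small illustrative sketch, not any of the PSY1 variants. It assumes the usual reading of the terms, which this record does not spell out: a repeat is "complete" if it lists every occurrence of the repeated substring, and "nonextendible" if its occurrences are neither all preceded by the same letter nor all followed by the same letter. The function names are hypothetical, positions are 0-based (the summary uses x[1..n]), and the running time is quadratic or worse rather than Θ(n).

```python
def suffix_array(x):
    # Naive construction by sorting suffixes; for illustration only.
    return sorted(range(len(x)), key=lambda i: x[i:])


def lcp_array(x, sa):
    # Kasai-style computation: lcp[r] is the length of the longest common
    # prefix of the suffixes at ranks r-1 and r; lcp[0] is 0.
    n = len(x)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and x[i + h] == x[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def nonextendible_repeats(x, p_min=1):
    # Returns sorted (p, positions) pairs under the assumed definitions.
    n = len(x)
    sa = suffix_array(x)
    lcp = lcp_array(x, sa)
    found = set()
    for r in range(1, n):
        p = lcp[r]
        if p < p_min:
            continue
        # Widen to the maximal rank interval whose suffixes share p letters.
        # It contains every occurrence (complete), and because lcp[r] == p
        # the occurrences cannot all be followed by the same letter.
        lo = r - 1
        while lo > 0 and lcp[lo] >= p:
            lo -= 1
        hi = r
        while hi + 1 < n and lcp[hi + 1] >= p:
            hi += 1
        positions = tuple(sorted(sa[lo:hi + 1]))
        # Left check: None stands for "no preceding letter" at position 0.
        left = {x[i - 1] if i > 0 else None for i in positions}
        if len(left) > 1:
            found.add((p, positions))
    return sorted(found)
```

For example, under these assumptions `nonextendible_repeats("abcabc")` reports only the repeat of length 3 at positions 0 and 3, because the shorter repeats "bc" and "c" are always preceded by the same letter and so can be extended to the left.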