Unified Compression-Based Acceleration of Edit-Distance Computation

The edit distance problem is a classical fundamental problem in computer science in general, and in combinatorial pattern matching in particular. The standard dynamic programming solution for this problem computes the edit-distance between a pair of strings of total length O ( N ) in O ( N 2 ) time....

Full description

Saved in:
Bibliographic Details
Published inAlgorithmica Vol. 65; no. 2; pp. 339 - 353
Main Authors Hermelin, Danny, Landau, Gad M., Landau, Shir, Weimann, Oren
Format Journal Article
LanguageEnglish
Published New York Springer-Verlag 01.02.2013
Springer
Subjects
Online AccessGet full text
ISSN0178-4617
1432-0541
DOI10.1007/s00453-011-9590-6

Cover

More Information
Summary:The edit distance problem is a classical fundamental problem in computer science in general, and in combinatorial pattern matching in particular. The standard dynamic programming solution for this problem computes the edit-distance between a pair of strings of total length O ( N ) in O ( N 2 ) time. To this date, this quadratic upper-bound has never been substantially improved for general strings. However, there are known techniques for breaking this bound in case the strings are known to compress well under a particular compression scheme. The basic idea is to first compress the strings, and then to compute the edit distance between the compressed strings. As it turns out, practically all known o ( N 2 ) edit-distance algorithms work, in some sense, under the same paradigm described above. It is therefore natural to ask whether there is a single edit-distance algorithm that works for strings which are compressed under any compression scheme. A rephrasing of this question is to ask whether a single algorithm can exploit the compressibility properties of strings under any compression method, even if each string is compressed using a different compression. In this paper we set out to answer this question by using straight line programs . These provide a generic platform for representing many popular compression schemes including the LZ-family, Run-Length Encoding, Byte-Pair Encoding, and dictionary methods. For two strings of total length N having straight-line program representations of total size n , we present an algorithm running in O ( nN lg( N / n )) time for computing the edit-distance of these two strings under any rational scoring function, and an O ( n 2/3 N 4/3 ) time algorithm for arbitrary scoring functions. Our new result, while providing a speed up for compressible strings, does not surpass the quadratic time bound even in the worst case scenario.
ISSN:0178-4617
1432-0541
DOI:10.1007/s00453-011-9590-6