Unified Compression-Based Acceleration of Edit-Distance Computation
The edit distance problem is a classical fundamental problem in computer science in general, and in combinatorial pattern matching in particular. The standard dynamic programming solution for this problem computes the edit-distance between a pair of strings of total length O ( N ) in O ( N 2 ) time....
        Saved in:
      
    
          | Published in | Algorithmica Vol. 65; no. 2; pp. 339 - 353 | 
|---|---|
| Main Authors | , , , | 
| Format | Journal Article | 
| Language | English | 
| Published | 
        New York
          Springer-Verlag
    
        01.02.2013
     Springer  | 
| Subjects | |
| Online Access | Get full text | 
| ISSN | 0178-4617 1432-0541  | 
| DOI | 10.1007/s00453-011-9590-6 | 
Cover
| Summary: | The edit distance problem is a classical fundamental problem in computer science in general, and in combinatorial pattern matching in particular. The standard dynamic programming solution for this problem computes the edit-distance between a pair of strings of total length
O
(
N
) in
O
(
N
2
) time. To this date, this quadratic upper-bound has never been substantially improved for general strings. However, there are known techniques for breaking this bound in case the strings are known to compress well under a particular compression scheme. The basic idea is to first compress the strings, and then to compute the edit distance between the compressed strings.
As it turns out, practically all known
o
(
N
2
) edit-distance algorithms work, in some sense, under the same paradigm described above. It is therefore natural to ask whether there is a single edit-distance algorithm that works for strings which are compressed under any compression scheme. A rephrasing of this question is to ask whether a single algorithm can exploit the compressibility properties of strings under any compression method, even if each string is compressed using a different compression. In this paper we set out to answer this question by using
straight line programs
. These provide a generic platform for representing many popular compression schemes including the LZ-family, Run-Length Encoding, Byte-Pair Encoding, and dictionary methods.
For two strings of total length
N
having straight-line program representations of total size
n
, we present an algorithm running in
O
(
nN
lg(
N
/
n
)) time for computing the edit-distance of these two strings under any rational scoring function, and an
O
(
n
2/3
N
4/3
) time algorithm for arbitrary scoring functions. Our new result, while providing a speed up for compressible strings, does not surpass the quadratic time bound even in the worst case scenario. | 
|---|---|
| ISSN: | 0178-4617 1432-0541  | 
| DOI: | 10.1007/s00453-011-9590-6 |