Indexing compressed text
We design two compressed data structures for the full-text indexing problem that support efficient substring searches using roughly the space required for storing the text in compressed form.Our first compressed data structure retrieves the occ occurrences of a pattern P [1, p ] within a text T [1,...
Saved in:
| Published in | Journal of the ACM Vol. 52; no. 4; pp. 552 - 581 |
|---|---|
| Main Authors | , |
| Format | Journal Article |
| Language | English |
| Published |
New York, NY
Association for Computing Machinery
01.07.2005
|
| Subjects | |
| Online Access | Get full text |
| ISSN | 0004-5411 1557-735X |
| DOI | 10.1145/1082036.1082039 |
Cover
| Summary: | We design two compressed data structures for the full-text indexing problem that support efficient substring searches using roughly the space required for storing the text in compressed form.Our first compressed data structure retrieves the occ occurrences of a pattern P [1, p ] within a text T [1, n ] in O ( p + occ log 1+ε n ) time for any chosen ε, 0<ε<1. This data structure uses at most 5 n H k ( T ) + o ( n ) bits of storage, where H k ( T ) is the k th order empirical entropy of T . The space usage is Θ( n ) bits in the worst case and o ( n ) bits for compressible texts. This data structure exploits the relationship between suffix arrays and the Burrows--Wheeler Transform, and can be regarded as a compressed suffix array .Our second compressed data structure achieves O ( p + occ ) query time using O ( n H k ( T )log ε n ) + o ( n ) bits of storage for any chosen ε, 0<ε<1. Therefore, it provides optimal output-sensitive query time using o ( n log n ) bits in the worst case. This second data structure builds upon the first one and exploits the interplay between two compressors: the Burrows--Wheeler Transform and the LZ78 algorithm. |
|---|---|
| Bibliography: | SourceType-Scholarly Journals-1 ObjectType-Feature-1 content type line 14 ObjectType-Article-1 ObjectType-Feature-2 content type line 23 ObjectType-Article-2 |
| ISSN: | 0004-5411 1557-735X |
| DOI: | 10.1145/1082036.1082039 |