Implicit indexing of natural language text by reorganizing bytecodes
| Published in | Information retrieval (Boston) Vol. 15; no. 6; pp. 527 - 557 |
|---|---|
| Main Authors | , , , |
| Format | Journal Article |
| Language | English |
| Published | Dordrecht; Springer Netherlands; 01.12.2012; Springer; Springer Nature B.V |
| ISSN | 1386-4564, 1573-7659 |
| DOI | 10.1007/s10791-012-9184-1 |
| Summary: | Word-based byte-oriented compression has succeeded on large natural language text databases, by providing competitive compression ratios, fast random access, and direct sequential searching. We show that by just rearranging the target symbols of the compressed text into a tree-shaped structure, and using negligible additional space, we obtain a new implicitly indexed representation of the compressed text, where search times are drastically improved. The occurrences of a word can be listed directly, without any text scanning, and in general any inverted-index-like capability, such as efficient phrase searches, can be emulated without storing any inverted list information. We experimentally show that our proposal performs not only much more efficiently than sequential searches over compressed text, but also than explicit inverted indexes and other types of indexes, when using little extra space. Our representation is especially successful when searching for single words and short phrases. |
|---|---|
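
The summary's central idea, rearranging the symbols of each codeword into a tree so that word occurrences can be listed without scanning, can be illustrated with a small sketch. This is not the authors' implementation; it is one possible reading under stated assumptions. The toy code uses 4 symbol values (2 "stoppers" that end a codeword, 2 "continuers") instead of real byte values, `rank` and `select` are naive linear scans rather than compact structures, and every function name here is hypothetical.

```python
from collections import Counter, defaultdict

S, C = 2, 2  # assumption: 2 stopper values (0, 1) and 2 continuer values (2, 3)

def codeword(r):
    """Toy dense prefix-free code: a run of continuers ended by one stopper."""
    code = [r % S]
    x = r // S
    while x > 0:
        x -= 1
        code.append(S + x % C)
        x //= C
    return tuple(reversed(code))

def build_tree(codes):
    """The node for prefix p stores, in text order, the next symbol of
    every codeword that starts with p. The root is the empty prefix."""
    nodes = defaultdict(list)
    for code in codes:
        for depth, sym in enumerate(code):
            nodes[code[:depth]].append(sym)
    return dict(nodes)

def rank(seq, sym, i):
    """Number of occurrences of sym in seq[:i] (naive scan)."""
    return seq[:i].count(sym)

def select(seq, sym, j):
    """Index of the j-th (1-based) occurrence of sym in seq (naive scan)."""
    seen = 0
    for idx, value in enumerate(seq):
        if value == sym:
            seen += 1
            if seen == j:
                return idx
    raise ValueError("fewer than j occurrences")

def locate(nodes, code):
    """Word positions of a codeword, found by climbing from its leaf to the root."""
    leaf = nodes[code[:-1]]
    positions = []
    for j in range(1, leaf.count(code[-1]) + 1):
        idx = select(leaf, code[-1], j)
        for depth in range(len(code) - 2, -1, -1):
            idx = select(nodes[code[:depth]], code[depth], idx + 1)
        positions.append(idx)
    return positions

def decode(nodes, i):
    """Codeword of the word at text position i, found by descending from the root."""
    prefix = ()
    while True:
        sym = nodes[prefix][i]
        if sym < S:  # a stopper ends the codeword
            return prefix + (sym,)
        i = rank(nodes[prefix], sym, i)
        prefix += (sym,)

text = "the cat and the dog and the bird saw the cat".split()
freq = Counter(text)
# More frequent words get shorter codewords; ties are broken alphabetically.
ordered = sorted(freq, key=lambda w: (-freq[w], w))
vocab = {w: codeword(r) for r, w in enumerate(ordered)}
words_by_code = {c: w for w, c in vocab.items()}
tree = build_tree([vocab[w] for w in text])

print(locate(tree, vocab["cat"]))          # [1, 10]
print(words_by_code[decode(tree, 7)])      # bird
```

The tree holds exactly the same symbols as the sequentially compressed text, only regrouped by codeword prefix, which is why no inverted lists are stored in this sketch: `locate` and `decode` need nothing beyond `rank` and `select` over the node sequences. A practical version would replace the linear scans with compact rank/select structures.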