Implicit indexing of natural language text by reorganizing bytecodes


Bibliographic Details
Published in: Information retrieval (Boston) Vol. 15; no. 6; pp. 527-557
Main Authors: Brisaboa, Nieves R.; Fariña, Antonio; Ladra, Susana; Navarro, Gonzalo
Format: Journal Article
Language: English
Published: Dordrecht, Springer Netherlands, 01.12.2012
Springer
Springer Nature B.V
ISSN: 1386-4564, 1573-7659
DOI: 10.1007/s10791-012-9184-1

Summary: Word-based byte-oriented compression has succeeded on large natural language text databases, by providing competitive compression ratios, fast random access, and direct sequential searching. We show that by just rearranging the target symbols of the compressed text into a tree-shaped structure, and using negligible additional space, we obtain a new implicitly indexed representation of the compressed text, where search times are drastically improved. The occurrences of a word can be listed directly, without any text scanning, and in general any inverted-index-like capability, such as efficient phrase searches, can be emulated without storing any inverted list information. We experimentally show that our proposal performs not only much more efficiently than sequential searches over compressed text, but also than explicit inverted indexes and other types of indexes, when using little extra space. Our representation is especially successful when searching for single words and short phrases.
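
Illustration (not part of the catalogued record): the idea in the summary can be sketched on a toy scale. The Python below is a minimal sketch under stated assumptions, not the article's implementation. It assumes a tiny stopper/continuer code with a 4-symbol alphabet instead of real bytes, and it uses linear scans where a practical structure would need fast rank and select. All names (`codeword`, `build`, `locate`, `word_at`) are hypothetical. Each tree node holds, in text order, the next codeword symbol of every word whose codeword starts with that node's prefix, so the occurrences of a word are found by walking from its deepest node up to the root rather than by scanning the text.

```python
from collections import Counter, defaultdict

# Assumed toy alphabet: symbols 0..S-1 end a codeword (stoppers),
# symbols S..S+C-1 announce that more symbols follow (continuers).
S, C = 2, 2


def codeword(rank):
    """Prefix-free code for a frequency rank: continuers, then one stopper."""
    symbols = [rank % S]              # the last symbol is always a stopper
    rank //= S
    while rank > 0:
        rank -= 1
        symbols.append(S + rank % C)  # earlier symbols are continuers
        rank //= C
    return tuple(reversed(symbols))


def build(words):
    """Assign codewords by frequency and spread their symbols over tree nodes."""
    freq = Counter(words)
    vocab = sorted(freq, key=lambda w: (-freq[w], w))
    code = {w: codeword(r) for r, w in enumerate(vocab)}
    nodes = defaultdict(list)         # codeword prefix -> symbols in text order
    for w in words:
        cw = code[w]
        for depth, sym in enumerate(cw):
            nodes[cw[:depth]].append(sym)
    return code, dict(nodes)


def select(seq, sym, k):
    """Position of the k-th (1-based) occurrence of sym in seq (linear scan)."""
    seen = 0
    for pos, s in enumerate(seq):
        if s == sym:
            seen += 1
            if seen == k:
                return pos
    raise ValueError("fewer than k occurrences")


def count(code, nodes, word):
    """Number of occurrences of word, read off its deepest node."""
    cw = code[word]
    return nodes[cw[:-1]].count(cw[-1])


def locate(code, nodes, word, k):
    """0-based text position of the k-th occurrence of word, walking leaf to root."""
    cw = code[word]
    pos = k - 1
    for depth in range(len(cw) - 1, -1, -1):
        pos = select(nodes[cw[:depth]], cw[depth], pos + 1)
    return pos


def word_at(code, nodes, i):
    """Decode the word at text position i, walking root to leaf."""
    decode = {cw: w for w, cw in code.items()}
    prefix = ()
    while True:
        sym = nodes[prefix][i]
        if sym < S:                   # a stopper completes the codeword
            return decode[prefix + (sym,)]
        # The rank of this continuer gives the position in the child node.
        i = nodes[prefix][: i + 1].count(sym) - 1
        prefix += (sym,)


text = "to be or not to be that is the question to be".split()
code, nodes = build(text)
print(count(code, nodes, "be"), locate(code, nodes, "to", 3), word_at(code, nodes, 8))
```

In this sketch the only data stored are the rearranged codeword symbols themselves plus the vocabulary, which mirrors the summary's point that no inverted lists are kept.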