Efficient construction of the BWT for repetitive text using string compression

Bibliographic Details
Published in: Information and Computation, Vol. 294; p. 105088
Main Authors: Díaz-Domínguez, Diego; Navarro, Gonzalo
Format: Journal Article
Language: English
Published: Elsevier Inc, 01.10.2023
ISSN: 0890-5401, 1090-2651
DOI: 10.1016/j.ic.2023.105088

Summary: We present a new semi-external algorithm that builds the Burrows–Wheeler transform variant of Bauer et al. (a.k.a., BCR BWT) in linear expected time. Our method uses compression techniques to reduce computational costs when the input is massive and repetitive. Concretely, we build on induced suffix sorting (ISS) and resort to run-length and grammar compression to maintain our intermediate results in compact form. Our compression format not only saves space but also speeds up the required computations. Our experiments show important space and computation time savings when the text is repetitive. In moderate-size collections of real human genome assemblies (14.2 GB to 75.05 GB), our memory peak is, on average, 1.7x smaller than the peak of the state-of-the-art BCR BWT construction algorithm (ropebwt2), while running 5x faster. Our current implementation was also able to compute the BCR BWT of 400 real human genome assemblies (1.2 TB) in 41.21 hours using 118.83 GB of working memory (around 10% of the input size). Interestingly, the results we report in the 1.2 TB file are dominated by the difficulties of scanning huge files under memory constraints (specifically, I/O operations). This fact indicates we can perform much better with a more careful implementation of our method, thus scaling to even bigger sizes efficiently.

• We introduce a new algorithm to build the Burrows-Wheeler Transform on massive and highly repetitive text collections.
• We build on Induced Suffix Sorting and use grammar compression to store intermediate results.
• Our experiments demonstrate that our particular format saves significant space and computation time.
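
As a rough illustration of why run-length compression pays off on repetitive input, the toy sketch below builds the BWT of a small string collection by naively sorting all suffixes and then run-length encodes the result. It is not the algorithm of the article: it uses neither induced suffix sorting nor grammar compression, and it takes quadratic time. The function names `naive_collection_bwt` and `run_length_encode` are hypothetical, and the convention that each string gets its own terminator, ordered by the string's position in the collection, is an assumption about the BCR variant.

```python
from itertools import groupby


def naive_collection_bwt(strings):
    """Naive BWT of a string collection by sorting every suffix (toy sizes only)."""
    entries = []
    for i, s in enumerate(strings):
        for j in range(len(s) + 1):
            # Encode the suffix s[j:] followed by a per-string terminator.
            # Symbols become (1, c) and the terminator becomes (0, i), so a
            # terminator sorts before any symbol and ties between equal
            # suffixes of different strings are broken by the string index i.
            key = [(1, c) for c in s[j:]] + [(0, i)]
            # The BWT symbol is the character preceding the suffix; the
            # whole string is preceded (cyclically) by its own terminator.
            prev = s[j - 1] if j > 0 else "$"
            entries.append((key, prev))
    entries.sort(key=lambda entry: entry[0])
    return "".join(prev for _, prev in entries)


def run_length_encode(text):
    """Collapse each maximal run of equal symbols into a (symbol, length) pair."""
    return [(symbol, len(list(group))) for symbol, group in groupby(text)]


if __name__ == "__main__":
    # Four identical copies stand in for a highly repetitive collection.
    collection = ["GATTACA"] * 4
    bwt = naive_collection_bwt(collection)
    runs = run_length_encode(bwt)
    print(len(bwt), "symbols stored as", len(runs), "runs")
```

With identical copies, equal suffixes from different strings sort next to each other and share the same preceding symbol, so the number of runs stays at that of a single copy while the BWT length grows with the number of copies.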