Generic Non-recursive Suffix Array Construction

Bibliographic Details
Published in: ACM Transactions on Algorithms, Vol. 20, no. 2, pp. 1-42
Main Authors: Olbrich, Jannik; Ohlebusch, Enno; Büchler, Thomas
Format: Journal Article
Language: English
Published: New York, NY: ACM, 13.04.2024
ISSN: 1549-6325, 1549-6333
DOI: 10.1145/3641854

Summary: The suffix array is arguably one of the most important data structures in sequence analysis and consequently there is a multitude of suffix sorting algorithms. However, to this date the GSACA algorithm introduced in 2015 is the only known non-recursive linear-time suffix array construction algorithm (SACA). Despite its interesting theoretical properties, there has been little effort in improving GSACA’s non-competitive real-world performance. There is a super-linear algorithm DSH, which relies on the same sorting principle and is faster than DivSufSort, the fastest SACA for over a decade. The purpose of this article is twofold: We analyse the sorting principle used in GSACA and DSH and exploit its properties to give an optimised linear-time algorithm, and we show that it can be very elegantly used to compute both the original extended Burrows-Wheeler transform (eBWT) and a bijective version of the Burrows-Wheeler transform (BBWT) in linear time. We call the algorithm “generic,” since it can be used to compute the regular suffix array and the variants used for the BBWT and eBWT. Our suffix array construction algorithm is not only significantly faster than GSACA but also outperforms DivSufSort and DSH. Our BBWT-algorithm is faster than or competitive with all other tested BBWT construction implementations on large or repetitive data, and our eBWT-algorithm is faster than all other programs on data that is not extremely repetitive.