Generic Non-recursive Suffix Array Construction
| Published in | ACM transactions on algorithms Vol. 20; no. 2; pp. 1 - 42 |
|---|---|
| Main Authors | , , |
| Format | Journal Article |
| Language | English |
| Published | New York, NY: ACM, 13.04.2024 |
| ISSN | 1549-6325, 1549-6333 |
| DOI | 10.1145/3641854 |
| Summary: | The suffix array is arguably one of the most important data structures in sequence analysis and consequently there is a multitude of suffix sorting algorithms. However, to this date the GSACA algorithm introduced in 2015 is the only known non-recursive linear-time suffix array construction algorithm (SACA). Despite its interesting theoretical properties, there has been little effort in improving GSACA’s non-competitive real-world performance. There is a super-linear algorithm DSH, which relies on the same sorting principle and is faster than DivSufSort, the fastest SACA for over a decade. The purpose of this article is twofold: We analyse the sorting principle used in GSACA and DSH and exploit its properties to give an optimised linear-time algorithm, and we show that it can be very elegantly used to compute both the original extended Burrows-Wheeler transform (eBWT) and a bijective version of the Burrows-Wheeler transform (BBWT) in linear time. We call the algorithm “generic,” since it can be used to compute the regular suffix array and the variants used for the BBWT and eBWT. Our suffix array construction algorithm is not only significantly faster than GSACA but also outperforms DivSufSort and DSH. Our BBWT-algorithm is faster than or competitive with all other tested BBWT construction implementations on large or repetitive data, and our eBWT-algorithm is faster than all other programs on data that is not extremely repetitive. |
|---|---|
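
For readers unfamiliar with the objects named in the summary, the following is a minimal sketch of what a suffix array is and how the ordinary Burrows-Wheeler transform can be read off it. It is not GSACA, DSH, or the article's algorithm: it sorts suffixes naively in super-linear time, and the function names `naive_suffix_array` and `bwt_from_suffix_array` are hypothetical, chosen only for this illustration. It also assumes the text ends with a unique sentinel character that is smaller than every other character.

```python
def naive_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of `text` in lexicographic order.

    Illustration only: comparing whole suffixes makes this super-linear,
    unlike the linear-time algorithms the article discusses.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_suffix_array(text: str, sa: list[int]) -> str:
    """Read the Burrows-Wheeler transform off a suffix array.

    Assumes `text` ends with a unique, smallest sentinel (here '$').
    For each sorted suffix, emit the character preceding it; for the
    suffix starting at position 0, index -1 wraps around to the sentinel.
    """
    return "".join(text[i - 1] for i in sa)


if __name__ == "__main__":
    example = "banana$"
    sa = naive_suffix_array(example)
    print(sa)                                  # [6, 5, 3, 1, 0, 4, 2]
    print(bwt_from_suffix_array(example, sa))  # annb$aa
```

The bijective and extended variants mentioned in the summary (BBWT and eBWT) use a different ordering of conjugates of strings and are not covered by this sketch.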