Computing the multi-string BWT and LCP array in external memory
| Published in | Theoretical computer science Vol. 862; pp. 42-58 |
|---|---|
| Main Authors | , , , , |
| Format | Journal Article |
| Language | English |
| Published | Elsevier B.V., 16.03.2021 |
| Subjects | |
| ISSN | 0304-3975, 1879-2294 |
| DOI | 10.1016/j.tcs.2020.11.041 |
Summary: Indexing very large collections of strings, such as those produced by widespread next-generation sequencing technologies, relies heavily on the multi-string generalization of the Burrows–Wheeler Transform (BWT). The large memory requirements of in-memory approaches have stimulated recent developments in external memory algorithms. The related problem of computing the Longest Common Prefix (LCP) array of a set of strings is instrumental in computing the suffix-prefix overlaps among strings, an essential step in many genome assembly algorithms. In a previous paper, we presented an in-memory divide-and-conquer method for building the BWT and LCP array, in which partial BWTs are merged with a forward approach to sort suffixes.

In this paper, we propose an alternative backward strategy to develop an external memory method that simultaneously builds the BWT and the LCP array of a collection of m strings of different lengths. On a set of strings of constant length k, the algorithm has O(mkl) time and I/O volume and uses O(k+m) main memory, where l is the maximum value in the LCP array.
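
To make the two structures named in the summary concrete, here is a minimal Python sketch that builds the multi-string BWT and the LCP array by naively sorting all suffixes in memory. It is not the external memory method of the paper, only an illustration of what that method outputs. The function name `multi_string_bwt_lcp` is hypothetical, and the sketch assumes one common convention: each string gets its own end marker, markers sort before every ordinary character and are ordered by string index, all markers are printed as `$`, and the first LCP entry is 0.

```python
def multi_string_bwt_lcp(strings):
    """Naive in-memory multi-string BWT and LCP array (illustration only)."""
    suffixes = []
    for idx, s in enumerate(strings):
        # Encode symbols as tuples: (1, code) for ordinary characters and
        # (0, idx) for the end marker of string idx. Markers therefore sort
        # before all characters and are distinct across strings (assumption).
        encoded = [(1, ord(c)) for c in s] + [(0, idx)]
        for start in range(len(encoded)):
            # The BWT symbol is the character preceding the suffix; the
            # suffix covering the whole string is preceded by its marker.
            prev = "$" if start == 0 else s[start - 1]
            suffixes.append((encoded[start:], prev))

    # Distinct end markers make all suffixes distinct, so the order is total.
    suffixes.sort(key=lambda item: item[0])

    bwt = "".join(prev for _, prev in suffixes)

    # LCP[i] is the length of the longest common prefix of the suffixes
    # ranked i-1 and i; LCP[0] is set to 0 by convention (assumption).
    lcp = [0]
    for (a, _), (b, _) in zip(suffixes, suffixes[1:]):
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        lcp.append(n)
    return bwt, lcp


bwt, lcp = multi_string_bwt_lcp(["ACG", "AG"])
print(bwt)  # GG$$ACA
print(lcp)  # [0, 0, 0, 1, 0, 0, 1]
```

For the two strings `ACG` and `AG`, the maximum LCP value is 1, which is the quantity l in the O(mkl) bound quoted in the summary. Sorting explicit suffixes as done here needs memory proportional to the total squared string length, which is the kind of cost external memory approaches are designed to avoid.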