Generalized enhanced suffix array construction in external memory
| Published in | Algorithms for molecular biology Vol. 12; no. 1; pp. 26 - 16 |
|---|---|
| Main Authors | , , , | 
| Format | Journal Article | 
| Language | English | 
| Published | London BioMed Central 07.12.2017 BioMed Central Ltd BMC |
| ISSN | 1748-7188 1748-7188  | 
| DOI | 10.1186/s13015-017-0117-9 | 
| Summary: | Background
Suffix arrays, augmented by additional data structures, allow many string processing problems to be solved efficiently. The external memory construction of the generalized suffix array for a string collection is a fundamental task when the size of the input collection or of the data structure exceeds the available internal memory.
Results
In this article we present and analyze eGSA [introduced in CPM (External memory generalized suffix and LCP arrays construction. In: Proceedings of CPM. pp 201–10, 2013)], the first external memory algorithm to construct generalized suffix arrays augmented with the longest common prefix array for a string collection. Our algorithm relies on a combination of buffers, induced sorting and a heap to avoid direct string comparisons. We performed experiments that covered different aspects of our algorithm, including running time, efficiency, external memory access, internal phases and the influence of different optimization strategies. On real datasets of size up to 24 GB and using 2 GB of internal memory, eGSA showed a competitive performance when compared to eSAIS and SAscan, which are efficient algorithms for a single string according to the related literature. We also show the effect of disk caching managed by the operating system on our algorithm.
Conclusions
The proposed algorithm was validated through performance tests using real datasets from different domains, in various combinations, and showed a competitive performance. Our algorithm can also construct the generalized Burrows-Wheeler transform of a string collection at no additional cost other than the output time. | 
|---|---|
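
To make the objects named in the summary concrete, the following is a minimal in-memory sketch of what a generalized suffix array, its LCP array and the generalized Burrows-Wheeler transform contain for a small string collection. It is not eGSA: it sorts suffixes by direct string comparison and keeps everything in internal memory, which is exactly what the external memory algorithm is designed to avoid. The function names, the `$` terminator symbol and the tie-breaking rule (equal suffixes ordered by string index) are assumptions made for this illustration.

```python
from typing import List, Tuple


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def naive_generalized_esa(
    strings: List[str],
) -> Tuple[List[Tuple[int, int]], List[int], str]:
    """Naive generalized suffix array, LCP array and BWT (illustration only).

    Each suffix is identified by (string index, start position). Every string
    is treated as if it ended with its own terminator that is smaller than
    any symbol; terminators of different strings are ordered by string index.
    Assumes '$' does not occur in the input strings.
    """
    # One entry per suffix, including the empty suffix (terminator only).
    positions = [(i, j) for i, s in enumerate(strings) for j in range(len(s) + 1)]

    # A shorter string sorts before any extension of it, which matches a
    # terminator smaller than all symbols; ties are broken by string index.
    positions.sort(key=lambda p: (strings[p[0]][p[1]:], p[0]))

    # LCP[k] is the common prefix length of suffixes k-1 and k in sorted
    # order, not counting terminators; LCP[0] is 0 by convention.
    lcp = [0] * len(positions)
    for k in range(1, len(positions)):
        (i1, j1), (i2, j2) = positions[k - 1], positions[k]
        lcp[k] = common_prefix_len(strings[i1][j1:], strings[i2][j2:])

    # The BWT symbol of a suffix is the symbol preceding it in its string,
    # or the terminator when the suffix starts at position 0.
    bwt = "".join(strings[i][j - 1] if j > 0 else "$" for i, j in positions)
    return positions, lcp, bwt


if __name__ == "__main__":
    gsa, lcp, bwt = naive_generalized_esa(["ab", "b"])
    print(gsa)
    print(lcp)
    print(bwt)
```

Because the BWT symbol is read directly from each suffix array entry, this sketch also shows why the transform can be emitted alongside the suffix array with little extra work beyond writing the output.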