streammd: fast low-memory duplicate marking using a Bloom filter
| Published in | bioRxiv |
|---|---|
| Main Author | |
| Format | Paper |
| Language | English |
| Published | Cold Spring Harbor: Cold Spring Harbor Laboratory Press; Cold Spring Harbor Laboratory, 17.10.2022 |
| Edition | 1.1 |
| Subjects | |
| ISSN | 2692-8205 |
| DOI | 10.1101/2022.10.12.511997 |
| Summary: | The identification of duplicate reads is an essential preprocessing step in short-read sequencing analysis. For large sequencing libraries this step is typically time-consuming and resource-intensive. Here we present streammd: a fast, memory-efficient, single-pass duplicate marking tool operating on the principle of a Bloom filter. We show that streammd closely reproduces the outputs of Picard MarkDuplicates, a widely used duplicate marking program, while being substantially faster and suitable for pipelined applications, and that it requires much less memory than SAMBLASTER, another single-pass duplicate marking tool. Competing Interest Statement: The authors have declared no competing interest. Footnotes: https://github.com/delocalizer/streammd |
|---|---|
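
The summary describes single-pass duplicate marking "on the principle of a Bloom filter". As a rough illustration of that general idea only, and not of streammd's actual implementation, the following minimal Python sketch marks a read as a duplicate when its alignment key has probably been seen before. The `BloomFilter` class, the `mark_duplicates` function, the key format (`chrom:pos:strand`), and the default sizes are all hypothetical choices made for this example; real tools typically key on a fuller paired-end alignment signature.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array probed by k hash positions (illustrative only)."""

    def __init__(self, n_bits: int, n_hashes: int) -> None:
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, key: str):
        # Derive k positions from one 128-bit digest via double hashing.
        digest = hashlib.blake2b(key.encode(), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1  # force an odd step
        for i in range(self.n_hashes):
            yield (h1 + i * h2) % self.n_bits

    def add(self, key: str) -> bool:
        """Insert key; return True if every probed bit was already set."""
        seen = True
        for pos in self._positions(key):
            byte, mask = pos >> 3, 1 << (pos & 7)
            if not self.bits[byte] & mask:
                seen = False
                self.bits[byte] |= mask
        return seen


def mark_duplicates(records, n_bits=1 << 20, n_hashes=7):
    """Yield (read_name, is_duplicate) in one pass over the input.

    Each record is a hypothetical (name, chrom, pos, strand) tuple.
    """
    bloom = BloomFilter(n_bits, n_hashes)
    for name, chrom, pos, strand in records:
        key = f"{chrom}:{pos}:{strand}"
        yield name, bloom.add(key)


records = [
    ("r1", "chr1", 100, "+"),
    ("r2", "chr1", 100, "+"),  # same key as r1
    ("r3", "chr1", 100, "-"),
    ("r4", "chr2", 5, "+"),
]
result = list(mark_duplicates(records))
```

Because a Bloom filter stores only bits rather than the keys themselves, memory use is fixed in advance, which is the property the summary emphasizes. The trade-off is a small, tunable false-positive rate: a read that has not been seen before may occasionally be flagged as a duplicate, whereas a genuine repeat of an earlier key is never missed.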