streammd: fast low-memory duplicate marking using a Bloom filter
The identification of duplicate reads is an essential preprocessing step in short-read sequencing analysis. For large sequencing libraries this step is typically time-consuming and resource-intensive. Here we present streammd: a fast, memory-efficient, single-pass duplicate marking tool operating o...
| Published in | bioRxiv |
|---|---|
| Format | Paper |
| Language | English |
| Published | Cold Spring Harbor: Cold Spring Harbor Laboratory Press; Cold Spring Harbor Laboratory, 17.10.2022 |
| Edition | 1.1 |
| ISSN | 2692-8205 |
| DOI | 10.1101/2022.10.12.511997 |
| Summary: | The identification of duplicate reads is an essential preprocessing step in short-read sequencing analysis. For large sequencing libraries this step is typically time-consuming and resource-intensive. Here we present streammd: a fast, memory-efficient, single-pass duplicate marking tool operating on the principle of a Bloom filter. We show that streammd closely reproduces the outputs of Picard MarkDuplicates, a widely used duplicate marking program, while being substantially faster and suitable for pipelined applications, and that it requires much less memory than SAMBLASTER, another single-pass duplicate marking tool. Competing Interest Statement: The authors have declared no competing interest. Footnotes: https://github.com/delocalizer/streammd |
|---|---|
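
The summary describes single-pass duplicate marking on the principle of a Bloom filter. As a rough illustration of that general idea only, and not of streammd's actual implementation, the Python sketch below keeps a fixed-size bit array and flags a read as a probable duplicate when every bit derived from its key is already set. The class `BloomFilter`, the function `mark_duplicates`, the key format (a string describing the aligned ends of a read pair), the filter size, and the number of hash functions are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array probed at k positions per key; no deletions."""

    def __init__(self, n_bits: int, n_hashes: int) -> None:
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, key: bytes):
        # Derive k bit positions from one 128-bit digest (double hashing).
        digest = hashlib.blake2b(key, digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1  # force an odd step
        for i in range(self.n_hashes):
            yield (h1 + i * h2) % self.n_bits

    def add(self, key: bytes) -> bool:
        """Insert key; return True if all its bits were already set."""
        already_present = True
        for pos in self._positions(key):
            byte, mask = pos >> 3, 1 << (pos & 7)
            if not self.bits[byte] & mask:
                already_present = False
                self.bits[byte] |= mask
        return already_present


def mark_duplicates(read_keys, n_bits=1 << 20, n_hashes=7):
    """Yield (key, is_probable_duplicate) in a single pass over the keys.

    Each key is a hypothetical string summarising where a read pair aligns,
    for example "chr1:100:F|chr1:350:R" (an assumed format for this sketch).
    """
    bloom = BloomFilter(n_bits, n_hashes)
    for key in read_keys:
        yield key, bloom.add(key.encode())


if __name__ == "__main__":
    keys = [
        "chr1:100:F|chr1:350:R",
        "chr1:200:F|chr1:420:R",
        "chr1:100:F|chr1:350:R",  # same ends as the first pair
    ]
    for key, is_dup in mark_duplicates(keys):
        print(key, "duplicate" if is_dup else "unique")
```

Because a Bloom filter stores only bits rather than the keys themselves, memory use is fixed in advance and each read is examined once, which fits streaming input. The trade-off is a small, tunable false-positive rate: an unseen key can occasionally be reported as a duplicate, while a true repeat is never missed.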