streammd: fast low-memory duplicate marking using a Bloom filter

Bibliographic Details
Published in bioRxiv
Main Author Leonard, Conrad R
Format Paper
Language English
Published Cold Spring Harbor Cold Spring Harbor Laboratory Press 17.10.2022
Cold Spring Harbor Laboratory
Edition 1.1
ISSN 2692-8205
DOI 10.1101/2022.10.12.511997

Summary: The identification of duplicate reads is an essential preprocessing step in short-read sequencing analysis. For large sequencing libraries this step is typically time-consuming and resource-intensive. Here we present streammd: a fast, memory-efficient, single-pass duplicate marking tool operating on the principle of a Bloom filter. We show that streammd closely reproduces the outputs of Picard MarkDuplicates, a widely used duplicate marking program, while being substantially faster and suitable for pipelined applications, and that it requires much less memory than SAMBLASTER, another single-pass duplicate marking tool.
Competing Interest Statement: The authors have declared no competing interest.
Footnotes: https://github.com/delocalizer/streammd
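The principle named in the summary, single-pass duplicate marking with a Bloom filter, can be illustrated with a minimal Python sketch. This is not streammd's actual implementation: the class `BloomFilter`, the function `mark_duplicates`, the record fields (`ref`, `pos`, `strand`) and the default `capacity` and `fp_rate` values are all assumptions made for illustration. The idea is that each read is reduced to a key describing where it aligns, and a read is marked as a duplicate if its key already appears to be in the filter.

```python
import hashlib
import math


class BloomFilter:
    """Fixed-size Bloom filter over byte strings (illustrative only)."""

    def __init__(self, capacity, fp_rate):
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.m = max(8, math.ceil(-capacity * math.log(fp_rate) / math.log(2) ** 2))
        self.k = max(1, round(self.m / capacity * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.blake2b(item, digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1  # force odd, hence non-zero
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        """Insert item; return True if it was (probably) already present."""
        seen = True
        for pos in self._positions(item):
            byte, mask = pos >> 3, 1 << (pos & 7)
            if not self.bits[byte] & mask:
                seen = False
                self.bits[byte] |= mask
        return seen


def mark_duplicates(records, capacity=1_000_000, fp_rate=1e-6):
    """Yield (record, is_duplicate) pairs in a single pass over the input."""
    bf = BloomFilter(capacity, fp_rate)
    for rec in records:
        # Hypothetical key: reference name, 5' position and strand.
        key = f"{rec['ref']}:{rec['pos']}:{rec['strand']}".encode()
        yield rec, bf.add(key)
```

Because the filter stores only bits, memory use depends on the chosen capacity and false-positive rate rather than on the size of the keys. The trade-off is that a Bloom filter can report a false positive (a unique read marked as a duplicate) but never a false negative, and the first read seen with a given key is always the one left unmarked.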