streammd: fast low-memory duplicate marking using a Bloom filter

Bibliographic Details
Published in	bioRxiv
Main Author	Leonard, Conrad R
Format	Paper
Language	English
Published	Cold Spring Harbor: Cold Spring Harbor Laboratory Press, 17.10.2022
Edition	1.1
Subjects	Bioinformatics Sequence analysis
ISSN	2692-8205
DOI	10.1101/2022.10.12.511997

Summary:	The identification of duplicate reads is an essential preprocessing step in short-read sequencing analysis. For large sequencing libraries this step is typically time-consuming and resource-intensive. Here we present streammd: a fast, memory-efficient, single-pass duplicate marking tool operating on the principle of a Bloom filter. We show that streammd closely reproduces the outputs of Picard MarkDuplicates, a widely used duplicate marking program, while being substantially faster and suitable for pipelined applications, and that it requires much less memory than SAMBLASTER, another single-pass duplicate marking tool.
Competing Interest Statement:	The authors have declared no competing interest.
Footnotes:	https://github.com/delocalizer/streammd
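
The summary describes single-pass duplicate marking based on a Bloom filter. The following is a minimal Python sketch of that general idea, not streammd's actual implementation. The key format (a string built from the fragment's end coordinates and strands), the function name mark_duplicates, and the default sizing parameters are all assumptions made for illustration. Each key is tested against the filter and inserted in the same step, so the input is read only once and memory use is fixed up front. A Bloom filter never misses a key it has already seen, but it can occasionally report an unseen key as seen, so a small, tunable fraction of reads may be flagged as duplicates incorrectly.

```python
import hashlib
import math


class BloomFilter:
    """Fixed-size bit array probed by k hash positions per key."""

    def __init__(self, expected_items: int, false_positive_rate: float):
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.m = max(8, math.ceil(
            -expected_items * math.log(false_positive_rate) / math.log(2) ** 2))
        self.k = max(1, round(self.m / expected_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key: str):
        # Double hashing: derive k positions from two 64-bit halves of one digest.
        digest = hashlib.blake2b(key.encode(), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key: str) -> bool:
        """Insert key; return True if all its bits were already set."""
        seen = True
        for pos in self._positions(key):
            byte, mask = pos >> 3, 1 << (pos & 7)
            if not self.bits[byte] & mask:
                seen = False
                self.bits[byte] |= mask
        return seen


def mark_duplicates(keys, expected_items=1_000_000, false_positive_rate=1e-6):
    """Yield (key, is_duplicate) for each key in a single pass.

    Hypothetical helper: in a real tool the key would be derived from the
    alignment records of a read pair, and the flag written back to them.
    """
    bloom = BloomFilter(expected_items, false_positive_rate)
    for key in keys:
        yield key, bloom.add(key)


if __name__ == "__main__":
    # Illustrative keys only: "chrom:pos:strand" for each end of a fragment.
    reads = [
        "chr1:100:F|chr1:350:R",
        "chr1:100:F|chr1:350:R",
        "chr2:5:F|chr2:200:R",
    ]
    for key, is_dup in mark_duplicates(reads):
        print(key, "DUPLICATE" if is_dup else "unique")
```

With the assumed defaults (one million expected keys, a target false positive rate of one in a million), the sizing formulas give a filter of roughly 3.6 MB, which illustrates why this approach can use far less memory than storing every key exactly.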