An Efficient Parallel Sketch-Based Algorithmic Workflow for Mapping Long Reads
Published in | IEEE Transactions on Computational Biology and Bioinformatics, Vol. 22, no. 1, pp. 13-26 |
---|---|
Main Authors | , , |
Format | Journal Article |
Language | English |
Published | United States, IEEE, 01.01.2025 |
ISSN | 2998-4165 |
DOI | 10.1109/TCBB.2024.3489478 |
Summary: | Long read technologies are continuing to evolve at a rapid pace, with the latest of the high-fidelity technologies delivering reads over 10 Kbp with high accuracy. Classical long read assemblers produce assemblies directly from long reads. Hybrid assembly workflows provide ways to combine partially constructed assemblies (or contigs) with newly sequenced long reads in order to generate genomic scaffolds. Under either setting, the main computational bottleneck is the step of mapping the long reads. While many tools implement the mapping step through overlap computations, designing alignment-free approaches is necessary for large-scale computations. In this paper, we visit the problem of mapping long reads to a database of subject sequences quickly and accurately. We present an efficient parallel algorithmic workflow, called JEM-mapper, that uses a new minimizer-based Jaccard estimator (or JEM) sketch to perform alignment-free mapping of long reads. For implementation and evaluation, we consider two application settings: (i) the hybrid scaffolding setting, which aims to map long reads to partial assemblies; and (ii) the classical long read assembly setting, which aims to map long reads to one another. We implemented an MPI+OpenMP version of JEM-mapper to enable parallelism at both the distributed- and shared-memory layers. Experimental evaluation shows that JEM-mapper produces high-quality mappings while significantly improving the time to solution compared to state-of-the-art tools; e.g., in the hybrid setting for a large genome with 429K HiFi long reads and 98K contigs, JEM-mapper produces a mapping with 99.41% precision and 97.91% recall, and a 6.9x speedup over a state-of-the-art mapper. |
---|---|
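
The summary describes a minimizer-based Jaccard estimator sketch but this record does not give its construction. As a rough illustration of the general idea only, the following sketch combines two standard ingredients: (w, k)-minimizers to subsample k-mers, and MinHash to estimate the Jaccard similarity of the resulting minimizer sets. It is not JEM-mapper's implementation. The function names, the parameters `k`, `w`, and `num_hashes`, and the choice of BLAKE2b as the hash are all assumptions made for this example, and reverse-complement (canonical) k-mers are ignored for brevity.

```python
import hashlib


def kmer_hash(kmer: str, seed: int = 0) -> int:
    """Hash a k-mer to a 64-bit integer; `seed` selects one of several hash functions."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minimizers(seq: str, k: int, w: int) -> set:
    """Return the set of (w, k)-minimizers of `seq`.

    For every window of `w` consecutive k-mers, keep the k-mer with the
    smallest hash value (hypothetical parameter choices, for illustration).
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    # A sequence with fewer than w k-mers is treated as a single window.
    for start in range(max(1, len(kmers) - w + 1)):
        window = kmers[start:start + w]
        if window:
            picked.add(min(window, key=kmer_hash))
    return picked


def minhash_sketch(items: set, num_hashes: int) -> list:
    """MinHash sketch: for each of `num_hashes` hash functions, the minimum hash over `items`."""
    if not items:
        raise ValueError("cannot sketch an empty set")
    return [min(kmer_hash(x, seed=t) for x in items) for t in range(1, num_hashes + 1)]


def jaccard_estimate(sketch_a: list, sketch_b: list) -> float:
    """Estimate Jaccard similarity as the fraction of sketch positions that agree."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


if __name__ == "__main__":
    read = "ACGTTGCATGCCGATAGCTAGCTAGGATCCGATCGATTACGGAT"
    contig = "TTGCATGCCGATAGCTAGCTAGGATCCGATCGATTACGGATCCA"
    s_read = minhash_sketch(minimizers(read, k=5, w=4), num_hashes=64)
    s_contig = minhash_sketch(minimizers(contig, k=5, w=4), num_hashes=64)
    print(f"estimated Jaccard similarity: {jaccard_estimate(s_read, s_contig):.2f}")
```

In an alignment-free mapper of this general kind, a read would be assigned to the subject sequence whose sketch agrees with it most often, so no base-level alignment is computed. How JEM-mapper selects and compares its sketches is described in the paper itself, not in this record.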