An Efficient Parallel Sketch-Based Algorithmic Workflow for Mapping Long Reads

Long read technologies are continuing to evolve at a rapid pace, with the latest of the high fidelity technologies delivering reads over 10 Kbp with high accuracy. Classical long read assemblers produce assemblies directly from long reads. Hybrid assembly workflows provide ways to combine partially...

Full description

Saved in:
Bibliographic Details
Published inIEEE Transactions on Computational Biology and Bioinformatics Vol. 22; no. 1; pp. 13 - 26
Main Authors Rahman, Tazin, Bhowmik, Oieswarya, Kalyanaraman, Ananth
Format Journal Article
LanguageEnglish
Published United States IEEE 01.01.2025
Subjects
Online AccessGet full text
ISSN2998-4165
2998-4165
DOI10.1109/TCBB.2024.3489478

Cover

More Information
Summary:Long read technologies are continuing to evolve at a rapid pace, with the latest of the high fidelity technologies delivering reads over 10 Kbp with high accuracy. Classical long read assemblers produce assemblies directly from long reads. Hybrid assembly workflows provide ways to combine partially constructed assemblies (or contigs) with newly sequenced long reads in order to generate genomic scaffolds. Under either setting, the main computational bottleneck is the step of mapping the long reads. While many tools implement the mapping step through overlap computations, designing alignment-free approaches is necessary for large-scale computations. In this paper, we visit the problem of mapping long reads to a database of subject sequences, in a fast and accurate manner. We present an efficient parallel algorithmic workflow, called JEM-mapper , that uses a new minimizer-based Jaccard estimator (or JEM) sketch to perform alignment-free mapping of long reads. For implementation and evaluation, we consider two application settings: (i) the hybrid scaffolding setting, which aims to map long reads to partial assemblies; and (ii) the classical long read assembly setting, which aims to map long reads to one another. We implemented an MPI+OpenMP version of JEM-mapper to enable parallelism at both distributed- and shared-memory layers. Experimental evaluation shows that JEM-mapper produces high-quality mapping while significantly improving the time to solution compared to state-of-the-art tools; e.g., in the hybrid setting for a large genome with 429 K HiFi long reads and 98K contigs, JEM-mapper produces a mapping with 99.41% precision and 97.91% recall, and 6.9x speedup over a state-of-the-art mapper.
Bibliography:ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 23
ISSN:2998-4165
2998-4165
DOI:10.1109/TCBB.2024.3489478