VC@Scale: Scalable and high-performance variant calling on cluster environments

Bibliographic Details
Published in	Gigascience Vol. 10; no. 9
Main Authors	Ahmad, Tanveer; Al-Ars, Zaid; Hofstee, H. Peter
Format	Journal Article
Language	English
Published	United States Oxford University Press 07.09.2021
Subjects	Algorithms; Big Data; Central processing units; Clusters; Computer applications; Computer memory; CPUs; Data processing; Deep learning; High performance computing; High-Throughput Nucleotide Sequencing - methods; Machine learning; Next-generation sequencing; Resource utilization; Software; Storage; Technical Note; Whole genome sequencing; Workflow; sorting; MarkDuplicate; DeepVariant; BWA-MEM; Apache Spark; Apache Arrow
ISSN	2047-217X
DOI	10.1093/gigascience/giab057

Summary:	Abstract
Background: Recently, many new deep learning–based variant-calling methods, such as DeepVariant, have emerged that are more accurate than conventional variant-calling algorithms such as GATK HaplotypeCaller, Strelka2, and FreeBayes, albeit at higher computational cost. There is therefore a need for more scalable, higher-performance workflows for these deep learning methods. Almost all existing cluster-scaled variant-calling workflows that use Apache Spark/Hadoop as big data frameworks only loosely integrate existing single-node pre-processing and variant-calling applications. Using Apache Spark merely to distribute and schedule data among loosely coupled applications, or writing the output of intermediate applications to I/O-based storage, does not exploit the full benefit of Apache Spark's in-memory processing. To exploit it, we propose a native Spark-based workflow that uses Python and Apache Arrow to transfer data efficiently between workflow stages. This approach benefits from the ease of programming in Python and the high efficiency of Arrow's columnar in-memory data transformations.
Results: Here we present a scalable, parallel, and efficient implementation of next-generation sequencing data pre-processing and variant-calling workflows. Our design tightly integrates most pre-processing workflow stages, using Spark built-in functions to sort reads by coordinate and mark duplicates efficiently. Our approach outperforms state-of-the-art implementations by more than 2 times for the pre-processing stages, creating a scalable and high-performance solution for DeepVariant on both CPU-only and CPU + GPU clusters.
Conclusions: We show the feasibility and easy scalability of our approach to achieve high performance and efficient resource utilization for variant-calling analysis on high-performance computing clusters using the standardized Apache Arrow data representations.
All code, scripts, and configurations used to run our implementations are publicly available as open source; see https://github.com/abs-tudelft/variant-calling-at-scale.
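The two pre-processing steps named in the abstract, sorting reads by coordinate and marking duplicates, can be illustrated with a minimal sketch. This is not the authors' implementation: the paper performs these steps with Spark built-in functions over Arrow data, whereas the sketch below uses plain Python on a handful of made-up records. The record fields, the `CHROM_ORDER` table, the function names, and the duplicate rule (same chromosome, position, and strand; keep the highest-scoring read) are all simplifying assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical, heavily simplified read records; real alignments carry many more fields.
reads = [
    {"name": "r1", "chrom": "chr2", "pos": 500, "strand": "+", "score": 30},
    {"name": "r2", "chrom": "chr1", "pos": 100, "strand": "+", "score": 20},
    {"name": "r3", "chrom": "chr1", "pos": 100, "strand": "+", "score": 35},
    {"name": "r4", "chrom": "chr1", "pos": 50, "strand": "-", "score": 25},
]

# Assumed reference order of chromosomes (normally taken from the alignment header).
CHROM_ORDER = {"chr1": 0, "chr2": 1}


def mark_duplicates(records):
    """Flag all but the best-scoring read in each (chrom, pos, strand) group."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["chrom"], rec["pos"], rec["strand"])].append(rec)

    marked = []
    for group in groups.values():
        best = max(group, key=lambda rec: rec["score"])
        for rec in group:
            # Copy the record and add a boolean duplicate flag.
            marked.append({**rec, "duplicate": rec is not best})
    return marked


def sort_by_coordinate(records):
    """Order reads by chromosome (reference order), then by start position."""
    return sorted(records, key=lambda rec: (CHROM_ORDER[rec["chrom"]], rec["pos"]))


processed = sort_by_coordinate(mark_duplicates(reads))
for rec in processed:
    print(rec["name"], rec["chrom"], rec["pos"], rec["duplicate"])
```

In a cluster setting the same grouping and ordering would be expressed as distributed operations over partitioned data rather than as in-memory Python loops, which is where the in-memory processing described in the abstract becomes relevant.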