VC@Scale: Scalable and high-performance variant calling on cluster environments

Bibliographic Details
Published in	Gigascience Vol. 10; no. 9
Main Authors	Ahmad, Tanveer; Al Ars, Zaid; Hofstee, H Peter
Format	Journal Article
Language	English
Published	United States: Oxford University Press, 7 September 2021
Subjects	Algorithms; Big Data; Central processing units; Clusters; Computer applications; Computer memory; CPUs; Data processing; Deep learning; High performance computing; High-Throughput Nucleotide Sequencing - methods; Machine learning; Next-generation sequencing; Resource utilization; Software; Storage; Technical Note; Whole genome sequencing; Workflow; sorting; MarkDuplicate; DeepVariant; BWA-MEM; Apache Spark; Apache Arrow
ISSN	2047-217X
DOI	10.1093/gigascience/giab057

Abstract	Background: Recently, many new deep learning-based variant-calling methods, such as DeepVariant, have emerged that are more accurate than conventional variant-calling algorithms such as GATK HaplotypeCaller, Strelka2, and FreeBayes, albeit at higher computational cost. There is therefore a need for more scalable and higher-performance workflows for these deep learning methods. Almost all existing cluster-scaled variant-calling workflows that use Apache Spark/Hadoop as big data frameworks loosely integrate existing single-node pre-processing and variant-calling applications. Using Apache Spark merely to distribute and schedule data among loosely coupled applications, or using I/O-based storage to hold the output of intermediate applications, does not exploit the full benefit of Apache Spark's in-memory processing. To exploit that benefit, we propose a native Spark-based workflow that uses Python and Apache Arrow to enable efficient transfer of data between workflow stages. This approach benefits from the ease of programmability of Python and the high efficiency of Arrow's columnar in-memory data transformations.
Results: Here we present a scalable, parallel, and efficient implementation of next-generation sequencing data pre-processing and variant-calling workflows. Our design tightly integrates most pre-processing workflow stages, using Spark built-in functions to sort reads by coordinates and mark duplicates efficiently. Our approach outperforms state-of-the-art implementations by more than a factor of 2 for the pre-processing stages, creating a scalable and high-performance solution for DeepVariant on both CPU-only and CPU + GPU clusters.
Conclusions: We show the feasibility and easy scalability of our approach, which achieves high performance and efficient resource utilization for variant-calling analysis on high-performance computing clusters using the standardized Apache Arrow data representations. All code, scripts, and configurations used to run our implementations are publicly available and open source; see https://github.com/abs-tudelft/variant-calling-at-scale.
License	This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. The Author(s) 2021. Published by Oxford University Press GigaScience.
ORCID	Ahmad, Tanveer: 0000-0003-0519-2315; Al Ars, Zaid: 0000-0001-7670-8572; Hofstee, H Peter: 0000-0001-9649-7338
PMID	34494101
PQID	2715815472
PQPubID	2040230
ParticipantIDs	unpaywall_primary_10_1093_gigascience_giab057 pubmedcentral_primary_oai_pubmedcentral_nih_gov_8424057 proquest_miscellaneous_2570374230 proquest_journals_3169743782 proquest_journals_2715815472 pubmed_primary_34494101 crossref_primary_10_1093_gigascience_giab057 crossref_citationtrail_10_1093_gigascience_giab057 oup_primary_10_1093_gigascience_giab057
ProviderPackageCode	CITATION AAYXX
PublicationCentury	2000
PublicationDate	20210907
PublicationDateYYYYMMDD	2021-09-07
PublicationDate_xml	– month: 9 year: 2021 text: 20210907 day: 7
PublicationDecade	2020
PublicationPlace	United States
PublicationPlace_xml	– name: United States – name: Oxford
PublicationTitle	Gigascience
PublicationTitleAlternate	Gigascience
PublicationYear	2021
Publisher	Oxford University Press
Publisher_xml	– name: Oxford University Press
References	Ahmad (2024111605050385500_bib48) 2021 Apache (2024111605050385500_bib37) 2019 Zhang (2024111605050385500_bib10) 2019; 10 Shen (2024111605050385500_bib36) 2016; 11 Cooke (2024111605050385500_bib23) 2021; 39 Apache (2024111605050385500_bib34) 2019 Garrison (2024111605050385500_bib24) 2012 (ENA) TENA (2024111605050385500_bib41) 2020 Apache (2024111605050385500_bib33) 2019 Carroll (2024111605050385500_bib47) 2017 Sahraeian (2024111605050385500_bib22) 2019 Krusche (2024111605050385500_bib49) 2021 Darling (2024111605050385500_bib31) 2003; 2003 Lustre (2024111605050385500_bib45) 2020 SurfSara (2024111605050385500_bib44) 2020 Illumina (2024111605050385500_bib40) 2012 Picard toolkit (2024111605050385500_bib14) Mushtaq (2024111605050385500_bib6) 2017 Wilm (2024111605050385500_bib27) 2012; 40 Li (2024111605050385500_bib12) 2009; 25 Apache (2024111605050385500_bib5) 2019 GIAB (2024111605050385500_bib42) 2020 Sahraeian (2024111605050385500_bib21) 2019; 10 Kim (2024111605050385500_bib25) 2018; 15 Slurm (2024111605050385500_bib46) 2020 FDA (2024111605050385500_bib38) 2019 Faust (2024111605050385500_bib16) 2014; 30 Broad Institute (2024111605050385500_bib9) 2018 Cappello (2024111605050385500_bib2) 2014; 1 Luo (2024111605050385500_bib30) 2012; 1 Wei (2024111605050385500_bib26) 2011; 39 FDA (2024111605050385500_bib28) 2019 Tarasov (2024111605050385500_bib15) 2015; 31 Lai (2024111605050385500_bib19) 2016; 44 Poplin (2024111605050385500_bib17) 2018; 36 2024111605050385500_bib50 Langmead (2024111605050385500_bib11) 2012; 9 Cibulskis (2024111605050385500_bib20) 2013; 31 Abuín (2024111605050385500_bib8) 2016; 11 Jin (2024111605050385500_bib35) 2018 FDA (2024111605050385500_bib29) 2019 UCSC (2024111605050385500_bib43) 2020 Gropp (2024111605050385500_bib1) 2004; 18 Koboldt (2024111605050385500_bib18) 2012; 22 UCSC (2024111605050385500_bib39) 2018 Massie (2024111605050385500_bib7) 2013 Li (2024111605050385500_bib13) 2009; 1 Decap (2024111605050385500_bib4) 2015; 31 Apache Apache 
Hadoop (2024111605050385500_bib3) 2019 Liu (2024111605050385500_bib32) 2014; 9
References_xml	– volume: 1 issue: 1 year: 2012 ident: 2024111605050385500_bib30 article-title: Speeding up large-scale next generation sequencing data analysis with pBWA publication-title: J Appl Bioinform Comput Biol doi: 10.4172/2329-9533.1000101 – year: 2020 ident: 2024111605050385500_bib42 article-title: NHGRI Illumina 300X BAM – volume: 10 start-page: 1041 issue: 1 year: 2019 ident: 2024111605050385500_bib21 article-title: Deep convolutional neural networks for accurate somatic mutation detection publication-title: Nat Commun doi: 10.1038/s41467-019-09027-x – year: 2020 ident: 2024111605050385500_bib41 article-title: Illumina 30X – year: 2019 ident: 2024111605050385500_bib22 article-title: Robust cancer mutation detection with deep learning models derived from tumor-normal sequencing data doi: 10.1101/667261 – year: 2019 ident: 2024111605050385500_bib34 article-title: PySpark Usage Guide for Pandas with Apache Arrow – volume: 39 start-page: e132 issue: 19 year: 2011 ident: 2024111605050385500_bib26 article-title: SNVer: A statistical tool for variant calling in analysis of pooled or individual next-generation sequencing data publication-title: Nucleic Acids Res doi: 10.1093/nar/gkr599 – year: 2019 ident: 2024111605050385500_bib38 article-title: precisionFDA: A community platform for NGS assay evaluation and regulatory science exploration – volume: 31 start-page: 2032 issue: 12 year: 2015 ident: 2024111605050385500_bib15 article-title: Sambamba: Fast processing of NGS alignment formats publication-title: Bioinformatics doi: 10.1093/bioinformatics/btv098 – volume: 9 start-page: 357 issue: 4 year: 2012 ident: 2024111605050385500_bib11 article-title: Fast gapped-read alignment with Bowtie 2 publication-title: Nat Methods doi: 10.1038/nmeth.1923 – volume: 44 start-page: e108 issue: 11 year: 2016 ident: 2024111605050385500_bib19 article-title: VarDict: A novel and versatile variant caller for next-generation sequencing in cancer research publication-title: Nucleic 
Acids Res doi: 10.1093/nar/gkw227 – volume: 40 start-page: 11189 issue: 22 year: 2012 ident: 2024111605050385500_bib27 article-title: LoFreq: A sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets publication-title: Nucleic Acids Res doi: 10.1093/nar/gks918 – volume: 2003 year: 2003 ident: 2024111605050385500_bib31 article-title: The design, implementation, and evaluation of mpiBLAST publication-title: Proc Cluster World – ident: 2024111605050385500_bib14 article-title: Broad Institute – volume: 31 start-page: 213 year: 2013 ident: 2024111605050385500_bib20 article-title: Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples publication-title: Nat Biotechnol doi: 10.1038/nbt.2514 – year: 2019 ident: 2024111605050385500_bib5 article-title: Apache Spark: Lightning-fast unified analytics engine – volume: 39 start-page: 885 year: 2021 ident: 2024111605050385500_bib23 article-title: A unified haplotype-based method for accurate and comprehensive variant calling publication-title: Nat Biotechnol doi: 10.1038/s41587-021-00861-3 – year: 2018 ident: 2024111605050385500_bib39 article-title: faSplit – volume: 22 start-page: 568 issue: 3 year: 2012 ident: 2024111605050385500_bib18 article-title: VarScan 2: Somatic mutation and copy number alteration discovery in cancer by exome sequencing publication-title: Genome Res doi: 10.1101/gr.129684.111 – year: 2019 ident: 2024111605050385500_bib37 article-title: Plasma In-Memory Object Store – year: 2020 ident: 2024111605050385500_bib46 article-title: Slurm workload manager – volume: 1 start-page: 2078 issue: 25 year: 2009 ident: 2024111605050385500_bib13 article-title: The Sequence Alignment/Map format and SAMtools publication-title: Bioinformatics doi: 10.1093/bioinformatics/btp352 – year: 2021 ident: 2024111605050385500_bib49 article-title: Haplotype VCF comparison tools – year: 2018 ident: 
2024111605050385500_bib9 article-title: BWA on Spark – year: 2012 ident: 2024111605050385500_bib40 article-title: Illumina Cambridge Ltd – volume: 10 start-page: 886 issue: 11 year: 2019 ident: 2024111605050385500_bib10 article-title: PipeMEM: A framework to speed up BWA-MEM in Spark with low overhead publication-title: Genes doi: 10.3390/genes10110886 – year: 2019 ident: 2024111605050385500_bib29 article-title: PrecisionFDA Truth Challenge V2: Calling variants from short and long reads in difficult-to-map regions – year: 2013 ident: 2024111605050385500_bib7 article-title: ADAM: Genomics formats and processing patterns for cloud scale computing – start-page: 148 volume-title: Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, ACM-BCB ’17, Boston, MA, USA year: 2017 ident: 2024111605050385500_bib6 article-title: SparkGA: A Spark framework for cost effective, fast and accurate DNA analysis at scale – volume: 11 start-page: 1 issue: 5 year: 2016 ident: 2024111605050385500_bib8 article-title: SparkBWA: Speeding up the alignment of high-throughput DNA sequencing data publication-title: PLoS One doi: 10.1371/journal.pone.0155461 – volume: 9 issue: 1 year: 2014 ident: 2024111605050385500_bib32 article-title: CUSHAW3: Sensitive and accurate base-space and color-space short-read alignment with hybrid seeding publication-title: PLoS One doi: 10.1371/journal.pone.0086869 – year: 2018 ident: 2024111605050385500_bib35 article-title: Introducing Pandas UDF for PySpark – year: 2019 ident: 2024111605050385500_bib33 article-title: Apache Arrow: A cross-language development platform for in-memory data – volume: 30 start-page: 2503 issue: 17 year: 2014 ident: 2024111605050385500_bib16 article-title: SAMBLASTER: Fast duplicate marking and structural variant read extraction publication-title: Bioinformatics doi: 10.1093/bioinformatics/btu314 – year: 2012 ident: 2024111605050385500_bib24 article-title: Haplotype-based 
variant detection from short-read sequencing – ident: 2024111605050385500_bib50 doi: 10.1093/gigascience/giab057 – volume: 31 start-page: 2482 issue: 15 year: 2015 ident: 2024111605050385500_bib4 article-title: Halvade: scalable sequence analysis with MapReduce publication-title: Bioinformatics doi: 10.1093/bioinformatics/btv179 – year: 2019 ident: 2024111605050385500_bib3 – year: 2021 ident: 2024111605050385500_bib48 article-title: Standalone pre-processing on clusters – volume: 18 start-page: 363 issue: 3 year: 2004 ident: 2024111605050385500_bib1 article-title: Fault tolerance in message passing interface programs publication-title: Int J High Perform Comput Appl doi: 10.1177/1094342004046045 – year: 2020 ident: 2024111605050385500_bib44 article-title: Cartesius: the Dutch supercomputer – volume: 25 start-page: 1754 issue: 14 year: 2009 ident: 2024111605050385500_bib12 article-title: Fast and accurate short read alignment with Burrows–Wheeler transform publication-title: Bioinformatics doi: 10.1093/bioinformatics/btp324 – volume: 11 issue: 10 year: 2016 ident: 2024111605050385500_bib36 article-title: SeqKit: A cross-platform and ultrafast toolkit for FASTA/Q file manipulation publication-title: PLoS One doi: 10.1371/journal.pone.0163962 – year: 2019 ident: 2024111605050385500_bib28 article-title: PrecisionFDA Truth Challenge – year: 2020 ident: 2024111605050385500_bib43 article-title: UCSC hg19 (GRCh37) – volume: 15 start-page: 591 issue: 8 year: 2018 ident: 2024111605050385500_bib25 article-title: Strelka2: fast and accurate calling of germline and somatic variants publication-title: Nat Methods doi: 10.1038/s41592-018-0051-x – year: 2017 ident: 2024111605050385500_bib47 article-title: Evaluating DeepVariant: A new deep learning variant caller from the Google Brain Team – volume: 1 start-page: 5 issue: 1 year: 2014 ident: 2024111605050385500_bib2 article-title: Toward exascale resilience: 2014 update publication-title: Supercomput Front Innov – year: 2020 
ident: 2024111605050385500_bib45 article-title: Lustre parallel filesystem – volume: 36 start-page: 983 year: 2018 ident: 2024111605050385500_bib17 article-title: A universal SNP and small-indel variant caller using deep neural networks publication-title: Nat Biotechnol doi: 10.1038/nbt.4235
Title	VC@Scale: Scalable and high-performance variant calling on cluster environments
URI	https://www.ncbi.nlm.nih.gov/pubmed/34494101 https://www.proquest.com/docview/2715815472 https://www.proquest.com/docview/3169743782 https://www.proquest.com/docview/2570374230 https://pubmed.ncbi.nlm.nih.gov/PMC8424057 https://academic.oup.com/gigascience/article-pdf/10/9/giab057/40327053/giab057.pdf
Volume	10