VC@Scale: Scalable and high-performance variant calling on cluster environments

Bibliographic Details
Published in GigaScience Vol. 10; no. 9
Main Authors Ahmad, Tanveer; Al Ars, Zaid; Hofstee, H Peter
Format Journal Article
Language English
Published United States, Oxford University Press, 07.09.2021
ISSN 2047-217X
DOI 10.1093/gigascience/giab057

Abstract

Background: Recently, many new deep learning–based variant-calling methods such as DeepVariant have emerged as more accurate than conventional variant-calling algorithms such as GATK HaplotypeCaller, Strelka2, and Freebayes, albeit at higher computational cost. There is therefore a need for more scalable and higher-performance workflows for these deep learning methods. Almost all existing cluster-scaled variant-calling workflows that use Apache Spark/Hadoop as big data frameworks loosely integrate existing single-node pre-processing and variant-calling applications. Using Apache Spark only to distribute and schedule data among loosely coupled applications, or using I/O-based storage to hold the output of intermediate applications, does not exploit the full benefit of Apache Spark in-memory processing. To exploit that benefit, we propose a native Spark-based workflow that uses Python and Apache Arrow to enable efficient transfer of data between workflow stages. This workflow benefits from the ease of programmability of Python and the high efficiency of Arrow's columnar in-memory data transformations.

Results: Here we present a scalable, parallel, and efficient implementation of next-generation sequencing data pre-processing and variant-calling workflows. Our design tightly integrates most pre-processing workflow stages, using Spark built-in functions to sort reads by coordinates and mark duplicates efficiently. Our approach outperforms state-of-the-art implementations by >2 times for the pre-processing stages, creating a scalable and high-performance solution for DeepVariant on both CPU-only and CPU + GPU clusters.

Conclusions: We show the feasibility and easy scalability of our approach to achieve high performance and efficient resource utilization for variant-calling analysis on high-performance computing clusters using the standardized Apache Arrow data representations. All code, scripts, and configurations used to run our implementations are publicly available and open source; see https://github.com/abs-tudelft/variant-calling-at-scale.
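
The abstract names two pre-processing steps, sorting reads by coordinate and marking duplicates. As a rough illustration of what those steps compute, here is a minimal sketch in plain Python. It is not the authors' implementation, which the abstract describes as built on Spark built-in functions and Apache Arrow. The record fields (name, chrom, pos, strand, qual) and the rule of keeping the highest-quality read per position are assumptions made for this example only.

```python
from collections import defaultdict

# Hypothetical read records; field names are illustrative, not from the paper.
reads = [
    {"name": "r1", "chrom": "chr1", "pos": 100, "strand": "+", "qual": 30},
    {"name": "r2", "chrom": "chr1", "pos": 100, "strand": "+", "qual": 37},
    {"name": "r3", "chrom": "chr1", "pos": 50, "strand": "-", "qual": 20},
    {"name": "r4", "chrom": "chr2", "pos": 10, "strand": "+", "qual": 25},
]


def mark_duplicates(records):
    """Flag every read except the best-scoring one at each (chrom, pos, strand)."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["chrom"], rec["pos"], rec["strand"])].append(rec)

    marked = []
    for group in groups.values():
        # Assumption: the read with the highest quality score is the one kept.
        best = max(group, key=lambda r: r["qual"])
        for rec in group:
            marked.append({**rec, "duplicate": rec is not best})
    return marked


def sort_by_coordinate(records):
    """Order reads by chromosome name, then by start position.

    Chromosome names are compared as plain strings here, so a real
    workflow would need an explicit contig order (e.g. chr2 before chr10).
    """
    return sorted(records, key=lambda r: (r["chrom"], r["pos"]))


result = sort_by_coordinate(mark_duplicates(reads))
for rec in result:
    print(rec["name"], rec["chrom"], rec["pos"], "dup" if rec["duplicate"] else "")
```

In a cluster setting the same grouping and ordering would be expressed as distributed operations over partitions of reads rather than as in-memory Python loops; the sketch only shows the per-read logic.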
Author_xml – sequence: 1
  givenname: Tanveer
  orcidid: 0000-0003-0519-2315
  surname: Ahmad
  fullname: Ahmad, Tanveer
  email: t.ahmad@tudelft.nl
– sequence: 2
  givenname: Zaid
  orcidid: 0000-0001-7670-8572
  surname: Al Ars
  fullname: Al Ars, Zaid
– sequence: 3
  givenname: H Peter
  orcidid: 0000-0001-9649-7338
  surname: Hofstee
  fullname: Hofstee, H Peter
BackLink https://www.ncbi.nlm.nih.gov/pubmed/34494101 (View this record in MEDLINE/PubMed)
ContentType Journal Article
Copyright The Author(s) 2021. Published by Oxford University Press GigaScience. 2021
Discipline Library & Information Science
EISSN 2047-217X
ExternalDocumentID 10.1093/gigascience/giab057
PMC8424057
34494101
10_1093_gigascience_giab057
Genre Research Support, Non-U.S. Gov't
Journal Article
ISSN 2047-217X
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Issue 9
Keywords whole-genome sequencing
sorting
MarkDuplicate
DeepVariant
BWA-MEM
Apache Spark
Apache Arrow
Language English
License This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
https://creativecommons.org/licenses/by/4.0
The Author(s) 2021. Published by Oxford University Press GigaScience.
cc-by
ORCID 0000-0003-0519-2315
0000-0001-7670-8572
0000-0001-9649-7338
OpenAccessLink http://journals.scholarsportal.info/openUrl.xqy?doi=10.1093/gigascience/giab057
PMID 34494101
PublicationCentury 2000
PublicationDate 20210907
PublicationDateYYYYMMDD 2021-09-07
PublicationDate_xml – month: 9
  year: 2021
  text: 20210907
  day: 7
PublicationDecade 2020
PublicationPlace United States
PublicationPlace_xml – name: United States
– name: Oxford
PublicationTitle Gigascience
PublicationTitleAlternate Gigascience
PublicationYear 2021
Publisher Oxford University Press
Publisher_xml – name: Oxford University Press
SubjectTerms Algorithms
Big Data
Central processing units
Clusters
Computer applications
Computer memory
CPUs
Data processing
Deep learning
High performance computing
High-Throughput Nucleotide Sequencing - methods
Machine learning
Next-generation sequencing
Resource utilization
Software
Storage
Technical Note
Whole genome sequencing
Workflow
Title VC@Scale: Scalable and high-performance variant calling on cluster environments
URI https://www.ncbi.nlm.nih.gov/pubmed/34494101
https://www.proquest.com/docview/2715815472
https://www.proquest.com/docview/3169743782
https://www.proquest.com/docview/2570374230
https://pubmed.ncbi.nlm.nih.gov/PMC8424057
https://academic.oup.com/gigascience/article-pdf/10/9/giab057/40327053/giab057.pdf
Volume 10