GATK-gCNV enables the discovery of rare copy number variants from exome sequencing data

Bibliographic Details
Published in Nature Genetics Vol. 55; no. 9; pp. 1589-1597
Main Authors Babadi, Mehrtash; Fu, Jack M.; Lee, Samuel K.; Smirnov, Andrey N.; Gauthier, Laura D.; Walker, Mark; Benjamin, David I.; Zhao, Xuefang; Karczewski, Konrad J.; Wong, Isaac; Collins, Ryan L.; Sanchis-Juan, Alba; Brand, Harrison; Banks, Eric; Talkowski, Michael E.
Format Journal Article
Language English
Published New York Nature Publishing Group US 01.09.2023
Nature Publishing Group
ISSN 1061-4036
1546-1718
DOI 10.1038/s41588-023-01449-0

Summary: Copy number variants (CNVs) are major contributors to genetic diversity and disease. While standardized methods, such as the genome analysis toolkit (GATK), exist for detecting short variants, technical challenges have confounded uniform large-scale CNV analyses from whole-exome sequencing (WES) data. Given the profound impact of rare and de novo coding CNVs on genome organization and human disease, we developed GATK-gCNV, a flexible algorithm to discover rare CNVs from sequencing read-depth information, complete with open-source distribution via GATK. We benchmarked GATK-gCNV in 7,962 exomes from individuals in quartet families with matched genome sequencing and microarray data, finding up to 95% recall of rare coding CNVs at a resolution of more than two exons. We used GATK-gCNV to generate a reference catalog of rare coding CNVs in WES data from 197,306 individuals in the UK Biobank, and observed strong correlations between per-gene CNV rates and measures of mutational constraint, as well as rare CNV associations with multiple traits. In summary, GATK-gCNV is a tunable approach for sensitive and specific CNV discovery in WES data, with broad applications.
GATK-gCNV uses a probabilistic model and inference framework to discover rare copy number variants (CNVs) from sequencing read-depth information. This algorithm is used to generate a reference catalog of rare coding CNVs in exome sequencing data from UK Biobank.
Author contribution statement: M.B., D.I.B., and S.K.L. developed and implemented the GATK-gCNV model and the inference algorithm. A.S. contributed model enhancements and developed sample-clustering and batch-processing workflows. X.Z., A.S., and J.M.F. conducted benchmarking studies of GATK-gCNV performance. A.S., M.B., and S.K.L. developed WDL workflows for Terra integration and scalable analysis. M.E.T., J.M.F., E.B., H.B., S.K.L., M.W., and L.D.G. supervised aspects of this project at various stages of development. J.M.F., R.L.C., H.B., and K.J.K. contributed to association analyses. I.W. and J.M.F. generated the CNV callsets. J.M.F., I.W., R.L.C., A.S.-J., and H.B. conducted quality control on the generated callsets. M.B., J.M.F., R.L.C., H.B., and M.E.T. wrote the manuscript, which was edited by all authors.
These authors contributed equally