Integrative analysis of sequencing and array genotype data for discovering disease associations with rare mutations

Bibliographic Details
Published in Proceedings of the National Academy of Sciences - PNAS Vol. 112; no. 4; pp. 1019-1024
Main Authors Hu, Yi-Juan; Li, Yun; Auer, Paul L.; Lin, Dan-Yu
Format Journal Article
Language English
Published United States National Academy of Sciences 27.01.2015
National Acad Sciences
ISSN 0027-8424
1091-6490
DOI 10.1073/pnas.1406143112

Summary: In the large cohorts that have been used for genome-wide association studies (GWAS), it is prohibitively expensive to sequence all cohort members. A cost-effective strategy is to sequence subjects with extreme values of quantitative traits or those with specific diseases. By imputing the sequencing data from the GWAS data for the cohort members who are not selected for sequencing, one can dramatically increase the number of subjects with information on rare variants. However, ignoring the uncertainties of imputed rare variants in downstream association analysis will inflate the type I error when sequenced subjects are not a random subset of the GWAS subjects. In this article, we provide a valid and efficient approach to combining observed and imputed data on rare variants. We consider commonly used gene-level association tests, all of which are constructed from the score statistic for assessing the effects of individual variants on the trait of interest. We show that the score statistic based on the observed genotypes for sequenced subjects and the imputed genotypes for nonsequenced subjects is unbiased. We derive a robust variance estimator that reflects the true variability of the score statistic regardless of the sampling scheme and imputation quality, such that the corresponding association tests always have correct type I error. We demonstrate through extensive simulation studies that the proposed tests are substantially more powerful than the use of accurately imputed variants only and the use of sequencing data alone. We provide an application to the Women’s Health Initiative. The relevant software is freely available.
Significance: High-throughput DNA sequencing provides an unprecedented opportunity to discover rare genetic variants associated with complex diseases and traits. However, sequencing a large number of subjects is prohibitively expensive. It is common to select subjects for sequencing from the cohorts that have collected genotyping array data. We impute the sequencing data from the array data for the cohort members who are not selected for sequencing and perform gene-level association tests for rare variants by properly combining the observed genotypes for sequenced subjects and the imputed genotypes for nonsequenced subjects. This integrative analysis is substantially more powerful than the use of sequencing data alone and can accelerate the search for disease-causing mutations.
Bibliography: http://dx.doi.org/10.1073/pnas.1406143112
Author contributions: Y.-J.H. and D.-Y.L. designed research; Y.-J.H. and D.-Y.L. performed research; Y.-J.H., Y.L., and P.L.A. analyzed data; and Y.-J.H. and D.-Y.L. wrote the paper.
Edited by Elizabeth A. Thompson, University of Washington, Seattle, WA, and approved December 9, 2014 (received for review April 3, 2014)