Integrative analysis of sequencing and array genotype data for discovering disease associations with rare mutations
| Published in | Proceedings of the National Academy of Sciences - PNAS Vol. 112; no. 4; pp. 1019 - 1024 |
|---|---|
| Main Authors | , , , |
| Format | Journal Article |
| Language | English |
| Published | United States, National Academy of Sciences, 27.01.2015 |
| ISSN | 0027-8424, 1091-6490 |
| DOI | 10.1073/pnas.1406143112 |
| Summary: | In the large cohorts that have been used for genome-wide association studies (GWAS), it is prohibitively expensive to sequence all cohort members. A cost-effective strategy is to sequence subjects with extreme values of quantitative traits or those with specific diseases. By imputing the sequencing data from the GWAS data for the cohort members who are not selected for sequencing, one can dramatically increase the number of subjects with information on rare variants. However, ignoring the uncertainties of imputed rare variants in downstream association analysis will inflate the type I error when sequenced subjects are not a random subset of the GWAS subjects. In this article, we provide a valid and efficient approach to combining observed and imputed data on rare variants. We consider commonly used gene-level association tests, all of which are constructed from the score statistic for assessing the effects of individual variants on the trait of interest. We show that the score statistic based on the observed genotypes for sequenced subjects and the imputed genotypes for nonsequenced subjects is unbiased. We derive a robust variance estimator that reflects the true variability of the score statistic regardless of the sampling scheme and imputation quality, such that the corresponding association tests always have correct type I error. We demonstrate through extensive simulation studies that the proposed tests are substantially more powerful than the use of accurately imputed variants only and the use of sequencing data alone. We provide an application to the Women’s Health Initiative. The relevant software is freely available. |
| Significance: | High-throughput DNA sequencing provides an unprecedented opportunity to discover rare genetic variants associated with complex diseases and traits. However, sequencing a large number of subjects is prohibitively expensive. It is common to select subjects for sequencing from the cohorts that have collected genotyping array data. We impute the sequencing data from the array data for the cohort members who are not selected for sequencing and perform gene-level association tests for rare variants by properly combining the observed genotypes for sequenced subjects and the imputed genotypes for nonsequenced subjects. This integrative analysis is substantially more powerful than the use of sequencing data alone and can accelerate the search for disease-causing mutations. |
| Bibliography: | http://dx.doi.org/10.1073/pnas.1406143112 Author contributions: Y.-J.H. and D.-Y.L. designed research; Y.-J.H. and D.-Y.L. performed research; Y.-J.H., Y.L., and P.L.A. analyzed data; and Y.-J.H. and D.-Y.L. wrote the paper. Edited by Elizabeth A. Thompson, University of Washington, Seattle, WA, and approved December 9, 2014 (received for review April 3, 2014) |
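
The summary describes building gene-level tests from a per-variant score statistic that uses observed genotypes for sequenced subjects and imputed genotypes for nonsequenced subjects. The following is a minimal illustrative sketch of that idea, not the authors' software: it assumes a linear model with no covariates, uses a generic empirical (sum of squared contributions) variance rather than the robust variance estimator derived in the article, and all function names and toy numbers are hypothetical.

```python
import numpy as np

def combined_genotypes(observed, dosage, sequenced):
    """Observed genotypes for sequenced subjects, imputed dosages for the rest."""
    return np.where(sequenced[:, None], observed, dosage)

def score_and_empirical_variance(y, g):
    """Per-variant score statistics and an empirical covariance estimate.

    Assumption: trait residuals are simply y minus its mean (no covariates).
    This is a generic estimator, not the article's robust variance estimator.
    """
    r = y - y.mean()                  # trait residuals under the null
    gc = g - g.mean(axis=0)           # column-centered genotypes
    contrib = r[:, None] * gc         # per-subject score contributions (n x m)
    u = contrib.sum(axis=0)           # score statistic for each variant
    v = contrib.T @ contrib           # empirical covariance of the scores
    return u, v

def burden_statistic(u, v, weights):
    """Weighted burden-type statistic, roughly chi-square(1) under the null."""
    return float((weights @ u) ** 2 / (weights @ v @ weights))

# Toy data (hypothetical): 4 subjects, 1 rare variant; the two subjects with
# extreme trait values are the ones selected for sequencing.
y = np.array([1.0, 2.0, 3.0, 4.0])
observed = np.array([[0.0], [1.0], [0.0], [2.0]])
dosage = np.array([[0.1], [0.9], [0.2], [1.8]])
sequenced = np.array([True, False, False, True])

g = combined_genotypes(observed, dosage, sequenced)
u, v = score_and_empirical_variance(y, g)
stat = burden_statistic(u, v, np.ones(g.shape[1]))
```

In practice the imputed dosages would come from an imputation tool run on the array data, and the variance calculation would need to account for the sampling scheme and imputation quality as the summary emphasizes.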