EnsembleCNV: an ensemble machine learning algorithm to identify and genotype copy number variation using SNP array data
| Published in | Nucleic Acids Research, Vol. 47, no. 7, p. e39 |
|---|---|
| Format | Journal Article | 
| Language | English | 
| Published | England: Oxford University Press, 23.04.2019 |
| ISSN | 0305-1048, 1362-4962, 1362-4954 |
| DOI | 10.1093/nar/gkz068 | 
| Summary | The associations between diseases/traits and copy number variants (CNVs) have not been systematically investigated in genome-wide association studies (GWASs), primarily due to a lack of robust and accurate tools for CNV genotyping. Herein, we propose a novel ensemble learning framework, ensembleCNV, to detect and genotype CNVs using single nucleotide polymorphism (SNP) array data. EnsembleCNV (a) identifies and eliminates batch effects at the raw-data level; (b) assembles individual CNV calls from multiple existing callers with complementary strengths into CNV regions (CNVRs) by a heuristic algorithm; (c) re-genotypes each CNVR with a local likelihood model adjusted by global information across multiple CNVRs; (d) refines CNVR boundaries using the local correlation structure in copy number intensities; (e) provides direct CNV genotyping accompanied by a confidence score, directly accessible for downstream quality control and association analysis. Benchmarked on two large datasets, ensembleCNV outperformed competing methods and achieved a high call rate (93.3%) and reproducibility (98.6%), while concurrently achieving high sensitivity by capturing 85% of common CNVs documented in the 1000 Genomes Project. Given this CNV call rate and accuracy, which are comparable to SNP genotyping, we suggest ensembleCNV holds significant promise for performing genome-wide CNV association studies and investigating how CNVs predispose to human diseases. |
| Notes | The authors wish it to be known that, in their opinion, the first two authors should be regarded as Joint First Authors. |
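
Step (b) of the summary, assembling individual CNV calls from several callers into CNV regions, can be illustrated with a minimal sketch. This is not the ensembleCNV heuristic, which the record does not describe in detail. It assumes a much simpler rule (merge any calls on the same chromosome whose coordinates overlap), and the `Call` type, the caller labels, and `merge_calls` are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple


class Call(NamedTuple):
    """One CNV call from one caller (hypothetical structure, inclusive coordinates)."""
    caller: str
    chrom: str
    start: int
    end: int


def merge_calls(calls):
    """Merge overlapping calls on the same chromosome into candidate regions.

    Assumption: plain interval merging stands in for the paper's heuristic.
    """
    regions = []
    for c in sorted(calls, key=lambda c: (c.chrom, c.start, c.end)):
        last = regions[-1] if regions else None
        if last and last["chrom"] == c.chrom and c.start <= last["end"]:
            # Overlaps the current region: extend it and record the caller.
            last["end"] = max(last["end"], c.end)
            last["callers"].add(c.caller)
        else:
            # No overlap: open a new region.
            regions.append(
                {"chrom": c.chrom, "start": c.start, "end": c.end, "callers": {c.caller}}
            )
    return regions


# Made-up example calls, not data from the paper.
calls = [
    Call("callerA", "chr1", 100, 200),
    Call("callerB", "chr1", 150, 260),
    Call("callerA", "chr1", 500, 600),
    Call("callerB", "chr2", 100, 200),
]
regions = merge_calls(calls)
for r in regions:
    print(r["chrom"], r["start"], r["end"], sorted(r["callers"]))
```

In this toy input, the two overlapping chr1 calls collapse into one region supported by both callers, while the remaining calls stay as separate single-caller regions.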