PolyBoost: An enhanced genomic variant classifier using extreme gradient boosting
| Published in | Proteomics (Weinheim) Vol. 15; no. 2-3; pp. e1900124 - n/a |
|---|---|
| Main Author | |
| Format | Journal Article |
| Language | English |
| Published | Germany: Wiley Subscription Services, Inc, 01.05.2021 |
| Subjects | |
| ISSN | 1862-8346, 1615-9853, 1862-8354, 1615-9861 |
| DOI | 10.1002/prca.201900124 |
| Summary: | Purpose
Human exome sequences contain 15,000–20,000 variants, but many variants have unknown clinical impact. In silico predictive classifiers are recognized by the American College of Medical Genetics as a resource for interpreting these “variants of uncertain significance.” Many in silico classifiers have been developed, of which PolyPhen‐2 is highly successful and widely used. PolyPhen‐2 uses a naïve Bayes model to synthesize sequence, structural, and genomic information. I investigated whether predictive performance could be improved by replacing PolyPhen‐2’s naïve Bayes model with alternative machine learning methods.
Experimental design
Classifiers using the PolyPhen‐2 feature set were retrained using extreme gradient boosting (XGBoost), random forests, artificial neural networks, and support vector machines. Classifiers were externally validated on “pathogenic” and “benign” ClinVar variants absent from the training datasets. Software is implemented in Python and is freely available at https://github.com/djparente/polyboost and the Python Package Index (PyPI) under the BSD license.
Results
An XGBoost‐based classifier—designated PolyBoost (PolyPhen‐2 Booster)—improves discriminative performance and calibration relative to PolyPhen‐2 in external validation on ClinVar.
Conclusions and clinical relevance
PolyBoost analyzes PolyPhen‐2 output and can be incorporated into existing bioinformatics workflows as a post‐analysis method to improve interpretation of clinical exome sequences obtained to identify monogenic disease. |
|---|---|