PolyBoost: An enhanced genomic variant classifier using extreme gradient boosting

Bibliographic Details
Published in Proteomics (Weinheim) Vol. 15; no. 2-3; pp. e1900124 - n/a
Main Author Parente, Daniel J.
Format Journal Article
Language English
Published Germany Wiley Subscription Services, Inc 01.05.2021
ISSN 1862-8346
1615-9853
1862-8354
1615-9861
DOI 10.1002/prca.201900124

Summary:
Purpose: Human exome sequences contain 15,000–20,000 variants but many variants have unknown clinical impact. In silico predictive classifiers are recognized by the American College of Medical Genetics as a resource for interpreting these “variants of uncertain significance.” Many in silico classifiers have been developed, of which PolyPhen‐2 is highly successful and widely used. PolyPhen‐2 uses a naïve Bayes model to synthesize sequence, structural and genomic information. I investigated whether predictive performance could be improved by replacing PolyPhen‐2's naïve Bayes model with alternative machine learning methods.
Experimental design: Classifiers using the PolyPhen‐2 feature set were retrained using extreme gradient boosting (XGBoost), random forests, artificial neural networks, and support vector machines. Classifiers were externally validated on “pathogenic” and “benign” ClinVar variants absent from the training datasets. Software is implemented in Python and is freely available at https://github.com/djparente/polyboost and the Python Package Index (PyPI) under the BSD license.
Results: An XGBoost‐based classifier, designated PolyBoost (PolyPhen‐2 Booster), improves discriminative performance and calibration relative to PolyPhen‐2 in external validation on ClinVar.
Conclusions and clinical relevance: PolyBoost analyzes PolyPhen‐2 output and can be incorporated into existing bioinformatics workflows as a post‐analysis method to improve interpretation of clinical exome sequences obtained to identify monogenic disease.
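The summary reports gains in both discriminative performance and calibration but does not name the metrics used. As a minimal sketch, assuming ROC AUC for discrimination and the Brier score for calibration (both are illustrative choices, not necessarily those of the paper), the comparison between a baseline score and a re-scored output could look like the following. The function names and the toy labels and scores are hypothetical and are not drawn from ClinVar, PolyPhen‐2, or the PolyBoost package.

```python
def roc_auc(labels, scores):
    """Discrimination: probability that a randomly chosen pathogenic
    variant (label 1) scores higher than a randomly chosen benign
    variant (label 0). Ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one variant of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def brier_score(labels, scores):
    """Calibration-sensitive error: mean squared difference between
    the predicted probability and the 0/1 outcome (lower is better)."""
    if len(labels) != len(scores) or not labels:
        raise ValueError("labels and scores must be non-empty and equal length")
    return sum((s - y) ** 2 for y, s in zip(labels, scores)) / len(labels)


# Hypothetical toy data: 1 = "pathogenic", 0 = "benign".
labels = [1, 1, 1, 0, 0, 0]
baseline = [0.9, 0.6, 0.4, 0.5, 0.3, 0.1]  # stand-in for an original score
rescored = [0.9, 0.8, 0.6, 0.3, 0.2, 0.1]  # stand-in for a re-scored output

for name, scores in (("baseline", baseline), ("rescored", rescored)):
    print(name, round(roc_auc(labels, scores), 3), round(brier_score(labels, scores), 3))
```

A higher AUC together with a lower Brier score on variants held out from training is the pattern that would correspond to the improvement the summary describes. The pairwise AUC loop is quadratic in the number of variants, which is fine for illustration but would normally be replaced by a rank-based computation on large validation sets.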