PolyBoost: An enhanced genomic variant classifier using extreme gradient boosting

Bibliographic Details
Published in	Proteomics (Weinheim) Vol. 15; no. 2-3; p. e1900124
Main Author	Parente, Daniel J.
Format	Journal Article
Language	English
Published	Germany Wiley Subscription Services, Inc 01.05.2021
Subjects	Artificial neural networks; Bayes Theorem; Bayesian analysis; Bioinformatics; Classifiers; Computational Biology - methods; Design of experiments; Exome - genetics; exome interpretation; Experimental design; Genetic Variation; Genetics; Genomics; Genomics - methods; gradient boosting; Humans; Impact prediction; Learning algorithms; Machine Learning; Neural networks; Neural Networks, Computer; Performance prediction; Software; Support Vector Machine; Support vector machines; variant classification; variant of uncertain significance
ISSN	1862-8346, 1615-9853, 1862-8354, 1615-9861
DOI	10.1002/prca.201900124

More Information
Summary:	Purpose: Human exome sequences contain 15,000–20,000 variants but many variants have unknown clinical impact. In silico predictive classifiers are recognized by the American College of Medical Genetics as a resource for interpreting these “variants of uncertain significance.” Many in silico classifiers have been developed, of which PolyPhen‐2 is highly successful and widely used. PolyPhen‐2 uses a naïve Bayes model to synthesize sequence, structural and genomic information. I investigated whether predictive performance could be improved by replacing PolyPhen‐2's naïve Bayes model with alternative machine learning methods.
Experimental design: Classifiers using the PolyPhen‐2 feature set were retrained using extreme gradient boosting (XGBoost), random forests, artificial neural networks, and support vector machines. Classifiers were externally validated on “pathogenic” and “benign” ClinVar variants absent from the training datasets. Software is implemented in Python and is freely available at https://github.com/djparente/polyboost and the Python Package Index (PyPI) under the BSD license.
Results: An XGBoost‐based classifier, designated PolyBoost (PolyPhen‐2 Booster), improves discriminative performance and calibration relative to PolyPhen‐2 in external validation on ClinVar.
Conclusions and clinical relevance: PolyBoost analyzes PolyPhen‐2 output and can be incorporated into existing bioinformatics workflows as a post‐analysis method to improve interpretation of clinical exome sequences obtained to identify monogenic disease.
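The summary reports gains in both discriminative performance and calibration but does not say which metrics were used. As a minimal sketch, assuming ROC AUC for discrimination and the Brier score for calibration (both common choices, not confirmed by this record), the comparison between a baseline score and a rescored output might look like the following. The labels and scores are hypothetical toy values, not ClinVar data or PolyBoost output, and the function names are illustrative only.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the fraction of (pathogenic, benign) pairs ranked correctly.

    Ties between a pathogenic and a benign score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one pathogenic and one benign variant")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def brier_score(labels: Sequence[int], probs: Sequence[float]) -> float:
    """Mean squared error between predicted probability and the 0/1 label.

    Lower is better; a well-calibrated, confident classifier scores near 0.
    """
    if len(labels) != len(probs) or not labels:
        raise ValueError("labels and probs must be non-empty and equal length")
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


# Hypothetical toy data: 1 = "pathogenic", 0 = "benign".
labels = [1, 1, 1, 0, 0, 0]
baseline = [0.9, 0.6, 0.4, 0.5, 0.3, 0.1]  # stand-in for a baseline score
rescored = [0.9, 0.8, 0.6, 0.4, 0.2, 0.1]  # stand-in for a rescored output

for name, scores in (("baseline", baseline), ("rescored", rescored)):
    print(name, "AUC:", roc_auc(labels, scores),
          "Brier:", brier_score(labels, scores))
```

A higher AUC means pathogenic variants are ranked above benign ones more often, while a lower Brier score means the predicted probabilities sit closer to the observed outcomes. The pairwise AUC loop is quadratic in the number of variants, so a rank-based implementation would be preferable for validation sets of realistic size.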