Using artificial intelligence (AI) to model clinical variant reporting for next generation sequencing (NGS) oncology assays

Bibliographic Details
Published in	BioData Mining, Vol. 18, no. 1
Main Authors	Doig, Kenneth D.; Perera, Rashindrie; Kankanige, Yamuna; Fellowes, Andrew; Li, Jason; Lupat, Richard; Thompson, Ella R.; Blombery, Piers; Fox, Stephen B.
Format	Journal Article
Language	English
Published	London: BioMed Central, 29.10.2025
Subjects	Algorithms; Bioinformatics; Biomedical and Life Sciences; Computational Biology/Bioinformatics; Computer Appl. in Life Sciences; Data Mining and Knowledge Discovery; Life Sciences; AI prediction algorithms; Somatic mutations; Precision oncology; Clinical decision support systems; Machine learning; Targeted sequencing; Cancer genomics; CDSS; Clinical diagnostics; Variant calling
ISSN	1756-0381
DOI	10.1186/s13040-025-00489-y

Summary:	Background: Targeted next generation sequencing (NGS) of somatic DNA is now routinely used for diagnostic and predictive reporting in the oncology clinic. The expert genomic analysis required for NGS assays remains a bottleneck to scaling the volume of patients assessed. This study harnesses data from targeted clinical sequencing to build machine learning models that predict whether patient variants should be reported. Methods: Data from three somatic assays were used to build machine learning prediction models with four estimators: Logistic Regression, Random Forest, XGBoost and Neural Networks. Using variants selected as reportable through manual expert curation as ground truth, we built models to classify clinically reportable variants. The assays were performed between 2020 and 2023, yielded 1,350,018 variants and were used to report on 10,116 patients. All variants, together with 211 annotations and sequencing features, were used by the models to predict the likelihood of a variant being reported. Results: The tree-based ensemble models performed consistently well, achieving between 0.904 and 0.996 on the precision-recall area under the curve (PRC AUC) metric when predicting whether a variant should be reported. To assist model explainability, individual model predictions were presented to users within a tertiary analysis platform as a waterfall plot showing individual feature contributions and their values for the variant. Over 30% of the model performance was due to features sourced from statistics derived in-house from the sequencing assay, which precludes easy generalization of the models to other assays or laboratories. Conclusions: Longitudinally acquired NGS assay data provide a strong basis for machine learning models that support the selection of variants for clinical oncology reports. The models provide a framework for consistent reporting practices and for reducing inter-reviewer variability. To improve model transparency, individual variant predictions can be presented as part of reviewer workflows.
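
The PRC AUC metric quoted in the summary rewards a model for ranking the few reportable variants above the many unreported ones. The following is a minimal sketch of one common way to compute it, the step-wise average precision, and is not necessarily the calculation used in the study. The function name `average_precision`, the toy labels and the toy scores are illustrative assumptions, and tied scores are assumed not to occur.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise area under the precision-recall curve (average precision).

    labels: 1 if the variant was reported by a curator, 0 otherwise.
    scores: model-predicted likelihood of the variant being reported.
    Assumption: no tied scores (ties would need explicit handling).
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one reported variant is required")

    # Rank variants from highest to lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    true_pos = 0
    ap = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            # Precision at this rank, weighted by the recall gained (1 / total_pos).
            ap += (true_pos / rank) / total_pos
    return ap


# Hypothetical toy data: 2 reported variants among 8 candidates.
toy_labels = [1, 0, 1, 0, 0, 0, 0, 0]
toy_scores = [0.90, 0.80, 0.70, 0.40, 0.30, 0.20, 0.10, 0.05]
toy_ap = average_precision(toy_labels, toy_scores)
```

In this toy case the reported variants sit at ranks 1 and 3, so the score is (1/1 + 2/3) / 2 = 5/6. A model that ranks variants at random would be expected to score near the fraction of reported variants, here 2/8. That baseline is why a precision-recall metric is informative when reportable variants are rare.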