A comparative analysis of gene expression profiling by statistical and machine learning approaches

Bibliographic Details
Published in	Bioinformatics advances Vol. 5; no. 1; p. vbae199
Main Authors	Bontonou, Myriam; Haget, Anaïs; Boulougouri, Maria; Audit, Benjamin; Borgnat, Pierre; Arbona, Jean-Michel
Format	Journal Article
Language	English
Published	England: Oxford University Press, 2025 (Oxford Academic)
Subjects	Artificial Intelligence; Bioinformatics; Computer Science; Life Sciences; Quantitative Methods; Signal and Image Processing; Genomics (q-bio.GN); FOS: Biological sciences; FOS: Computer and information sciences; Machine Learning (cs.LG)
ISSN	2635-0041
DOI	10.1093/bioadv/vbae199

Summary:	Motivation: Many machine learning (ML) models developed to classify phenotype from gene expression data provide interpretations for their decisions, with the aim of understanding biological processes. For many models, including neural networks, these interpretations are lists of genes ranked by their importance for the predictions, with top-ranked genes likely linked to the phenotype. In this article, we discuss the limitations of such approaches, using integrated gradients, an explainability method developed for neural networks, as an example. Results: Experiments are performed on RNA sequencing data from public cancer databases. A collection of ML models, including multilayer perceptrons and graph neural networks, is trained to classify samples by cancer type. Gene rankings from integrated gradients are compared with genes highlighted by statistical feature selection methods such as DESeq2 and by other learning methods that measure global feature contribution. Experiments show that a small set of top-ranked genes is sufficient to achieve good classification. However, similar performance is possible with lower-ranked genes, although larger sets are required. Moreover, significant differences in top-ranked genes, especially between statistical and learning methods, prevent a comprehensive biological understanding. In conclusion, while these methods identify pathology-specific biomarkers, the completeness of gene sets selected by explainability techniques for understanding biological processes remains uncertain. Availability and implementation: Python code and datasets are available at https://github.com/mbonto/XAI_in_genomics.
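
For readers unfamiliar with the explainability method named in the summary, the following is a minimal sketch of integrated gradients used to rank input features. It is not the authors' implementation: a toy logistic regression stands in for the multilayer perceptrons and graph neural networks of the article, the data are random, the all-zero baseline is an assumption, and every name (`model`, `model_grad`, `integrated_gradients`, `n_genes`) is hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def model(x, w, b):
    """Toy differentiable classifier (logistic regression) for one sample."""
    return sigmoid(x @ w + b)


def model_grad(x, w, b):
    """Analytic gradient of the model output with respect to the input x."""
    p = model(x, w, b)
    return p * (1.0 - p) * w


def integrated_gradients(x, baseline, w, b, steps=200):
    """Approximate integrated gradients with a midpoint Riemann sum.

    Gradients are averaged along the straight path from the baseline to x,
    then scaled feature-wise by (x - baseline).
    """
    alphas = (np.arange(steps) + 0.5) / steps
    path = baseline + alphas[:, None] * (x - baseline)  # (steps, n_features)
    grads = np.stack([model_grad(point, w, b) for point in path])
    return (x - baseline) * grads.mean(axis=0)


# Synthetic stand-in for one expression profile (assumption: 6 "genes").
rng = np.random.default_rng(0)
n_genes = 6
w = rng.normal(size=n_genes)
b = 0.1
x = rng.normal(size=n_genes)
baseline = np.zeros(n_genes)  # assumed reference input

attributions = integrated_gradients(x, baseline, w, b)

# Rank features from most to least important by absolute attribution.
ranking = np.argsort(-np.abs(attributions))

# Completeness check: attributions should sum to f(x) - f(baseline).
completeness_gap = abs(
    attributions.sum() - (model(x, w, b) - model(baseline, w, b))
)
```

Here `ranking` plays the role of the ranked gene list discussed in the summary, and `completeness_gap` should be close to zero, since integrated gradients distributes the difference between the prediction at `x` and at the baseline across the input features.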