A comparative analysis of gene expression profiling by statistical and machine learning approaches

Bibliographic Details
Published in Bioinformatics Advances Vol. 5; no. 1; p. vbae199
Main Authors Bontonou, Myriam; Haget, Anaïs; Boulougouri, Maria; Audit, Benjamin; Borgnat, Pierre; Arbona, Jean-Michel
Format Journal Article
Language English
Published England Oxford University Press 2025
Oxford academic
ISSN 2635-0041
DOI 10.1093/bioadv/vbae199

Summary: Abstract

Motivation: Many machine learning (ML) models developed to classify phenotype from gene expression data provide interpretations for their decisions, with the aim of understanding biological processes. For many models, including neural networks, interpretations are lists of genes ranked by their importance for the predictions, with top-ranked genes likely linked to the phenotype. In this article, we discuss the limitations of such approaches using integrated gradients, an explainability method developed for neural networks, as an example.

Results: Experiments are performed on RNA sequencing data from public cancer databases. A collection of ML models, including multilayer perceptrons and graph neural networks, is trained to classify samples by cancer type. Gene rankings from integrated gradients are compared to genes highlighted by statistical feature selection methods such as DESeq2 and by other learning methods that measure global feature contribution. Experiments show that a small set of top-ranked genes is sufficient to achieve good classification. However, similar performance is possible with lower-ranked genes, although larger sets are required. Moreover, significant differences in top-ranked genes, especially between statistical and learning methods, prevent a comprehensive biological understanding. In conclusion, while these methods identify pathology-specific biomarkers, the completeness of gene sets selected by explainability techniques for understanding biological processes remains uncertain.

Availability and implementation: Python code and datasets are available at https://github.com/mbonto/XAI_in_genomics.
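
Illustrative note: the summary describes ranking genes by integrated gradients. The sketch below shows one way such a ranking could be computed. It is not the authors' implementation (that is in the repository linked above). Everything in it is an assumption made for illustration: a toy logistic regression stands in for the neural networks, the data and weights are synthetic, the baseline is an all-zero expression vector, and the function names (`model`, `model_grad`, `integrated_gradients`) are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def model(x, w, b):
    """Toy differentiable classifier (assumption): logistic regression on one sample."""
    return sigmoid(x @ w + b)

def model_grad(x, w, b):
    """Analytic gradient of the model output with respect to the input vector x."""
    p = model(x, w, b)
    return p * (1.0 - p) * w

def integrated_gradients(x, baseline, w, b, steps=500):
    """Midpoint Riemann-sum approximation of integrated gradients.

    Gradients are averaged along the straight path from `baseline` to `x`,
    then scaled by the input difference, giving one attribution per gene.
    """
    alphas = (np.arange(steps) + 0.5) / steps
    path = baseline + alphas[:, None] * (x - baseline)  # shape (steps, n_genes)
    grads = np.stack([model_grad(point, w, b) for point in path])
    return (x - baseline) * grads.mean(axis=0)

# Synthetic stand-ins for a trained model and one expression profile.
rng = np.random.default_rng(0)
n_genes = 20
w = rng.normal(size=n_genes)
b = 0.1
x = rng.normal(size=n_genes)
baseline = np.zeros(n_genes)  # assumed baseline: all-zero expression

attributions = integrated_gradients(x, baseline, w, b)

# Rank genes by absolute attribution, most important first.
ranking = np.argsort(-np.abs(attributions))

# Completeness check: attributions should sum to f(x) - f(baseline).
completeness_gap = abs(
    attributions.sum() - (model(x, w, b) - model(baseline, w, b))
)
```

The completeness check is a useful sanity test for any integrated-gradients implementation: if the attributions do not sum approximately to the difference between the model output at the input and at the baseline, the path integral is under-resolved or the gradient is wrong. Note that the resulting ranking is specific to one sample and one baseline, so a dataset-level gene ranking would need some aggregation across samples, which this sketch does not attempt.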