HABiC: an algorithm based on the exact computation of the Kantorovich-Rubinstein optimizer for binary classification in transcriptomics

Bibliographic Details
Published in	Bioinformatics (Oxford, England) Vol. 41; no. 6
Main Authors	Cordier, Chiara; Jézéquel, Pascal; Campone, Mario; Panloup, Fabien; Basseville, Agnes
Format	Journal Article
Language	English
Published	England; Oxford Publishing Limited (England); 01.06.2025; Oxford University Press
Subjects	Algorithms; Availability; Complex variables; Computation; Computational Biology - methods; Datasets; Euclidean geometry; Gene Expression Profiling - methods; Humans; Machine Learning; Neural networks; Neural Networks, Computer; Original Paper; Precision medicine; Software; Synthetic data; Transcriptome; Transcriptomics
ISSN	1367-4811, 1367-4803
DOI	10.1093/bioinformatics/btaf310

Summary:	Machine learning analyses of molecular omics datasets largely drive the development of precision medicine in oncology, but mathematical challenges still hamper their application in the clinic. In particular, omics-based learning relies on high-dimensional data with many degrees of freedom and multicollinearity issues, which calls for more tailored algorithms. Here, we have developed a prediction algorithm that relies on the 1-Wasserstein distance to better capture complex relationships between variables, and that is built on a decision rule based on the exact computation of the Kantorovich-Rubinstein optimizer to increase the algorithm's precision. We explored dimension reduction and aggregation methods to improve its robustness. The exact method was compared with a neural network-based approximate method, as well as with standard Euclidean distance-based classifiers. Experimental results on synthetic datasets with multiple scenarios of redundant/informative variables revealed that both the exact and the approximate Wasserstein-based methods outperformed state-of-the-art algorithms when class information was spread across a large number of variables. When predicting clinical or biological outcomes from transcriptomics datasets, HABiC achieved higher accuracy in most situations. Python code for the HABiC classifier is available at https://github.com/chiaraco/HABiC.
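
To make the idea in the summary concrete, the following is a minimal sketch of a classifier built on the Kantorovich-Rubinstein dual, in which the 1-Wasserstein distance between two class distributions P and Q is the supremum of E_P[f] - E_Q[f] over 1-Lipschitz functions f. It is not the authors' implementation (that is in the repository linked above). The function names `fit_kr_potential`, `kr_score` and `kr_predict` are hypothetical, and three choices are assumptions of this sketch rather than details stated in the record: the Euclidean ground metric, the extension of f to unseen samples by averaging the largest and smallest 1-Lipschitz extensions, and the midpoint threshold.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist


def fit_kr_potential(X0, X1):
    """Solve the discrete Kantorovich-Rubinstein dual on the pooled training points.

    Returns the pooled points Z, the optimal potential f on Z, and the W1 value.
    """
    Z = np.vstack([X0, X1])
    n0, n1 = len(X0), len(X1)
    N = n0 + n1
    D = cdist(Z, Z)  # Euclidean ground metric (an assumption of this sketch)

    # Maximize mean(f on class 0) - mean(f on class 1); linprog minimizes, so negate.
    a = np.concatenate([np.full(n0, 1.0 / n0), np.full(n1, -1.0 / n1)])

    # 1-Lipschitz constraints f[k] - f[l] <= D[k, l] for every ordered pair k != l.
    k, l = np.nonzero(~np.eye(N, dtype=bool))
    A = np.zeros((k.size, N))
    A[np.arange(k.size), k] = 1.0
    A[np.arange(k.size), l] = -1.0

    # f is only defined up to an additive constant, so pin f[0] = 0.
    bounds = [(0.0, 0.0)] + [(None, None)] * (N - 1)
    res = linprog(-a, A_ub=A, b_ub=D[k, l], bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return Z, res.x, -res.fun


def kr_score(Z, f, X_new):
    """Extend f to new points with a 1-Lipschitz function (assumed extension rule)."""
    dist = cdist(X_new, Z)
    upper = np.min(f[None, :] + dist, axis=1)  # largest 1-Lipschitz extension
    lower = np.max(f[None, :] - dist, axis=1)  # smallest 1-Lipschitz extension
    return 0.5 * (upper + lower)


def kr_predict(Z, f, n0, X_new):
    """Label 0 where the score is above the midpoint of the class means of f, else 1."""
    threshold = 0.5 * (f[:n0].mean() + f[n0:].mean())
    return np.where(kr_score(Z, f, X_new) > threshold, 0, 1)
```

The linear program has one constraint per ordered pair of training samples, so this sketch only suits small cohorts. Calling `fit_kr_potential(X0, X1)` on two arrays of shape (samples, features) and then `kr_predict(Z, f, len(X0), X_new)` returns 0/1 labels for the rows of `X_new`.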