HABiC: an algorithm based on the exact computation of the Kantorovich-Rubinstein optimizer for binary classification in transcriptomics


Bibliographic Details
Published in Bioinformatics (Oxford, England) Vol. 41; no. 6
Main Authors Cordier, Chiara; Jézéquel, Pascal; Campone, Mario; Panloup, Fabien; Basseville, Agnes
Format Journal Article
Language English
Published England Oxford Publishing Limited (England) 01.06.2025
Oxford University Press
ISSN 1367-4811
1367-4803
DOI 10.1093/bioinformatics/btaf310

Summary: Machine learning analyses of molecular omics datasets largely drive the development of precision medicine in oncology, but mathematical challenges still hamper their application in the clinic. In particular, omics-based learning relies on high dimensional data with high degrees of freedom and multicollinearity issues, requiring more tailored algorithms. Here, we have developed a prediction algorithm that relies on the 1-Wasserstein distance to better capture complex relationships between variables, and that is built on a decision rule based on the exact computation of the Kantorovich-Rubinstein optimizer to increase the algorithm precision. We explored dimension reduction and aggregation methods to improve its robustness. The exact method was compared with a neural network-based approximate method, as well as with standard Euclidean distance-based classifiers. Experimental results on synthetic datasets with multiple scenarios of redundant/informative variables revealed that exact and approximate methods based on Wasserstein distance outperformed state-of-the-art algorithms when class information was spread across a large number of variables. When predicting clinical or biological outcomes from transcriptomics datasets, HABiC achieved consistently higher accuracy in most situations. Python code for the HABiC classifier is available at https://github.com/chiaraco/HABiC.
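Illustrative note: the summary refers to a decision rule built on an exactly computed Kantorovich-Rubinstein optimizer, that is, a 1-Lipschitz function f maximizing the gap between the mean of f over one class and the mean of f over the other. The sketch below is not the HABiC implementation (see the repository linked above for that). It is a one-dimensional toy, where the optimizer has a closed form: its slope on each interval is the sign of the difference between the two empirical CDFs. The function names, the constant extension outside the observed range, and the midpoint threshold used to classify are all assumptions made for illustration.

```python
from bisect import bisect_right


def kr_potential_1d(xs, ys):
    """Exact 1D Kantorovich-Rubinstein potential between two samples.

    Returns (knots, values, w1): f is piecewise linear with
    f(knots[i]) = values[i], and w1 is the 1-Wasserstein distance
    between the two empirical distributions.
    """
    xs_sorted, ys_sorted = sorted(xs), sorted(ys)
    knots = sorted(set(xs_sorted) | set(ys_sorted))
    values = [0.0]  # f is defined up to a constant; fix f(knots[0]) = 0
    w1 = 0.0
    for left, right in zip(knots, knots[1:]):
        # Both empirical CDFs are constant on [left, right).
        cdf_x = bisect_right(xs_sorted, left) / len(xs_sorted)
        cdf_y = bisect_right(ys_sorted, left) / len(ys_sorted)
        gap = cdf_y - cdf_x
        # Optimal 1-Lipschitz slope on this interval: sign(F_y - F_x).
        slope = (gap > 0) - (gap < 0)
        values.append(values[-1] + slope * (right - left))
        w1 += abs(gap) * (right - left)
    return knots, values, w1


def evaluate(knots, values, t):
    """Evaluate f at t, extended as a constant outside the knots (assumption)."""
    if t <= knots[0]:
        return values[0]
    if t >= knots[-1]:
        return values[-1]
    i = bisect_right(knots, t) - 1
    slope = (values[i + 1] - values[i]) / (knots[i + 1] - knots[i])
    return values[i] + slope * (t - knots[i])


def mean_potential(knots, values, sample):
    """Average of f over a sample."""
    return sum(evaluate(knots, values, t) for t in sample) / len(sample)


def classify(knots, values, threshold, t):
    """Hypothetical rule: label 'x' when f(t) exceeds the threshold, else 'y'."""
    return "x" if evaluate(knots, values, t) > threshold else "y"


# Toy data: class "x" sits to the right of class "y".
xs = [2.0, 3.0]
ys = [0.0, 1.0]
knots, values, w1 = kr_potential_1d(xs, ys)
mean_x = mean_potential(knots, values, xs)
mean_y = mean_potential(knots, values, ys)
dual_value = mean_x - mean_y      # equals w1 by Kantorovich-Rubinstein duality
threshold = (mean_x + mean_y) / 2  # illustrative midpoint threshold
```

In this toy case the dual value and the 1-Wasserstein distance coincide, which is the duality the exact method relies on. In higher dimensions no such closed form exists and the optimizer has to be obtained by solving an optimization problem over the sample points.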