Omics Data Preprocessing for Machine Learning: A Case Study in Childhood Obesity

Bibliographic Details
Published in Genes Vol. 14; no. 2; p. 248
Main Authors Torres-Martos, Álvaro; Bustos-Aibar, Mireia; Ramírez-Mena, Alberto; Cámara-Sánchez, Sofía; Anguita-Ruiz, Augusto; Alcalá, Rafael; Aguilera, Concepción M.; Alcalá-Fdez, Jesús
Format Journal Article
Language English
Published Switzerland MDPI AG 18.01.2023
ISSN 2073-4425
DOI 10.3390/genes14020248

More Information
Summary: The use of machine learning techniques to build predictive models of disease outcomes (based on omics and other types of molecular data) has gained enormous relevance in the biomedical field in the last few years. Nonetheless, the usefulness of omics studies and machine learning tools depends on the proper application of algorithms as well as on the appropriate pre-processing and management of the input omics and molecular data. Currently, many of the available approaches that use machine learning on omics data for predictive purposes make mistakes in one or more of the following key steps: experimental design, feature selection, data pre-processing, and algorithm selection. For this reason, we propose the current work as a guideline on how to confront the main challenges inherent to multi-omics human data. A series of best practices and recommendations is presented for each of the steps defined. In particular, we describe the main particularities of each omics data layer, the most suitable pre-processing approaches for each source, and a compilation of best practices and tips for studying disease development prediction using machine learning. Using examples of real data, we show how to address the key problems mentioned in multi-omics research (e.g., biological heterogeneity, technical noise, high dimensionality, presence of missing values, and class imbalance). Finally, we define proposals for model improvement based on the results found, which serve as the basis for future work.
These authors contributed equally to this work.
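
The summary lists missing values and class imbalance among the key problems in multi-omics data but does not say which methods the article uses. As a rough illustration only, the sketch below assumes per-feature median imputation and inverse-frequency class weights as stand-ins; the function names and the toy values are hypothetical and are not taken from the article. The point it tries to show is that imputation statistics are learned from the training samples alone and then reused on held-out samples, so that no information leaks from the test set.

```python
from collections import Counter
from statistics import median


def fit_medians(rows):
    """Per-feature medians from the training rows only; None marks a missing value."""
    n_features = len(rows[0])
    medians = []
    for j in range(n_features):
        observed = [r[j] for r in rows if r[j] is not None]
        # Arbitrary fallback (assumption): 0.0 if a feature is never observed.
        medians.append(median(observed) if observed else 0.0)
    return medians


def impute(rows, medians):
    """Replace each missing value with the training median of its feature."""
    return [[medians[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}


# Toy data invented for this sketch (rows are samples, columns are features).
X_train = [[1.0, None, 3.0], [2.0, 5.0, None], [4.0, 7.0, 9.0], [None, 6.0, 6.0]]
y_train = ["control", "control", "control", "case"]
X_test = [[None, None, 1.0]]

medians = fit_medians(X_train)        # learned on training samples only
X_train_imp = impute(X_train, medians)
X_test_imp = impute(X_test, medians)  # test samples reuse the training medians
weights = balanced_class_weights(y_train)
```

The resulting weights could be passed to any learner that accepts per-class weights; the minority class ("case" here) receives the larger weight.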