Omics Data Preprocessing for Machine Learning: A Case Study in Childhood Obesity

Bibliographic Details
Published in	Genes Vol. 14; no. 2; p. 248
Main Authors	Torres-Martos, Álvaro, Bustos-Aibar, Mireia, Ramírez-Mena, Alberto, Cámara-Sánchez, Sofía, Anguita-Ruiz, Augusto, Alcalá, Rafael, Aguilera, Concepción M., Alcalá-Fdez, Jesús
Format	Journal Article
Language	English
Published	Switzerland: MDPI AG, 18.01.2023
Subjects	Algorithms; Artificial intelligence; Biochemistry; Biomarkers; Case studies; Child; childhood obesity; Children; Control; Data analysis; Data processing; Diagnosis; Feature selection; Gene expression; Genomes; guidelines; Humans; Learning algorithms; Machine Learning; Medical research; Medicine, Experimental; Metabolism; multiomics; Obesity; Obesity in children; Pediatric Obesity; prediction; Prediction models; Puberty; Quality control; Software; Spain; omics; machine learning; data pre-processing
ISSN	2073-4425
DOI	10.3390/genes14020248

Summary:	The use of machine learning techniques to build predictive models of disease outcomes from omics and other types of molecular data has gained considerable relevance in the biomedical field in recent years. Nonetheless, the value of omics studies and machine learning tools depends on the proper application of algorithms and on the appropriate pre-processing and management of the input omics and molecular data. Many current approaches that apply machine learning to omics data for predictive purposes make mistakes in one or more of the following key steps: experimental design, feature selection, data pre-processing, and algorithm selection. For this reason, we propose the current work as a guideline on how to confront the main challenges inherent to multi-omics human data, and we present a series of best practices and recommendations for each of these steps. In particular, we describe the particularities of each omics data layer, the most suitable pre-processing approaches for each source, and a compilation of best practices and tips for predicting disease development with machine learning. Using examples of real data, we show how to address the key problems of multi-omics research (e.g., biological heterogeneity, technical noise, high dimensionality, missing values, and class imbalance). Finally, we define proposals for model improvement based on the results found, which serve as the basis for future work.
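Two of the problems named in the summary, missing values and class imbalance, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the toy data, and the choice of median imputation and inverse-frequency class weights are all assumptions made for illustration. The point it tries to show is that pre-processing parameters should be learned from the training split only and then reused on held-out data.

```python
from collections import Counter
from statistics import median


def fit_median_imputer(train_rows):
    """Learn one median per feature from the training rows only.

    Missing values are encoded as None (an assumption of this sketch).
    A feature with no observed training values would raise StatisticsError.
    """
    n_features = len(train_rows[0])
    medians = []
    for j in range(n_features):
        observed = [row[j] for row in train_rows if row[j] is not None]
        medians.append(median(observed))
    return medians


def apply_imputer(rows, medians):
    """Fill missing entries with the medians learned on the training split."""
    return [
        [medians[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical toy data: 4 training samples, 2 features, imbalanced labels.
train_X = [[1.0, None], [3.0, 4.0], [None, 8.0], [5.0, 6.0]]
train_y = [0, 0, 0, 1]
test_X = [[None, None]]

medians = fit_median_imputer(train_X)          # fitted on training data only
train_filled = apply_imputer(train_X, medians)
test_filled = apply_imputer(test_X, medians)   # test data reuses training medians
weights = balanced_class_weights(train_y)      # minority class gets a larger weight
```

In a cross-validated setting the same idea would apply per fold: the imputer is refitted on each training fold, so no information from the held-out fold leaks into the pre-processing step.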
Bibliography:	These authors contributed equally to this work.