EasyGeSe – a resource for benchmarking genomic prediction methods

Bibliographic Details
Published in	BMC Genomics, Vol. 26, no. 1, p. 953
Main Authors	Quesada-Traver, Carles; Ariza-Suarez, Daniel; Studer, Bruno; Yates, Steven
Format	Journal Article
Language	English
Published	London: BioMed Central Ltd, 24.10.2025
Subjects	Algorithms; Animal Genetics and Genomics; Animals; Benchmarking; Benchmarks; Biomedical and Life Sciences; Computer science; Databases, Genetic; Diseases and pests; Genetic aspects; Genomics - methods; Growth; Health aspects; Life Sciences; Machine learning; Methods; Microarrays; Microbial Genetics and Genomics; Phenotype; Plant Genetics and Genomics; Proteomics; Quantitative genetics; Software; Soybean; Wheat; Switzerland; Genomic prediction; Database; Genomic selection
ISSN	1471-2164
DOI	10.1186/s12864-025-12129-0

Summary:	Background: Genomic prediction is a widely used method to predict phenotypes from genotypic data. Advances in both biological and computer science have enabled the generation of vast amounts of data and the development of new algorithms, specifically in the field of machine learning. However, systematic benchmarking of new genomic prediction methods, which is essential for objective evaluation and comparison, remains limited.
Results: Here, we present EasyGeSe, a tool that provides access to a curated collection of datasets for testing genomic prediction methods. This resource encompasses data from multiple species, including barley, common bean, lentil, loblolly pine, eastern oyster, maize, pig, rice, soybean and wheat, representing a broad biological diversity. We filtered and arranged these data in convenient formats, provided functions in R and Python for easy loading and benchmarked several modelling strategies for genomic prediction. Predictive performance, measured by Pearson’s correlation coefficient (r), varied significantly by species and trait (p < 0.001), ranging from −0.08 to 0.96, with a mean of 0.62. Comparisons among parametric, semi-parametric and non-parametric models revealed modest but statistically significant (p < 1e−10) gains in accuracy for the non-parametric methods random forest (+0.014), LightGBM (+0.021) and XGBoost (+0.025). These methods also offered major computational advantages, with model fitting times typically an order of magnitude faster and RAM usage approximately 30% lower than Bayesian alternatives. However, these measurements do not account for the computational costs of hyperparameter tuning.
Conclusions: By standardizing input data and evaluation procedures, this resource simplifies benchmarking and enables fair, reproducible comparisons of genomic prediction methods. It also broadens access to genomic prediction data, encouraging data scientists and interdisciplinary researchers to test novel modelling strategies.