EasyGeSe – a resource for benchmarking genomic prediction methods


Bibliographic Details
Published in: BMC Genomics Vol. 26; no. 1; p. 953
Main Authors: Quesada-Traver, Carles; Ariza-Suarez, Daniel; Studer, Bruno; Yates, Steven
Format: Journal Article
Language: English
Published: London, BioMed Central, 24.10.2025
BioMed Central Ltd
ISSN: 1471-2164
DOI: 10.1186/s12864-025-12129-0

Summary:
Background: Genomic prediction is a widely used method to predict phenotypes from genotypic data. Advances in both biological and computer science have enabled the generation of vast amounts of data and the development of new algorithms, specifically in the field of machine learning. However, systematic benchmarking of new genomic prediction methods, which is essential for objective evaluation and comparison, remains limited.
Results: Here, we present EasyGeSe, a tool that provides access to a curated collection of datasets for testing genomic prediction methods. This resource encompasses data from multiple species, including barley, common bean, lentil, loblolly pine, eastern oyster, maize, pig, rice, soybean and wheat, representing a broad biological diversity. We filtered and arranged these data in convenient formats, provided functions in R and Python for easy loading and benchmarked several modelling strategies for genomic prediction. Predictive performance, measured by Pearson's correlation coefficient (r), varied significantly by species and trait (p < 0.001), ranging from −0.08 to 0.96, with a mean of 0.62. Comparisons among parametric, semi-parametric and non-parametric models revealed modest but statistically significant (p < 1e-10) gains in accuracy for the non-parametric methods random forest (+0.014), LightGBM (+0.021) and XGBoost (+0.025). These methods also offered major computational advantages, with model fitting times typically an order of magnitude faster and RAM usage approximately 30% lower than Bayesian alternatives. However, these measurements do not account for the computational costs of hyperparameter tuning.
Conclusions: By standardizing input data and evaluation procedures, this resource simplifies benchmarking and enables fair, reproducible comparisons of genomic prediction methods. It also broadens access to genomic prediction data, encouraging data scientists and interdisciplinary researchers to test novel modelling strategies.
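
Illustrative sketch (not part of the record): the summary reports predictive performance as Pearson's correlation coefficient (r) between observed and predicted phenotypes. The Python snippet below shows one common way such a score can be computed. Everything in it is an assumption made for illustration: the data are synthetic, the k-fold cross-validation scheme is not taken from the article, ridge regression is a simple stand-in for a parametric model, and none of the function names belong to EasyGeSe.

```python
import numpy as np


def pearson_r(observed, predicted):
    """Pearson's correlation coefficient between two 1-D arrays."""
    o = observed - observed.mean()
    p = predicted - predicted.mean()
    return float((o @ p) / np.sqrt((o @ o) * (p @ p)))


def ridge_fit_predict(X_train, y_train, X_test, lam=1.0):
    """Fit ridge regression on marker codes and predict test phenotypes.

    Hypothetical stand-in for a parametric genomic prediction model.
    """
    mu_x = X_train.mean(axis=0)
    mu_y = y_train.mean()
    Xc = X_train - mu_x
    # Solve (Xc'Xc + lam * I) beta = Xc'(y - mean(y)) for marker effects.
    lhs = Xc.T @ Xc + lam * np.eye(Xc.shape[1])
    beta = np.linalg.solve(lhs, Xc.T @ (y_train - mu_y))
    return (X_test - mu_x) @ beta + mu_y


def cross_validated_r(X, y, k=5, lam=1.0, seed=0):
    """Mean Pearson's r over k folds (assumed evaluation scheme)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        pred = ridge_fit_predict(X[train_mask], y[train_mask], X[test_idx], lam)
        scores.append(pearson_r(y[test_idx], pred))
    return float(np.mean(scores))


# Synthetic example: 200 individuals, 50 markers coded 0/1/2.
rng = np.random.default_rng(42)
X = rng.integers(0, 3, size=(200, 50)).astype(float)
true_effects = rng.normal(size=50)
y = X @ true_effects + rng.normal(scale=1.0, size=200)

mean_r = cross_validated_r(X, y)
print(f"mean cross-validated Pearson r: {mean_r:.3f}")
```

Any other model could replace `ridge_fit_predict` here, provided it keeps the same fit-then-predict signature, so that the scoring step stays identical across methods.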