EasyGeSe – a resource for benchmarking genomic prediction methods
      
    
| Published in | BMC Genomics Vol. 26; no. 1; p. 953 |
|---|---|
| Main Authors | , , , | 
| Format | Journal Article | 
| Language | English | 
| Published | London: BioMed Central, 24.10.2025 (BioMed Central Ltd) |
| ISSN | 1471-2164 |
| DOI | 10.1186/s12864-025-12129-0 | 
| Summary: | Background
Genomic prediction is a widely used method to predict phenotypes from genotypic data. Advances in both biological and computer science have enabled the generation of vast amounts of data and the development of new algorithms, specifically in the field of machine learning. However, systematic benchmarking of new genomic prediction methods, which is essential for objective evaluation and comparison, remains limited.
Results
Here, we present EasyGeSe, a tool that provides access to a curated collection of datasets for testing genomic prediction methods. This resource encompasses data from multiple species, including barley, common bean, lentil, loblolly pine, eastern oyster, maize, pig, rice, soybean and wheat, representing a broad biological diversity. We filtered and arranged these data in convenient formats, provided functions in R and Python for easy loading and benchmarked several modelling strategies for genomic prediction. Predictive performance, measured by Pearson’s correlation coefficient (r), varied significantly by species and trait (p < 0.001), ranging from −0.08 to 0.96, with a mean of 0.62. Comparisons among parametric, semi-parametric and non-parametric models revealed modest but statistically significant (p < 1e−10) gains in accuracy for the non-parametric methods random forest (+0.014), LightGBM (+0.021) and XGBoost (+0.025). These methods also offered major computational advantages, with model fitting times typically an order of magnitude faster and RAM usage approximately 30% lower than Bayesian alternatives. However, these measurements do not account for the computational costs of hyperparameter tuning.
Conclusions
By standardizing input data and evaluation procedures, this resource simplifies benchmarking and enables fair, reproducible comparisons of genomic prediction methods. It also broadens access to genomic prediction data, encouraging data scientists and interdisciplinary researchers to test novel modelling strategies. | 
|---|---|
| ISSN: | 1471-2164 |
| DOI: | 10.1186/s12864-025-12129-0 |
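
The summary reports predictive performance as Pearson's correlation coefficient (r) between observed and predicted phenotypes. As a rough illustration of how such a score might be computed under k-fold cross-validation, here is a minimal standard-library sketch. It does not use the EasyGeSe loading functions, whose names and signatures are not given in this record. The helpers `pearson_r`, `kfold_indices`, `cross_validated_r` and the toy `dosage_sum_model` are hypothetical names introduced for this example, the fold count of 5 is an assumption, and the genotype and phenotype data are synthetic.

```python
import math
import random


def pearson_r(observed, predicted):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(observed)
    if n != len(predicted) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        return float("nan")  # r is undefined when either sequence is constant
    return cov / math.sqrt(var_o * var_p)


def kfold_indices(n_samples, n_folds, seed=0):
    """Shuffle sample indices and split them into n_folds disjoint test sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def cross_validated_r(genotypes, phenotypes, fit_predict, n_folds=5, seed=0):
    """Mean Pearson's r over folds; fit_predict is any user-supplied model."""
    scores = []
    for test_idx in kfold_indices(len(phenotypes), n_folds, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(phenotypes)) if i not in held_out]
        predictions = fit_predict(
            [genotypes[i] for i in train_idx],
            [phenotypes[i] for i in train_idx],
            [genotypes[i] for i in test_idx],
        )
        scores.append(pearson_r([phenotypes[i] for i in test_idx], predictions))
    return sum(scores) / len(scores)


def dosage_sum_model(train_x, train_y, test_x):
    """Toy baseline (not a method from the paper): least-squares regression
    of the phenotype on the summed marker dosage of each individual."""
    sums = [sum(row) for row in train_x]
    mean_s = sum(sums) / len(sums)
    mean_y = sum(train_y) / len(train_y)
    var_s = sum((s - mean_s) ** 2 for s in sums)
    cov_sy = sum((s - mean_s) * (y - mean_y) for s, y in zip(sums, train_y))
    slope = 0.0 if var_s == 0 else cov_sy / var_s
    return [mean_y + slope * (sum(row) - mean_s) for row in test_x]


# Synthetic data: 100 individuals, 20 markers coded as 0/1/2 allele dosages.
rng = random.Random(1)
genotypes = [[rng.choice((0, 1, 2)) for _ in range(20)] for _ in range(100)]
phenotypes = [sum(row) + rng.gauss(0, 1) for row in genotypes]

mean_r = cross_validated_r(genotypes, phenotypes, dosage_sum_model)
print(f"mean cross-validated r = {mean_r:.2f}")
```

Any model can be swapped in for `dosage_sum_model` as long as it accepts training genotypes, training phenotypes and test genotypes and returns one prediction per test individual. Keeping the fold assignment fixed through `seed` is what makes scores comparable across models.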