Analysis of follow-up data in large biobank cohorts: a review of methodology

Bibliographic Details
Published in: Frontiers in genetics Vol. 16; p. 1534726
Main Authors: Kolde, Anastassia; Koitmäe, Merli; Käärik, Meelis; Möls, Märt; Fischer, Krista
Format: Journal Article
Language: English
Published: Switzerland, Frontiers Media S.A, 25.06.2025
ISSN: 1664-8021
DOI: 10.3389/fgene.2025.1534726

Summary: This study focuses on key methodological challenges in genome-wide association studies (GWAS) of biobank data with time-to-event outcomes, analyzed using the Cox proportional hazards (CPH) model. We address four primary issues: left-truncation of the data, computational inefficiency of standard model-fitting algorithms, relatedness among individuals, and model misspecification. To manage left-truncation, the common practice is to use age as the timescale, with individuals entering the risk set at their age of recruitment. We assess how this choice of timescale influences bias and statistical power, under realistic GWAS conditions of varying effect sizes and censoring rates. In addition, to alleviate the computational burden typical in large-scale data, we propose and evaluate a two-step martingale residual (MR) approach for high-dimensional CPH modeling. Our results show that the timescale choice has minimal effect on accuracy for small hazard ratios, though using time since birth as the timescale – ignoring recruitment age – yields the highest power for association detection. We find that relatedness, when ignored, does not substantially bias effect size estimates, while omitting key covariates introduces significant bias. The two-step MR approach proves to be computationally efficient, retaining power for detecting small effect sizes, making it suitable for large-scale association studies. However, when precise effect size estimates are critical, particularly for moderate or larger effect sizes, we recommend recalculating these estimates using the conventional CPH model, with careful attention to left-truncation and relatedness. These conclusions are drawn from simulations and illustrated with data from the Estonian Biobank cohort.
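
The summary names two techniques without spelling them out: age as the timescale with delayed entry at recruitment, and a two-step martingale residual approach. The sketch below is one plausible reading, not the authors' implementation. It makes several assumptions: the null model has no covariates (so the cumulative hazard is a plain Nelson-Aalen estimate), the second step is an ordinary least-squares slope of residuals on genotype, all function names are hypothetical, and the six-person dataset is invented purely for illustration.

```python
def hazard_increments(entry, exit_, event):
    """Nelson-Aalen increments on the age timescale with left-truncation.

    A person is in the risk set at age t only if entry < t <= exit,
    i.e. they enter the risk set at their recruitment age.
    Assumes entry < exit for every person.
    """
    event_ages = sorted({x for x, d in zip(exit_, event) if d})
    increments = []
    for t in event_ages:
        n_events = sum(1 for x, d in zip(exit_, event) if d and x == t)
        n_at_risk = sum(1 for a, x in zip(entry, exit_) if a < t <= x)
        increments.append((t, n_events / n_at_risk))
    return increments


def martingale_residuals(entry, exit_, event):
    """Step 1: residual = observed event - cumulative hazard accrued while at risk."""
    increments = hazard_increments(entry, exit_, event)
    residuals = []
    for a, x, d in zip(entry, exit_, event):
        accrued = sum(h for t, h in increments if a < t <= x)
        residuals.append(d - accrued)
    return residuals


def ols_slope(x, y):
    """Step 2: least-squares slope of y on x (x must not be constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


# Invented toy data: recruitment age, age at event or censoring,
# event indicator (1 = event, 0 = censored), genotype coded 0/1/2.
entry_age = [40, 45, 50, 55, 60, 42]
exit_age = [60, 50, 70, 65, 80, 75]
event = [1, 0, 1, 1, 0, 1]
genotype = [0, 1, 2, 1, 0, 2]

# The residuals are computed once and could be reused for every variant.
residuals = martingale_residuals(entry_age, exit_age, event)
slope = ols_slope(genotype, residuals)
print(residuals, slope)
```

Under this reading, the computational saving comes from fitting the null model once and then running only a cheap linear regression per variant. In a real analysis the first step would be a null Cox model with covariates rather than the covariate-free estimate used here.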
Edited by: Sebastian Zöllner, University of Michigan, United States
Reviewed by: Ghislain Rocheleau, Icahn School of Medicine at Mount Sinai, United States
Silke Szymczak, University of Lübeck, Germany
Matthew Zawistowski, University of Michigan, United States
These authors have contributed equally to this work and share first authorship