Designing image segmentation studies: Statistical power, sample size and reference standard quality

Bibliographic Details
Published in Medical Image Analysis, Vol. 42, pp. 44-59
Main Authors Gibson, Eli; Hu, Yipeng; Huisman, Henkjan J.; Barratt, Dean C.
Format Journal Article
Language English
Published Netherlands: Elsevier B.V., 01.12.2017
ISSN 1361-8415, 1361-8423, 1361-8431
DOI 10.1016/j.media.2017.07.004

Summary:
• A sample size calculation for segmentation accuracy studies is derived.
• Parameters include accuracy difference, algorithm disagreement and a design factor.
• A formula is derived to account for errors in the study reference standard.
• A case study illustrates the application of the theory to a segmentation study design.

Segmentation algorithms are typically evaluated by comparison to an accepted reference standard. The cost of generating accurate reference standards for medical image segmentation can be substantial. Since the study cost and the likelihood of detecting a clinically meaningful difference in accuracy both depend on the size and on the quality of the study reference standard, balancing these trade-offs supports the efficient use of research resources. In this work, we derive a statistical power calculation that enables researchers to estimate the appropriate sample size to detect clinically meaningful differences in segmentation accuracy (i.e. the proportion of voxels matching the reference standard) between two algorithms. Furthermore, we derive a formula to relate reference standard errors to their effect on the sample sizes of studies using lower-quality (but potentially more affordable and practically available) reference standards. The accuracy of the derived sample size formula was estimated through Monte Carlo simulation, demonstrating, with 95% confidence, a predicted statistical power within 4% of simulated values across a range of model parameters. This corresponds to sample size errors of less than 4 subjects and errors in the detectable accuracy difference less than 0.6%. The applicability of the formula to real-world data was assessed using bootstrap resampling simulations for pairs of algorithms from the PROMISE12 prostate MR segmentation challenge data set. The model predicted the simulated power for the majority of algorithm pairs within 4% for simulated experiments using a high-quality reference standard and within 6% for simulated experiments using a low-quality reference standard. A case study, also based on the PROMISE12 data, illustrates using the formulae to evaluate whether to use a lower-quality reference standard in a prostate segmentation study.
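
The summary describes checking a sample size formula against Monte Carlo simulation. This record does not reproduce the article's formula, so the sketch below is not the authors' method. It substitutes the generic textbook normal approximation for a two-sided paired comparison, assuming per-subject accuracy differences between two algorithms are roughly normal. The function names, parameter names (`delta`, `sd_diff`) and example values are hypothetical, chosen only for illustration.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def paired_sample_size(delta, sd_diff, alpha=0.05, power=0.8):
    """Subjects needed to detect a mean paired difference `delta`.

    Generic normal approximation (NOT the formula derived in the article):
    n = ((z_{1-alpha/2} + z_{power}) * sd_diff / delta)^2, rounded up.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return math.ceil(((z_alpha + z_power) * sd_diff / delta) ** 2)


def simulated_power(n, delta, sd_diff, alpha=0.05, trials=2000, seed=0):
    """Monte Carlo estimate of power for the same design.

    Each trial draws n per-subject accuracy differences from a normal
    distribution (an assumption) and applies a two-sided z-test using
    the sample standard deviation.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        diffs = [rng.gauss(delta, sd_diff) for _ in range(n)]
        z_stat = mean(diffs) / (stdev(diffs) / math.sqrt(n))
        if abs(z_stat) > z_crit:
            rejections += 1
    return rejections / trials


# Hypothetical inputs: a 1% accuracy difference, with a 3% standard
# deviation of per-subject differences between the two algorithms.
n = paired_sample_size(delta=0.01, sd_diff=0.03)
print(n, simulated_power(n, delta=0.01, sd_diff=0.03))
```

Comparing the requested power with the simulated value mirrors, in a much simplified form, the kind of validation the summary reports. It does not model reference standard errors or the design factor named in the highlights.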