Designing image segmentation studies: Statistical power, sample size and reference standard quality


Bibliographic Details
Published in	Medical image analysis Vol. 42; pp. 44 - 59
Main Authors	Gibson, Eli, Hu, Yipeng, Huisman, Henkjan J., Barratt, Dean C.
Format	Journal Article
Language	English
Published	Netherlands: Elsevier B.V., 01.12.2017
Subjects	Accuracy; Algorithms; Case studies; Computer simulation; Confidence intervals; Humans; Image Enhancement - methods; Image Interpretation, Computer-Assisted - methods; Image processing; Image segmentation; Imaging, Three-Dimensional; Magnetic Resonance Imaging - methods; Male; Mathematical models; Medical imaging; Models, Statistical; Monte Carlo simulation; Prostate; Prostatic Neoplasms - diagnostic imaging; Reference standard; Reference Standards; Reproducibility of Results; Resampling; Sample Size; Segmentation accuracy; Sensitivity and Specificity; Statistical power; Statistics; Studies
ISSN	1361-8415, 1361-8423, 1361-8431
DOI	10.1016/j.media.2017.07.004

Summary:
• A sample size calculation for segmentation accuracy studies is derived.
• Parameters include accuracy difference, algorithm disagreement and a design factor.
• A formula is derived to account for errors in the study reference standard.
• A case study illustrates the application of the theory to a segmentation study design.

Segmentation algorithms are typically evaluated by comparison to an accepted reference standard. The cost of generating accurate reference standards for medical image segmentation can be substantial. Since the study cost and the likelihood of detecting a clinically meaningful difference in accuracy both depend on the size and on the quality of the study reference standard, balancing these trade-offs supports the efficient use of research resources. In this work, we derive a statistical power calculation that enables researchers to estimate the appropriate sample size to detect clinically meaningful differences in segmentation accuracy (i.e., the proportion of voxels matching the reference standard) between two algorithms. Furthermore, we derive a formula to relate reference standard errors to their effect on the sample sizes of studies using lower-quality (but potentially more affordable and practically available) reference standards. The accuracy of the derived sample size formula was estimated through Monte Carlo simulation, demonstrating, with 95% confidence, a predicted statistical power within 4% of simulated values across a range of model parameters. This corresponds to sample size errors of less than 4 subjects and errors in the detectable accuracy difference of less than 0.6%. The applicability of the formula to real-world data was assessed using bootstrap resampling simulations for pairs of algorithms from the PROMISE12 prostate MR segmentation challenge data set. The model predicted the simulated power for the majority of algorithm pairs within 4% for simulated experiments using a high-quality reference standard and within 6% for simulated experiments using a low-quality reference standard. A case study, also based on the PROMISE12 data, illustrates using the formulae to evaluate whether to use a lower-quality reference standard in a prostate segmentation study.
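The general idea of checking a predicted sample size against Monte Carlo simulation can be sketched as follows. This is not the formula derived in the article, which is not reproduced in this record and which also involves algorithm disagreement, a design factor and reference standard errors. The sketch instead uses the textbook normal-approximation sample size for a paired comparison, and it assumes that per-subject accuracy differences between the two algorithms are independent and Gaussian. The function names and the values of `delta` and `sd_diff` are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def paired_sample_size(delta, sd_diff, alpha=0.05, power=0.8):
    """Textbook normal-approximation sample size for a two-sided paired test.

    delta   -- accuracy difference to detect (assumed value)
    sd_diff -- standard deviation of per-subject accuracy differences (assumed)
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_beta) * sd_diff / delta) ** 2)


def simulated_power(n, delta, sd_diff, alpha=0.05, trials=2000, seed=0):
    """Monte Carlo estimate of power: fraction of simulated studies that reject."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        # Assumption: per-subject accuracy differences are i.i.d. Gaussian.
        diffs = [rng.gauss(delta, sd_diff) for _ in range(n)]
        z = mean(diffs) / (stdev(diffs) / math.sqrt(n))
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials


if __name__ == "__main__":
    n = paired_sample_size(delta=0.02, sd_diff=0.05)
    print("predicted sample size:", n)
    print("simulated power at that size:", simulated_power(n, 0.02, 0.05))
```

Comparing the simulated power with the target power (0.8 here) is the same kind of check the summary describes, applied to a much simpler model.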