Edge roughness quantifies impact of physician variation on training and performance of deep learning auto-segmentation models for the esophagus
| Published in | Scientific Reports Vol. 14; no. 1; pp. 2536 - 11 |
|---|---|
| Format | Journal Article |
| Language | English |
| Published | London: Nature Publishing Group UK, 30.01.2024; Nature Publishing Group; Nature Portfolio |
| ISSN | 2045-2322 |
| DOI | 10.1038/s41598-023-50382-z |
| Summary: | Manual segmentation of tumors and organs-at-risk (OAR) in 3D imaging for radiation-therapy planning is time-consuming and subject to variation between different observers. Artificial intelligence (AI) can assist with segmentation, but challenges exist in ensuring high-quality segmentation, especially for small, variable structures, such as the esophagus. We investigated the effect of variation in segmentation quality and style of physicians for training deep-learning models for esophagus segmentation and proposed a new metric, edge roughness, for evaluating/quantifying slice-to-slice inconsistency. This study includes a real-world cohort of 394 patients who each received radiation therapy (mainly for lung cancer). Segmentation of the esophagus was performed by 8 physicians as part of routine clinical care. We evaluated manual segmentation by comparing the length and edge roughness of segmentations among physicians to analyze inconsistencies. We trained eight multiple- and individual-physician segmentation models in total, based on U-Net architectures and residual backbones. We used the volumetric Dice coefficient to measure the performance for each model. We proposed a metric, edge roughness, to quantify the shift of segmentation among adjacent slices by calculating the curvature of edges of the 2D sagittal- and coronal-view projections. The auto-segmentation model trained on multiple physicians (MD1-7) achieved the highest mean Dice of 73.7 ± 14.8%. The individual-physician model (MD7) with the highest edge roughness (mean ± SD: 0.106 ± 0.016) demonstrated significantly lower volumetric Dice for test cases compared with other individual models (MD7: 58.5 ± 15.8%, MD6: 67.1 ± 16.8%, $p < 0.001$). A multiple-physician model trained after removing the MD7 data resulted in fewer outliers (e.g., Dice ≤ 40%: 4 cases for MD1-6, 7 cases for MD1-7, $N_{\text{total}} = 394$). While we initially detected this pattern in a single clinician, we validated the edge roughness metric across the entire dataset. The model trained with the lowest-quantile edge roughness (MD$_{\text{ER}}$-Q1, $N_{\text{train}} = 62$) achieved significantly higher Dice ($N_{\text{test}} = 270$) than the model trained with the highest-quantile ones (MD$_{\text{ER}}$-Q4, $N_{\text{train}} = 62$) (MD$_{\text{ER}}$-Q1: 67.8 ± 14.8%, MD$_{\text{ER}}$-Q4: 62.8 ± 15.7%, $p < 0.001$). This study demonstrates that there is significant variation in style and quality in manual segmentations in clinical care, and that training AI auto-segmentation algorithms from real-world, clinical datasets may result in unexpectedly under-performing algorithms with the inclusion of outliers. Importantly, this study provides a novel evaluation metric, edge roughness, to quantify physician variation in segmentation which will allow developers to filter clinical training data to optimize model performance. |
|---|---|
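
The summary names two quantities: the volumetric Dice coefficient and an edge-roughness measure based on the curvature of edges in the sagittal- and coronal-view projections. The record does not give the exact formula for edge roughness, so the sketch below is only an illustrative proxy, not the authors' implementation. It assumes a binary mask with axes ordered `(z, y, x)`, takes the first and last occupied position per slice as the "edges" of each projection, and uses the mean absolute second difference along the slice axis as a stand-in for curvature. The function names `volumetric_dice` and `edge_roughness_proxy` are hypothetical, and the values it produces are not on the same scale as those reported in the summary.

```python
import numpy as np


def volumetric_dice(pred, truth):
    """Dice = 2|A ∩ B| / (|A| + |B|) over all voxels of two binary 3D masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # both masks empty: treated as perfect agreement (a convention)
    return float(2.0 * np.logical_and(pred, truth).sum() / denom)


def edge_roughness_proxy(mask):
    """Illustrative slice-to-slice roughness of a binary mask with axes (z, y, x).

    ASSUMPTION: curvature is approximated by the second difference of the
    edge positions along z; this is not the published definition.
    """
    mask = np.asarray(mask, dtype=bool)
    values = []
    # Collapsing x gives a sagittal-view (z, y) projection;
    # collapsing y gives a coronal-view (z, x) projection.
    for collapse_axis in (2, 1):
        proj = mask.any(axis=collapse_axis)
        # Keep only slices that contain the structure. Gaps are dropped,
        # so slices on either side of a gap are treated as adjacent.
        rows = proj[proj.any(axis=1)]
        if rows.shape[0] < 3:
            continue  # a second difference needs at least three slices
        width = rows.shape[1]
        lower = rows.argmax(axis=1)                       # first occupied index
        upper = width - 1 - rows[:, ::-1].argmax(axis=1)  # last occupied index
        for edge in (lower, upper):
            values.append(np.abs(np.diff(edge.astype(float), n=2)))
    if not values:
        return 0.0
    return float(np.concatenate(values).mean())
```

With a measure of this kind, a contour that keeps the same position from slice to slice scores zero, while one that shifts back and forth between adjacent slices scores higher. That is the behaviour the summary attributes to edge roughness when it is used to compare physicians or to filter training data.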