Performance Evaluation of Deep Learning for the Detection and Segmentation of Thyroid Nodules: Systematic Review and Meta-Analysis

Bibliographic Details
Published in	Journal of Medical Internet Research Vol. 27; p. e73516
Main Authors	Ni, Jiayu; You, Yue; Wu, Xiaohe; Chen, Xueke; Wang, Jiaying; Li, Yuan
Format	Journal Article
Language	English
Published	Canada: JMIR Publications, 14.08.2025
Subjects	AI Applications in Biomedical Engineering; Algorithms; Applications of AI; Artificial Intelligence; Computer-Aided Diagnosis; Deep Learning; Humans; Imaging Informatics; Machine Learning; Medical Imaging; Review; Thyroid Neoplasms - diagnostic imaging; Thyroid Nodule - diagnosis; Thyroid Nodule - diagnostic imaging; Preferred Reporting Items for Systematic reviews and Meta-Analyses; PRISMA; diagnostic performance; thyroid imaging; systematic review; artificial intelligence; sensitivity and specificity
ISSN	1438-8871, 1439-4456
DOI	10.2196/73516

Summary:	Thyroid cancer is one of the most common endocrine malignancies, and its incidence has steadily increased in recent years. Distinguishing between benign and malignant thyroid nodules (TNs) is challenging due to their overlapping imaging features. The rapid advancement of artificial intelligence (AI) in medical image analysis, particularly deep learning (DL) algorithms, has provided novel solutions for automated TN detection. However, existing studies exhibit substantial heterogeneity in diagnostic performance, and no systematic, evidence-based study has comprehensively assessed the diagnostic performance of DL models in this field. This study aimed to conduct a systematic review and meta-analysis to appraise the performance of DL algorithms in diagnosing TN malignancy, identify key factors influencing their diagnostic efficacy, and compare their accuracy with that of clinicians in image-based diagnosis. We systematically searched multiple databases, including PubMed, Cochrane, Embase, Web of Science, and IEEE, and identified 41 eligible studies for systematic review and meta-analysis. Based on the task type, studies were categorized into segmentation (n=14) and detection (n=27) tasks. The pooled sensitivity, specificity, and area under the receiver operating characteristic curve (AUC) were calculated for each group. Subgroup analyses were performed to examine the impact of transfer learning and to compare model performance against that of clinicians. For segmentation tasks, the pooled sensitivity, specificity, and AUC were 82% (95% CI 79%-84%), 95% (95% CI 92%-96%), and 0.91 (95% CI 0.89-0.94), respectively. For detection tasks, the pooled sensitivity, specificity, and AUC were 91% (95% CI 89%-93%), 89% (95% CI 86%-91%), and 0.96 (95% CI 0.93-0.97), respectively. Some studies demonstrated that DL models could achieve diagnostic performance comparable with, or even exceeding, that of clinicians in certain scenarios. The application of transfer learning contributed to improved model performance. DL algorithms exhibit promising diagnostic accuracy in TN imaging, highlighting their potential as auxiliary diagnostic tools. However, current studies are limited by suboptimal methodological design, inconsistent image quality across datasets, and insufficient external validation, which may introduce bias. Future research should enhance methodological standardization, improve model interpretability, and promote transparent reporting to facilitate the sustainable clinical translation of DL-based solutions.
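For readers unfamiliar with the metrics quoted in the summary, the following is a minimal Python sketch of how sensitivity and specificity are derived from per-study 2x2 counts, and how a naive pooled estimate can be formed by summing those counts. It is an illustration only: the record does not state which pooling model the authors used (diagnostic meta-analyses commonly rely on random-effects models rather than simple summation), the `StudyCounts` class and `naive_pool` function are hypothetical names, and the counts are made-up numbers, not data from the review.

```python
from dataclasses import dataclass


@dataclass
class StudyCounts:
    """2x2 diagnostic counts for one study (hypothetical container)."""
    tp: int  # malignant nodules correctly flagged
    fp: int  # benign nodules incorrectly flagged
    fn: int  # malignant nodules missed
    tn: int  # benign nodules correctly cleared


def sensitivity(tp: int, fn: int) -> float:
    """Proportion of truly positive cases that the model detects."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Proportion of truly negative cases that the model clears."""
    return tn / (tn + fp)


def naive_pool(studies: list[StudyCounts]) -> tuple[float, float]:
    """Pool by summing counts across studies.

    Assumption: this simple summation ignores between-study
    heterogeneity and is NOT claimed to be the review's method.
    """
    tp = sum(s.tp for s in studies)
    fp = sum(s.fp for s in studies)
    fn = sum(s.fn for s in studies)
    tn = sum(s.tn for s in studies)
    return sensitivity(tp, fn), specificity(tn, fp)


# Illustrative, invented counts for two studies.
studies = [
    StudyCounts(tp=90, fp=10, fn=10, tn=90),
    StudyCounts(tp=80, fp=5, fn=20, tn=95),
]
pooled_sens, pooled_spec = naive_pool(studies)
print(f"pooled sensitivity = {pooled_sens:.3f}")
print(f"pooled specificity = {pooled_spec:.3f}")
```

Summing counts weights each study by its sample size, which is easy to follow but can mislead when studies differ in design or threshold; that is one reason the heterogeneity noted in the summary matters for interpreting the pooled figures.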