Reference standard methodology in the clinical evaluation of AI chest X-ray algorithms for lung cancer detection: A systematic review
| Published in | European journal of radiology Vol. 192; p. 112409 |
|---|---|
| Format | Journal Article |
| Language | English |
| Published | Ireland, Elsevier B.V, 01.11.2025 |
| ISSN | 0720-048X, 1872-7727 |
| DOI | 10.1016/j.ejrad.2025.112409 |
| Summary: | • There is significant variation in the reference standard methodology of studies evaluating chest X-ray algorithms in thoracic malignancy.
• 37% of the studies included in this review had unreported ground truth components.
• The selection of reference standards has implications for translation, comparison and reproducibility of research in this field.
Lung cancer remains the leading cause of cancer death worldwide, with early diagnosis linked to improved survival. Artificial intelligence (AI) holds promise for augmenting radiologists’ workflows in chest X-ray (CXR) interpretation, particularly for detecting thoracic malignancies. However, clinical implementation of this technology relies on robust, standardised reference standard methodology at the patient level.
This systematic review aims to describe reference standard methodology in the clinical evaluation of CXR algorithms for lung cancer detection.
Searches targeted studies on AI CXR analysis across MEDLINE, Embase, CENTRAL, and trial registries. Two reviewers independently screened titles and abstracts, with disagreements resolved by a third reviewer. Studies lacking external validation in real-world cohorts were excluded. Bias was assessed using a modified QUADAS-2 tool, and data synthesis followed SWiM guidelines.
A total of 1,679 papers were screened, and 46 papers were included for full paper review. 24 different AI solutions were evaluated across a broad range of research questions. We identified significant heterogeneity in reference standard methodology, including variations in target abnormalities, reference standard modality, expert panel composition, and arbitration techniques. 25% of reference standard parameters were inadequately reported. 66% of included studies demonstrated high risk of bias in at least one domain.
To our knowledge, this is the first systematic description of patient-level reference standard methodology in CXR AI analysis of thoracic malignancy. To facilitate translational progress in this field, researchers undertaking evaluations of diagnostic algorithms at the patient level should ensure that reference standards are aligned with clinical workflows and adhere to reporting guidelines. Limitations include a lack of prospective studies. |
|---|---|