External validation of a publicly available computer assisted diagnostic tool for mammographic mass lesions with two high prevalence research datasets

Bibliographic Details
Published in	Medical physics (Lancaster) Vol. 42; no. 8; pp. 4987-4996
Main Authors	Benndorf, Matthias; Burnside, Elizabeth S.; Herda, Christoph; Langer, Mathias; Kotter, Elmar
Format	Journal Article
Language	English
Published	United States: American Association of Physicists in Medicine, 01.08.2015
Subjects	Access to Information | Algorithms | Area Under Curve | Bayes methods | Bayes Theorem | Bayesian statistics | Biological material, e.g. blood, urine; Haemocytometers | BI-RADS | Breast Neoplasms - diagnosis | Breast Neoplasms - diagnostic imaging | CADx | Calibration | Cancer | Computer aided diagnosis | Computer modeling | data analysis | Databases, Factual | Diagnosis, Differential | Digital mammography | Humans | mammography | Mammography - methods | Probability theory | Radiation Imaging Physics | Radiographic Image Interpretation, Computer-Assisted - methods | Radiologists | ROC Curve
ISSN	0094-2405 2473-4209 1522-8541
DOI	10.1118/1.4927260

More Information
Summary:	Purpose: Lesions detected at mammography are described with a highly standardized terminology: the breast imaging-reporting and data system (BI-RADS) lexicon. To date, no validated semantic computer-assisted classification algorithm exists that interactively links combinations of morphological descriptors from the lexicon to a probabilistic risk estimate of malignancy. The authors therefore aim to externally validate the mammographic mass diagnosis (MMassDx) algorithm. To ultimately become accepted in clinical routine, a classification algorithm like MMassDx must perform well in a variety of clinical circumstances and in datasets that were not used to generate the algorithm.
Methods: The MMassDx algorithm uses a naïve Bayes network and calculates post-test probabilities of malignancy based on two distinct sets of variables: (a) BI-RADS descriptors and age (“descriptor model”) and (b) BI-RADS descriptors, age, and BI-RADS assessment categories (“inclusive model”). The authors evaluate both the MMassDx (descriptor) and MMassDx (inclusive) models using two large publicly available datasets of mammographic mass lesions: (i) the digital database for screening mammography (DDSM) dataset, which contains two subsets from the same examinations, a medio-lateral oblique (MLO) view dataset and a cranio-caudal (CC) view dataset, and (ii) the mammographic mass (MM) dataset. The DDSM contains 1220 mass lesions and the MM dataset contains 961 mass lesions. The authors evaluate discriminative performance using the area under the receiver-operating-characteristic curve (AUC) and compare it to that of the BI-RADS assessment categories alone (i.e., the clinical performance) using the DeLong method. The authors also use calibration curves to evaluate whether the assigned probabilistic risk estimates reflect the lesions’ true risk of malignancy.
Results: The authors demonstrate that the MMassDx algorithms show good discriminatory performance. AUC for the MMassDx (descriptor) model in the DDSM data is 0.876/0.895 (MLO/CC view) and AUC for the MMassDx (inclusive) model in the DDSM data is 0.891/0.900 (MLO/CC view). AUC for the MMassDx (descriptor) model in the MM data is 0.862 and AUC for the MMassDx (inclusive) model in the MM data is 0.900. In all scenarios, MMassDx performs significantly better than clinical performance, P < 0.05 each. The authors furthermore demonstrate that the MMassDx algorithm systematically underestimates the risk of malignancy in the DDSM and MM datasets, especially when low probabilities of malignancy are assigned.
Conclusions: The authors’ results reveal that the MMassDx algorithms have good discriminatory performance but less accurate calibration when tested on two independent validation datasets. Improvement in calibration and testing in a prospective clinical population will be important steps in the pursuit of translation of these algorithms to the clinic.
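Illustrative note: the three quantities named in the summary (a naïve Bayes post-test probability, the AUC, and a calibration curve) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the MMassDx implementation. The prior, the likelihood table, the variable names, and the function names are all hypothetical and are not taken from the publication. The DeLong comparison of two AUCs is not shown.

```python
# Hypothetical illustration values; NOT taken from the MMassDx publication.
PRIOR_MALIGNANT = 0.3

# {variable: {value: (P(value | malignant), P(value | benign))}}, all assumed.
LIKELIHOODS = {
    "margin": {"spiculated": (0.50, 0.05), "circumscribed": (0.10, 0.60)},
    "shape": {"irregular": (0.60, 0.15), "oval": (0.15, 0.50)},
    "age_group": {"under_50": (0.20, 0.45), "50_plus": (0.80, 0.55)},
}


def post_test_probability(findings, prior=PRIOR_MALIGNANT, likelihoods=LIKELIHOODS):
    """Naive Bayes posterior: variables are treated as independent given the class."""
    p_malignant = prior
    p_benign = 1.0 - prior
    for variable, value in findings.items():
        l_malignant, l_benign = likelihoods[variable][value]
        p_malignant *= l_malignant
        p_benign *= l_benign
    # Normalize so the two class scores sum to one.
    return p_malignant / (p_malignant + p_benign)


def auc(scores, labels):
    """AUC as the fraction of (malignant, benign) pairs ranked correctly; ties count 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def calibration_table(scores, labels, n_bins=5):
    """Per equal-width bin: (mean predicted risk, observed malignancy rate, count)."""
    bins = [[] for _ in range(n_bins)]
    for score, label in zip(scores, labels):
        index = min(int(score * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[index].append((score, label))
    rows = []
    for members in bins:
        if members:  # skip empty bins
            mean_predicted = sum(s for s, _ in members) / len(members)
            observed = sum(y for _, y in members) / len(members)
            rows.append((mean_predicted, observed, len(members)))
    return rows


if __name__ == "__main__":
    lesion = {"margin": "spiculated", "shape": "irregular", "age_group": "50_plus"}
    print(post_test_probability(lesion))
    print(auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
    print(calibration_table([0.05, 0.15, 0.9, 0.95], [0, 1, 1, 1], n_bins=2))
```

In a table like this, a bin whose observed malignancy rate exceeds its mean predicted risk corresponds to the kind of underestimation the summary reports for low assigned probabilities.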
Bibliography:	Author to whom correspondence should be addressed. Electronic mail: matthias.benndorf@uniklinik-freiburg.de