Effects of Label Noise on Deep Learning-Based Skin Cancer Classification

Bibliographic Details
Published in Frontiers in Medicine Vol. 7; p. 177
Main Authors Hekler, Achim, Kather, Jakob N., Krieghoff-Henning, Eva, Utikal, Jochen S., Meier, Friedegund, Gellrich, Frank F., Upmeier zu Belzen, Julius, French, Lars, Schlager, Justin G., Ghoreschi, Kamran, Wilhelm, Tabea, Kutzner, Heinz, Berking, Carola, Heppt, Markus V., Haferkamp, Sebastian, Sondermann, Wiebke, Schadendorf, Dirk, Schilling, Bastian, Izar, Benjamin, Maron, Roman, Schmitt, Max, Fröhling, Stefan, Lipka, Daniel B., Brinker, Titus J.
Format Journal Article
Language English
Published Switzerland: Frontiers Media S.A., 06.05.2020
ISSN 2296-858X
DOI 10.3389/fmed.2020.00177

Summary: Recent studies have shown that deep learning is capable of classifying dermatoscopic images at least as well as dermatologists. However, many studies in skin cancer classification use training images that are not biopsy-verified. This imperfect ground truth introduces a systematic error, but its effects on classifier performance are currently unknown. Here, we systematically examine the effects of label noise by training and evaluating convolutional neural networks (CNNs) with 804 images of melanoma and nevi labeled either by dermatologists or by biopsy. The CNNs are evaluated on a test set of 384 images by means of 4-fold cross-validation, comparing the outputs with either the corresponding dermatological or the biopsy-verified diagnosis. With identical ground truths for training and test labels, high accuracies of 75.03% (95% CI: 74.39-75.66%) for dermatological and 73.80% (95% CI: 73.10-74.51%) for biopsy-verified labels can be achieved. However, if the CNN is trained and tested with different ground truths, accuracy drops significantly to 64.53% (95% CI: 63.12-65.94%, p < 0.01) on a non-biopsy-verified and to 64.24% (95% CI: 62.66-65.83%, p < 0.01) on a biopsy-verified test set. In conclusion, deep learning methods for skin cancer classification are highly sensitive to label noise, and future work should use biopsy-verified training images to mitigate this problem.
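The experimental design described above can be pictured as a train/test label-source cross-over: one CNN trained on dermatologist labels, one on biopsy-verified labels, each then scored against both label sources. The following Python sketch (PyTorch, which is not necessarily the framework used in the study) illustrates that setup only; the random placeholder tensors, the ResNet-18 backbone, the image resolution, and all hyperparameters are assumptions made for a self-contained example, not details from the article.

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset
    from torchvision import models

    def make_model():
        # CNN backbone adapted to 2 classes (melanoma vs. nevus).
        # weights=None keeps the sketch self-contained; the original work
        # may have used pretrained weights.
        model = models.resnet18(weights=None)
        model.fc = nn.Linear(model.fc.in_features, 2)
        return model

    def train(model, loader, epochs=5, lr=1e-4, device="cpu"):
        model.to(device).train()
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):
            for x, y in loader:
                x, y = x.to(device), y.to(device)
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()
        return model

    @torch.no_grad()
    def accuracy(model, loader, device="cpu"):
        model.to(device).eval()
        correct = total = 0
        for x, y in loader:
            pred = model(x.to(device)).argmax(dim=1).cpu()
            correct += (pred == y).sum().item()
            total += y.numel()
        return correct / total

    # Placeholder data: 804 training and 384 test images (small 64x64
    # resolution for the sketch), each with a dermatologist label and a
    # biopsy-verified label. Real dermatoscopic images would replace these.
    train_x = torch.randn(804, 3, 64, 64)
    train_y = {"dermatologist": torch.randint(0, 2, (804,)),
               "biopsy": torch.randint(0, 2, (804,))}
    test_x = torch.randn(384, 3, 64, 64)
    test_y = {"dermatologist": torch.randint(0, 2, (384,)),
              "biopsy": torch.randint(0, 2, (384,))}

    for train_src, labels in train_y.items():
        loader = DataLoader(TensorDataset(train_x, labels), batch_size=32, shuffle=True)
        model = train(make_model(), loader)
        for test_src, t_labels in test_y.items():
            t_loader = DataLoader(TensorDataset(test_x, t_labels), batch_size=32)
            acc = accuracy(model, t_loader)
            print(f"trained on {train_src} labels, tested on {test_src} labels: acc={acc:.3f}")

In the study, repeating this comparison under 4-fold cross-validation and matching versus mismatching the label sources is what exposes the accuracy drop attributed to label noise.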
Edited by: Robert Gniadecki, University of Alberta, Canada
This article was submitted to Dermatology, a section of the journal Frontiers in Medicine
Reviewed by: Irina Khamaganova, Pirogov Russian National Research Medical University, Russia; Unni Samavedam, University of Cincinnati, United States