Evaluation of Artificial Intelligence-Based Gleason Grading Algorithms “in the Wild”

Bibliographic Details
Published in	Modern Pathology, Vol. 37, no. 11, p. 100563
Main Authors	Faryna, Khrystyna; Tessier, Leslie; Retamero, Juan; Bonthu, Saikiran; Samanta, Pranab; Singhal, Nitin; Kammerer-Jacquet, Solene-Florence; Radulescu, Camelia; Agosti, Vittorio; Collin, Alexandre; Farré, Xavier; Fontugne, Jacqueline; Grobholz, Rainer; Hoogland, Agnes Marije; Moreira Leite, Katia Ramos; Oktay, Murat; Polonia, Antonio; Roy, Paromita; Salles, Paulo Guilherme; van der Kwast, Theodorus H.; van Ipenburg, Jolique; van der Laak, Jeroen; Litjens, Geert
Format	Journal Article
Language	English
Published	United States Elsevier Inc 01.11.2024 Nature Publishing Group: Open Access Hybrid Model Option B
Subjects	Algorithms; Artificial intelligence; Computational pathology; Deep learning; Gleason grading; Humans; Life Sciences; Male; Neoplasm Grading - methods; Prostatic Neoplasms - pathology; Public health and epidemiology
ISSN	0893-3952 1530-0285
DOI	10.1016/j.modpat.2024.100563

Summary:	The biopsy Gleason score is an important prognostic marker for prostate cancer patients. It is, however, subject to substantial variability among pathologists. Artificial intelligence (AI)–based algorithms employing deep learning have shown their ability to match pathologists’ performance in assigning Gleason scores, with the potential to enhance pathologists’ grading accuracy. The performance of Gleason AI algorithms in research is mostly reported on common benchmark data sets or within public challenges. In contrast, many commercial algorithms are evaluated in clinical studies, for which data are not publicly released. As commercial AI vendors typically do not publish performance on public benchmarks, comparison between research and commercial AI is difficult. The aims of this study are to evaluate and compare the performance of top-ranked public and commercial algorithms using real-world data. Through crowdsourcing, we curated a diverse data set of whole-slide prostate biopsy images covering a range of Gleason scores and originating from diverse sources. Predictions were obtained from 5 top-ranked public algorithms from the Prostate cANcer graDe Assessment (PANDA) challenge and 2 commercial Gleason grading algorithms. Additionally, 10 pathologists (A.C., C.R., J.v.I., K.R.M.L., P.R., P.G.S., R.G., S.F.K.J., T.v.d.K., X.F.) evaluated the data set in a reader study. Overall, the pairwise quadratic weighted kappa among pathologists ranged from 0.777 to 0.916. Both public and commercial algorithms showed high agreement with pathologists, with quadratic weighted kappa ranging from 0.617 to 0.900. Commercial algorithms performed on par with or outperformed the top public algorithms.
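The agreement statistic named in the summary, pairwise quadratic weighted kappa, can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. It assumes each grade has already been mapped to an integer ordinal category from 0 to num_classes - 1 (for example, ISUP grade groups with 0 for benign, which is an assumption here), and the function names, rater names, and toy labels are all hypothetical.

```python
from itertools import combinations


def quadratic_weighted_kappa(a, b, num_classes):
    """Cohen's kappa with quadratic weights for two raters.

    a, b: equal-length sequences of integer labels in 0..num_classes-1.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both raters must grade the same non-empty set of cases")
    if num_classes < 2:
        raise ValueError("need at least two ordinal categories")
    k, n = num_classes, len(a)

    # Confusion matrix of observed label pairs.
    observed = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x][y] += 1

    # Marginal label counts for each rater.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # quadratic disagreement penalty
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * hist_a[i] * hist_b[j] / n

    if weighted_expected == 0.0:
        # Both raters used one identical category throughout; returning 1.0
        # here is a convention chosen for this sketch.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


def pairwise_kappas(ratings, num_classes):
    """Kappa for every unordered pair of raters in a {name: labels} dict."""
    return {
        (r1, r2): quadratic_weighted_kappa(ratings[r1], ratings[r2], num_classes)
        for r1, r2 in combinations(sorted(ratings), 2)
    }


if __name__ == "__main__":
    # Toy labels for illustration only; these are not data from the study.
    toy = {
        "reader_1": [0, 1, 2, 2],
        "reader_2": [0, 1, 1, 2],
        "algorithm_x": [0, 1, 2, 2],
    }
    for pair, kappa in pairwise_kappas(toy, num_classes=3).items():
        print(pair, round(kappa, 3))
```

Under quadratic weights, a disagreement of two categories is penalized four times as heavily as a disagreement of one, which suits an ordinal scale such as Gleason grading. A range like the 0.777 to 0.916 reported among pathologists would correspond to the minimum and maximum over such a table of pairwise values.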