Evaluation of Similarity of Image Explanations Produced by SHAP, LIME and Grad-CAM
| Published in | Kìbernetika ta komp'ûternì tehnologìï (Online) no. 2; pp. 69 - 76 |
|---|---|
| Format | Journal Article |
| Language | English |
| Published | V.M. Glushkov Institute of Cybernetics, 06.06.2025 |
| ISSN | 2707-4501, 2707-451X |
| DOI | 10.34229/2707-451X.25.2.6 |
Summary:

Introduction. Convolutional neural networks (CNNs) are a subtype of neural networks developed specifically to work with images [1]. They have achieved great success in both research and practical applications in recent years; however, one of the major pain points in adopting them is the inability to interpret the reasoning behind their conclusions. Because of this, various explainable artificial intelligence (XAI) methods have been developed, yet it remains unclear whether they reveal a CNN's reasoning, or whether different methods reveal the same aspects of that reasoning. In recent years, some of the most popular methods, LIME [2], SHAP [3], and Grad-CAM [4], have been evaluated on tabular data, where their results were shown to differ significantly [5], and have been compared for trustworthiness through human evaluation on medical images [6]; nevertheless, there is still no established measure of how different these methods are on image classification models. This study uses correlation and a popular segmentation measure, Intersection over Union (IoU) [7], to evaluate their differences.

The purpose of the article. The aim of this work is to evaluate the level of difference between SHAP, LIME, and Grad-CAM on an image classification task.

Results. In this study, we evaluated the similarity between image explanations generated by SHAP, LIME, and Grad-CAM using two models trained for specific image classification tasks. The evaluation was performed on two datasets, one with a fine-tuned and one with a pre-trained model: the CBIS-DDSM breast cancer dataset with a fine-tuned ResNet-18 model, and the ImageNet Object Classification Challenge (IOCC) with a pre-trained VGG-16 model. Our analysis revealed that, although all of these methods aim to approximate feature importance, their outputs differ significantly, which makes it difficult to determine the true reasoning of the model. Quantitative similarity metrics confirmed that the methods were most often independent, overlapping by less than half on average. Moreover, the metrics also differed significantly depending on the dataset and the model. Defining what should serve as the ground truth, or which method has the best practical use, remains complicated, as the literature contains numerous variations of fidelity metrics and varies significantly in human-based evaluation perspectives. Future work may include evaluating the impact of method parameters on the overlap, further investigating the influence of the dataset and the selected model on the similarity, and quantitatively comparing the models with human-based metrics, such as comparing saliency maps with segmentation masks.

Keywords: computer vision, convolutional neural network, Grad-CAM, LIME, SHAP, saliency maps, explainable AI, XAI.
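As a rough illustration of the similarity measures named in the summary, the sketch below computes Pearson correlation and IoU between two saliency maps. This is not the paper's code: the array names, the resizing of both maps to a common shape, the choice of the top 20% of pixels as the binarization threshold, and the NumPy-only implementation are illustrative assumptions.

```python
# Minimal sketch (assumptions noted above): comparing two per-pixel
# importance maps, e.g. a SHAP map and a Grad-CAM map, with Pearson
# correlation and Intersection over Union (IoU).
import numpy as np


def pearson_correlation(map_a: np.ndarray, map_b: np.ndarray) -> float:
    """Correlation between the flattened importance scores of two maps."""
    return float(np.corrcoef(map_a.ravel(), map_b.ravel())[0, 1])


def iou(map_a: np.ndarray, map_b: np.ndarray, top_fraction: float = 0.2) -> float:
    """IoU of the binary masks formed by each map's most important pixels."""
    def top_mask(m: np.ndarray) -> np.ndarray:
        # Keep the pixels whose importance falls in the top `top_fraction`.
        threshold = np.quantile(m, 1.0 - top_fraction)
        return m >= threshold

    mask_a, mask_b = top_mask(map_a), top_mask(map_b)
    intersection = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return float(intersection / union) if union > 0 else 0.0


# Example with random stand-ins for real SHAP / Grad-CAM outputs,
# both already resized to the same 224x224 resolution.
rng = np.random.default_rng(0)
shap_map = rng.random((224, 224))
gradcam_map = rng.random((224, 224))
print(f"correlation: {pearson_correlation(shap_map, gradcam_map):.3f}")
print(f"IoU:         {iou(shap_map, gradcam_map):.3f}")
```

An IoU below 0.5 under such a thresholding scheme would correspond to the "less than half overlap on average" finding reported in the summary, though the exact threshold used in the paper is not specified here.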