Evaluation of Similarity of Image Explanations Produced by SHAP, LIME and Grad-CAM
| Published in | Kìbernetika ta komp'ûternì tehnologìï (Online) no. 2; pp. 69 - 76 |
|---|---|
| Format | Journal Article |
| Language | English |
| Published | V.M. Glushkov Institute of Cybernetics, 06.06.2025 |
| ISSN | 2707-4501, 2707-451X |
| DOI | 10.34229/2707-451X.25.2.6 |
Summary:

Introduction. Convolutional neural networks (CNNs) are a subtype of neural networks developed specifically to work with images [1]. They have achieved great success in both research and practical applications in recent years; however, one of the major pain points in adopting them is the inability to interpret the reasoning behind their conclusions. Because of this, various explainable artificial intelligence (XAI) methods have been developed, yet it remains unclear whether they reveal a CNN's reasoning, or whether different methods reveal the same aspects of that reasoning. In recent years, some of the most popular methods, LIME [2], SHAP [3], and Grad-CAM [4], have been evaluated on tabular data, where their results were shown to differ significantly [5], and have been compared for trustworthiness through human evaluation on medical images [6]; nevertheless, there is still no established measure of how different these methods are on image classification models. This study uses correlation and a popular segmentation measure, Intersection over Union (IoU) [7], to evaluate their differences.

The purpose of the article. The aim of this work is to evaluate the level of difference between SHAP, LIME, and Grad-CAM on an image classification task.

Results. In this study, we evaluated the similarity between image explanations generated by SHAP, LIME, and Grad-CAM using two models trained for specific image classification tasks. The evaluation was performed on two datasets, one with a fine-tuned and one with a pre-trained model: the CBIS-DDSM breast cancer dataset with a fine-tuned ResNet-18 model, and the ImageNet Object Classification Challenge (IOCC) with a pre-trained VGG-16 model. Our analysis revealed that, although all of these methods aim to approximate feature importance, their outputs differ significantly, which makes it difficult to determine the true reasoning of the model. Quantitative similarity metrics confirmed that the methods were most often independent, overlapping by less than half on average. Moreover, the metrics also differed significantly depending on the dataset and the model. Defining what should serve as the ground truth, or which method has the best practical use, remains complicated, as the literature contains numerous variations of fidelity metrics and varies significantly in human-based evaluation perspectives. Future work may include evaluating the impact of method parameters on the overlap, further investigating the influence of the dataset and the selected model on the similarity, and quantitatively comparing the models with human-based metrics, such as comparing saliency maps with segmentation masks.

Keywords: computer vision, convolutional neural network, Grad-CAM, LIME, SHAP, saliency maps, explainable AI, XAI.
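As a rough illustration of the similarity measures named in the summary, the sketch below computes Pearson correlation and IoU between two saliency maps. This is not the paper's code: the array names, the resizing of both maps to a common shape, the choice of the top 20% of pixels as the binarization threshold, and the NumPy-only implementation are illustrative assumptions.

```python
# Minimal sketch (assumptions noted above): comparing two per-pixel
# importance maps, e.g. a SHAP map and a Grad-CAM map, with Pearson
# correlation and Intersection over Union (IoU).
import numpy as np


def pearson_correlation(map_a: np.ndarray, map_b: np.ndarray) -> float:
    """Correlation between the flattened importance scores of two maps."""
    return float(np.corrcoef(map_a.ravel(), map_b.ravel())[0, 1])


def iou(map_a: np.ndarray, map_b: np.ndarray, top_fraction: float = 0.2) -> float:
    """IoU of the binary masks formed by each map's most important pixels."""
    def top_mask(m: np.ndarray) -> np.ndarray:
        # Keep the pixels whose importance falls in the top `top_fraction`.
        threshold = np.quantile(m, 1.0 - top_fraction)
        return m >= threshold

    mask_a, mask_b = top_mask(map_a), top_mask(map_b)
    intersection = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return float(intersection / union) if union > 0 else 0.0


# Example with random stand-ins for real SHAP / Grad-CAM outputs,
# both already resized to the same 224x224 resolution.
rng = np.random.default_rng(0)
shap_map = rng.random((224, 224))
gradcam_map = rng.random((224, 224))
print(f"correlation: {pearson_correlation(shap_map, gradcam_map):.3f}")
print(f"IoU:         {iou(shap_map, gradcam_map):.3f}")
```

An IoU below 0.5 under such a thresholding scheme would correspond to the "less than half overlap on average" finding reported in the summary, though the exact threshold used in the paper is not specified here.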