CVRSF-Net: Image Emotion Recognition by Combining Visual Relationship Features and Scene Features

Bibliographic Details
Published in: IEEE Transactions on Emerging Topics in Computational Intelligence, Vol. 9, No. 3, pp. 2321-2333
Main Authors: Luo, Yutong; Zhong, Xinyue; Xie, Jialan; Liu, Guangyuan
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.06.2025
ISSN: 2471-285X
DOI: 10.1109/TETCI.2025.3543300

More Information
Summary: Image emotion recognition, which aims to analyze the emotional responses of people to various stimuli in images, has attracted substantial attention in recent years with the proliferation of social media. Because human emotion is a highly complex and abstract cognitive process, simply extracting local or global features from an image is not sufficient for recognizing its emotion. The psychologist Moshe Bar proposed that, during human visual comprehension of images, visual objects are usually embedded in a scene together with other related objects. Therefore, we propose a two-branch emotion-recognition network known as the combined visual relationship feature and scene feature network (CVRSF-Net). In the scene feature-extraction branch, a pretrained CLIP model is adopted to extract the visual features of images, and a feature channel weighting module extracts the scene features. In the visual relationship feature-extraction branch, a visual relationship detection model is used to extract the visual relationships in the images, and a semantic fusion module fuses the scene and visual relationship features. Furthermore, we spatially weight the visual relationship features using class activation maps. Finally, the implicit relationships between different visual relationship features are obtained using a graph attention network, and a two-branch loss function is designed to train the model. The experimental results showed that the recognition rates of the proposed network were 79.80%, 69.81%, and 36.72% on the FI-8, Emotion-6, and WEBEmo datasets, respectively. The proposed algorithm achieves state-of-the-art results compared with existing methods.
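The abstract describes the architecture only at a high level. The following minimal PyTorch sketch illustrates one way the described two-branch structure could be wired together; all module names, feature dimensions, the use of standard multi-head self-attention in place of the paper's graph attention network, and the auxiliary-loss weighting are assumptions made for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ChannelWeighting(nn.Module):
    """Squeeze-and-excitation style channel weighting over scene features (assumed)."""
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(),
            nn.Linear(dim // reduction, dim), nn.Sigmoid())

    def forward(self, x):                       # x: (B, dim)
        return x * self.gate(x)

class CVRSFSketch(nn.Module):
    """Hypothetical two-branch emotion classifier in the spirit of CVRSF-Net."""
    def __init__(self, scene_dim=512, rel_dim=512, num_classes=8, heads=4):
        super().__init__()
        # Scene branch: a frozen CLIP image encoder would produce `scene_feat`;
        # only the channel-weighting step that follows it is modeled here.
        self.channel_weight = ChannelWeighting(scene_dim)
        self.scene_head = nn.Linear(scene_dim, num_classes)

        # Relationship branch: embeddings of detected <subject, predicate, object>
        # triplets, refined by attention over the relationship set (a stand-in
        # for the paper's graph attention network).
        self.rel_attn = nn.MultiheadAttention(rel_dim, heads, batch_first=True)
        self.fuse = nn.Linear(scene_dim + rel_dim, num_classes)

    def forward(self, scene_feat, rel_feats):
        # scene_feat: (B, scene_dim) from the image encoder
        # rel_feats:  (B, N, rel_dim), one vector per detected visual relationship
        scene = self.channel_weight(scene_feat)
        rel, _ = self.rel_attn(rel_feats, rel_feats, rel_feats)
        rel = rel.mean(dim=1)                   # pool relationship nodes
        logits = self.fuse(torch.cat([scene, rel], dim=-1))
        aux_logits = self.scene_head(scene)     # scene-only auxiliary prediction
        return logits, aux_logits

def two_branch_loss(logits, aux_logits, labels, alpha=0.5):
    """Cross-entropy on the fused prediction plus an auxiliary scene-branch term
    (the weighting alpha is an assumption)."""
    ce = nn.functional.cross_entropy
    return ce(logits, labels) + alpha * ce(aux_logits, labels)
```

In practice, scene_feat would come from a frozen CLIP image encoder and rel_feats from embeddings of detected subject-predicate-object triplets (optionally weighted by class activation maps); both are treated as precomputed inputs here to keep the sketch self-contained.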