CVRSF-Net: Image Emotion Recognition by Combining Visual Relationship Features and Scene Features

Bibliographic Details
Published in: IEEE Transactions on Emerging Topics in Computational Intelligence, Vol. 9, No. 3, pp. 2321-2333
Main Authors: Luo, Yutong; Zhong, Xinyue; Xie, Jialan; Liu, Guangyuan
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.06.2025
ISSN: 2471-285X
DOI: 10.1109/TETCI.2025.3543300

More Information
Summary: Image emotion recognition, which aims to analyze the emotional responses of people to various stimuli in images, has attracted substantial attention in recent years with the proliferation of social media. Because human emotion is a highly complex and abstract cognitive process, simply extracting local or global features from an image is not sufficient for recognizing its emotion. The psychologist Moshe Bar proposed that, during human visual comprehension of images, visual objects are usually embedded in a scene together with other related objects. Therefore, we propose a two-branch emotion-recognition network known as the combined visual relationship feature and scene feature network (CVRSF-Net). In the scene feature-extraction branch, a pretrained CLIP model is adopted to extract the visual features of images, and a feature channel weighting module extracts the scene features. In the visual relationship feature-extraction branch, a visual relationship detection model is used to extract the visual relationships in the images, and a semantic fusion module fuses the scene and visual relationship features. Furthermore, we spatially weight the visual relationship features using class activation maps. Finally, the implicit relationships between different visual relationship features are obtained using a graph attention network, and a two-branch loss function is designed to train the model. The experimental results showed that the recognition rates of the proposed network were 79.80%, 69.81%, and 36.72% on the FI-8, Emotion-6, and WEBEmo datasets, respectively. The proposed algorithm achieves state-of-the-art results compared with existing methods.
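The abstract describes the architecture only at a high level. The following minimal PyTorch sketch illustrates one way the described two-branch structure could be wired together; all module names, feature dimensions, the use of standard multi-head self-attention in place of the paper's graph attention network, and the auxiliary-loss weighting are assumptions made for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ChannelWeighting(nn.Module):
    """Squeeze-and-excitation style channel weighting over scene features (assumed)."""
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(),
            nn.Linear(dim // reduction, dim), nn.Sigmoid())

    def forward(self, x):                       # x: (B, dim)
        return x * self.gate(x)

class CVRSFSketch(nn.Module):
    """Hypothetical two-branch emotion classifier in the spirit of CVRSF-Net."""
    def __init__(self, scene_dim=512, rel_dim=512, num_classes=8, heads=4):
        super().__init__()
        # Scene branch: a frozen CLIP image encoder would produce `scene_feat`;
        # only the channel-weighting step that follows it is modeled here.
        self.channel_weight = ChannelWeighting(scene_dim)
        self.scene_head = nn.Linear(scene_dim, num_classes)

        # Relationship branch: embeddings of detected <subject, predicate, object>
        # triplets, refined by attention over the relationship set (a stand-in
        # for the paper's graph attention network).
        self.rel_attn = nn.MultiheadAttention(rel_dim, heads, batch_first=True)
        self.fuse = nn.Linear(scene_dim + rel_dim, num_classes)

    def forward(self, scene_feat, rel_feats):
        # scene_feat: (B, scene_dim) from the image encoder
        # rel_feats:  (B, N, rel_dim), one vector per detected visual relationship
        scene = self.channel_weight(scene_feat)
        rel, _ = self.rel_attn(rel_feats, rel_feats, rel_feats)
        rel = rel.mean(dim=1)                   # pool relationship nodes
        logits = self.fuse(torch.cat([scene, rel], dim=-1))
        aux_logits = self.scene_head(scene)     # scene-only auxiliary prediction
        return logits, aux_logits

def two_branch_loss(logits, aux_logits, labels, alpha=0.5):
    """Cross-entropy on the fused prediction plus an auxiliary scene-branch term
    (the weighting alpha is an assumption)."""
    ce = nn.functional.cross_entropy
    return ce(logits, labels) + alpha * ce(aux_logits, labels)
```

In practice, scene_feat would come from a frozen CLIP image encoder and rel_feats from embeddings of detected subject-predicate-object triplets (optionally weighted by class activation maps); both are treated as precomputed inputs here to keep the sketch self-contained.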