Cross-modal sentiment analysis model based on modal representation learning

Bibliographic Details
Main Authors: Bai, Jianguo; Yang, Hai; Feng, Cheng; Wang, Shuxian; Li, Xue
Format: Conference Proceeding
Language: English
Published: SPIE, 07.08.2024
ISBN: 9781510681866, 1510681868
ISSN: 0277-786X
DOI: 10.1117/12.3038252

More Information
Summary: With the rapid development of Internet and multimedia technology, people increasingly express their feelings and views through video and other media. The key to sentiment analysis of user videos on social media is to fully exploit the embedded multimodal features, such as text, audio, and facial expressions, and to build efficient deep learning models on them. Traditional approaches that simply fuse feature vectors, or that combine the predictions of several separate models, cannot effectively extract the intra-modal characteristics and inter-modal commonalities of multimodal data, so the accuracy of their sentiment analysis results is unsatisfactory. To address these issues, this article takes monologue videos posted by users on social media as its research object and proposes CMRL, a cross-modal sentiment analysis model based on modal representation learning. By imposing constraints on both the independent-modality module and the fusion module, the fusion module can fully account for the intrinsic characteristics of each modality. To let the model fully learn intra-modal characteristics, a loss function based on the Pearson correlation coefficient is established between the sentiment analysis results that the independent-modality module produces for the speech, text, and expression-image modalities and the sentiment analysis results of the fusion module. To prevent intra-modal features from being lost or confused after feature fusion, the speech, text, and expression-image features extracted by the Transformer in the independent-modality module are concatenated, and a loss function based on the Spearman correlation coefficient is established between these concatenated features and the fused features of the fusion module.
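
The two correlation-based constraints described in the abstract can be sketched roughly as follows. This is a minimal PyTorch illustration, not the authors' code: all tensor names, shapes, and the soft-rank suggestion are assumptions. It shows a Pearson-correlation loss aligning each unimodal sentiment prediction with the fusion module's prediction, and a Spearman-style loss (Pearson correlation computed on ranks) aligning concatenated unimodal features with the fused features; the hard ranking step blocks gradients, so a practical training loss would substitute a differentiable soft-rank surrogate.

import torch

def pearson_loss(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # 1 - Pearson r between two 1-D tensors; fully differentiable.
    a = a - a.mean()
    b = b - b.mean()
    r = (a * b).sum() / (a.norm() * b.norm() + 1e-8)
    return 1.0 - r

def spearman_loss(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # 1 - Spearman rho = Pearson r computed on ranks. torch.argsort yields
    # hard (non-differentiable) ranks; actual training would need a
    # soft-rank approximation (e.g. the torchsort package).
    rank_a = torch.argsort(torch.argsort(a)).float()
    rank_b = torch.argsort(torch.argsort(b)).float()
    return pearson_loss(rank_a, rank_b)

# Hypothetical batch of sentiment scores, one per sample, from each
# unimodal head and from the fusion head (shapes assumed for illustration).
text_pred   = torch.randn(32)  # text-modality predictions
speech_pred = torch.randn(32)  # speech-modality predictions
image_pred  = torch.randn(32)  # expression-image predictions
fusion_pred = torch.randn(32)  # fusion-module predictions

# Pearson constraint: align each unimodal prediction with the fusion output.
pred_consistency = (pearson_loss(text_pred, fusion_pred)
                    + pearson_loss(speech_pred, fusion_pred)
                    + pearson_loss(image_pred, fusion_pred))

# Spearman constraint: align concatenated unimodal Transformer features
# with the fusion module's fused features (dimensions assumed).
unimodal_feats = torch.randn(256)
fused_feats = torch.randn(256)
feat_consistency = spearman_loss(unimodal_feats, fused_feats)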
Bibliography: Conference Date: 2024-05-10 to 2024-05-12
Conference Location: Nanchang, China