Multimodal Emotion Recognition based on Face and Speech using Deep Convolution Neural Network and Long Short Term Memory
| Published in | Circuits, Systems, and Signal Processing, Vol. 44, No. 9, pp. 6622–6649 |
|---|---|
| Main Authors | , |
| Format | Journal Article |
| Language | English |
| Published | New York: Springer US, 01.09.2025 (Springer Nature B.V.) |
| ISSN | 0278-081X; 1531-5878 |
| DOI | 10.1007/s00034-025-03080-2 |
Summary: Multimodal emotion recognition (MER) is crucial for analyzing a person's mental behavior and health and for enhancing the performance of human–computer interaction systems. Various deep learning-based MER systems have been presented in the last decade. However, their outcomes are limited by poor feature representation, weak correlation between short- and long-term features, security issues, low generalization capability, low reliability of the emotional-modality systems, and the high computational complexity of deep learning models. This paper presents an MER scheme based on facial images and speech data using a parallel deep convolutional neural network (PDCNN) and bidirectional long short-term memory (BiLSTM) to improve the system's reliability, security, and robustness. The PDCNN offers superior generalization capability and feature representation, while the BiLSTM provides better long-term dependency modeling, temporal representation, and correlation between the short- and long-term attributes of the multimodal data. A novel hybrid Particle Swarm Optimization based on Multi-Attribute Utility Theory and the Archimedes Optimization Algorithm (PMA) selects the crucial features of the facial-expression and speech data to minimize the computational complexity of the PDCNN-BiLSTM framework. On the BAUM dataset, the scheme achieves an overall accuracy of 99.22%, precision of 0.9967, recall of 0.9933, and an F1-score of 0.9949, improving on traditional techniques.
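The abstract describes the architecture only at a high level. Below is a minimal PyTorch sketch of the general idea: two parallel CNN branches (one per modality, for face frames and speech spectrogram slices) whose per-frame features are fused and passed to a BiLSTM for temporal modeling. All layer sizes, the class count, and the fusion-by-concatenation choice are illustrative assumptions, not the paper's reported configuration; the PMA feature-selection step is omitted.

```python
# Illustrative sketch only: layer sizes, class count, and concatenation fusion
# are assumptions, not the paper's configuration.
import torch
import torch.nn as nn

class CNNBranch(nn.Module):
    """One convolutional branch of the parallel DCNN (face or speech)."""
    def __init__(self, in_channels: int, out_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.proj = nn.Linear(64 * 4 * 4, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels, H, W) -> per-frame embeddings (batch, time, out_dim)
        b, t = x.shape[:2]
        feats = self.conv(x.flatten(0, 1)).flatten(1)
        return self.proj(feats).view(b, t, -1)

class MERModel(nn.Module):
    """Parallel CNN branches fused by concatenation, then a BiLSTM classifier."""
    def __init__(self, num_classes: int = 7):
        super().__init__()
        self.face_branch = CNNBranch(in_channels=3)    # RGB face crops
        self.speech_branch = CNNBranch(in_channels=1)  # log-mel spectrogram slices
        self.bilstm = nn.LSTM(input_size=256, hidden_size=128,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * 128, num_classes)

    def forward(self, face: torch.Tensor, speech: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.face_branch(face), self.speech_branch(speech)], dim=-1)
        out, _ = self.bilstm(fused)         # temporal modeling of fused features
        return self.classifier(out[:, -1])  # last time step -> emotion logits

# Example: 8 clips, 16 time steps, 64x64 face crops and 64x64 spectrogram slices.
model = MERModel()
logits = model(torch.randn(8, 16, 3, 64, 64), torch.randn(8, 16, 1, 64, 64))
print(logits.shape)  # torch.Size([8, 7])
```

Concatenation is only one plausible fusion strategy; the paper's PMA step, which prunes features before the recurrent stage to cut computational cost, would slot in between the branch outputs and the BiLSTM.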