Learning deep multimodal affective features for spontaneous speech emotion recognition

Bibliographic Details
Published in Speech Communication, Vol. 127, pp. 73-81
Main Authors Zhang, Shiqing; Tao, Xin; Chuang, Yuelong; Zhao, Xiaoming
Format Journal Article
Language English
Published Amsterdam: Elsevier B.V., 01.03.2021
Elsevier Science Ltd
ISSN 0167-6393
1872-7182
DOI 10.1016/j.specom.2020.12.009

More Information
Summary: Recently, spontaneous speech emotion recognition has become an active and challenging research subject. This paper proposes a new method of spontaneous speech emotion recognition using deep multimodal audio feature learning based on multiple deep convolutional neural networks (multi-CNNs). The proposed method first generates three different audio inputs for the multi-CNNs so as to learn deep multimodal segment-level features from the original 1D audio signal in three aspects: 1) a 1D CNN for raw-waveform modeling, 2) a 2D CNN for time-frequency Mel-spectrogram modeling, and 3) a 3D CNN for temporal-spatial dynamic modeling. Then, average pooling is performed on the segment-level classification results obtained from the 1D, 2D, and 3D CNNs to produce utterance-level classification results. Finally, a score-level fusion strategy is adopted as the multi-CNN fusion method to integrate the different utterance-level classification results for final emotion classification. The learned deep multimodal audio features are shown to be complementary to each other, so combining them in the multi-CNN fusion network achieves significantly improved emotion classification performance. Experiments on two challenging spontaneous emotional speech datasets, i.e., the AFEW 5.0 and BAUM-1s databases, demonstrate the promising performance of the proposed method.
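The summary describes a three-branch pipeline: segment-level scores from a 1D CNN (raw waveform), a 2D CNN (Mel-spectrogram), and a 3D CNN (stacked spectrogram frames) are average-pooled to utterance level and then combined by score-level fusion. The sketch below is a minimal, hypothetical PyTorch illustration of that flow only; all layer configurations, input shapes, class counts, and fusion weights are assumptions for demonstration, not the authors' actual architecture.

```python
# Minimal sketch of the multi-CNN flow (assumed layer sizes, not the paper's networks).
import torch
import torch.nn as nn

class Branch1D(nn.Module):
    """1D CNN over raw-waveform segments: (segments, 1, samples) -> class logits."""
    def __init__(self, n_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=64, stride=8), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=32, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

class Branch2D(nn.Module):
    """2D CNN over Mel-spectrogram segments: (segments, 1, mel, frames) -> class logits."""
    def __init__(self, n_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

class Branch3D(nn.Module):
    """3D CNN over stacked spectrogram frames: (segments, 1, depth, mel, frames) -> logits."""
    def __init__(self, n_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

def utterance_scores(branch, segments):
    """Average-pool segment-level class scores into one utterance-level score vector."""
    seg_logits = branch(segments)                    # (n_segments, n_classes)
    return seg_logits.softmax(dim=1).mean(dim=0)     # (n_classes,)

def fuse_scores(scores, weights=(1.0, 1.0, 1.0)):
    """Score-level fusion: weighted sum of per-branch utterance-level class scores."""
    return sum(w * s for w, s in zip(weights, scores))

if __name__ == "__main__":
    b1, b2, b3 = Branch1D(), Branch2D(), Branch3D()
    # Dummy segments for one utterance (shapes are purely illustrative).
    raw  = torch.randn(5, 1, 16000)        # 5 one-second raw-waveform segments
    mel  = torch.randn(5, 1, 64, 100)      # 5 Mel-spectrogram segments
    cube = torch.randn(5, 1, 8, 64, 100)   # 5 stacked spectrogram-frame cubes
    fused = fuse_scores([utterance_scores(b1, raw),
                         utterance_scores(b2, mel),
                         utterance_scores(b3, cube)])
    print("predicted emotion class:", int(fused.argmax()))
```

In this sketch the fusion is a plain weighted sum of per-branch class posteriors; the paper's specific fusion weights, network depths, and training procedure are not reproduced here.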