Speech emotion recognition via graph-based representations

Bibliographic Details
Published in: Scientific Reports, Vol. 14, No. 1, Article 4484 (11 pages)
Main Authors: Pentari, Anastasia; Kafentzis, George; Tsiknakis, Manolis
Format: Journal Article
Language: English
Published: London: Nature Publishing Group UK, 23.02.2024
ISSN: 2045-2322
DOI: 10.1038/s41598-024-52989-2

Summary: Speech emotion recognition (SER) has gained increasing interest over the last decades as part of enriched affective computing. As a consequence, a variety of engineering approaches have been developed to address the SER problem, exploiting different features, learning algorithms, and datasets. In this paper, we propose the application of graph theory to the classification of emotionally colored speech signals. Graph theory provides tools for extracting statistical as well as structural information from any time series, and we propose using this information as a novel feature set. Furthermore, we suggest setting a unique feature-based identity for each emotion of each speaker. Emotion classification is performed by a Random Forest classifier in a Leave-One-Speaker-Out Cross-Validation (LOSO-CV) scheme. The proposed method is compared with two state-of-the-art approaches involving well-known hand-crafted features as well as deep learning architectures operating on mel-spectrograms. Experimental results on three datasets, EMODB (German, acted), AESDD (Greek, acted), and DEMoS (Italian, in-the-wild), reveal that our proposed method outperforms the comparative methods on all three. Specifically, we observe an average UAR increase of almost 18%, 8%, and 13%, respectively.
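
The sketch below illustrates the pipeline outlined in the summary: map a speech frame to a graph, extract a few statistical and structural graph descriptors, and score a Random Forest under leave-one-speaker-out cross-validation. The abstract does not specify the graph construction or feature set, so the natural visibility graph, the chosen descriptors, and the synthetic signals, labels, and speaker IDs are all illustrative assumptions; scikit-learn's balanced accuracy (macro-averaged recall, which coincides with UAR) stands in for the paper's metric.

```python
# Illustrative sketch only: graph construction, features, and data are assumptions,
# not the authors' exact method.
import numpy as np
import networkx as nx
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

def visibility_graph(x):
    """Natural visibility graph: each sample is a node; nodes i and j are linked
    if the straight line between (i, x[i]) and (j, x[j]) passes above every
    sample in between."""
    n = len(x)
    g = nx.Graph()
    g.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            slope = (x[j] - x[i]) / (j - i)
            if all(x[k] < x[i] + slope * (k - i) for k in range(i + 1, j)):
                g.add_edge(i, j)
    return g

def graph_features(x):
    """A small, illustrative set of statistical/structural graph descriptors."""
    g = visibility_graph(x)
    deg = np.array([d for _, d in g.degree()])
    clust = np.array(list(nx.clustering(g).values()))
    return np.array([deg.mean(), deg.std(), deg.max(),
                     clust.mean(), nx.density(g)])

# Synthetic stand-in data: 60 short "utterance frames" from 6 speakers,
# 3 placeholder emotion classes (real features would come from speech frames).
rng = np.random.default_rng(0)
X = np.vstack([graph_features(rng.standard_normal(128)) for _ in range(60)])
y = rng.integers(0, 3, size=60)          # emotion labels (placeholder)
speakers = np.repeat(np.arange(6), 10)   # speaker identity per sample

# LOSO-CV: each fold holds out every sample of one speaker; balanced accuracy
# equals unweighted average recall (UAR). Random labels score near chance here.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, X, y, groups=speakers,
                         cv=LeaveOneGroupOut(), scoring="balanced_accuracy")
print(f"UAR per held-out speaker: {scores}")
```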