Smart voice recognition based on deep learning for depression diagnosis

Bibliographic Details
Published in	Artificial Life and Robotics, Vol. 28, no. 2, pp. 332-342
Main Authors	Suparatpinyo, Sukit; Soonthornphisaj, Nuanwan
Format	Journal Article
Language	English
Published	Tokyo: Springer Japan, 01.05.2023; Springer Nature B.V.
Subjects	Accuracy; Acoustics; Algorithms; Artificial Intelligence; Audio data; Colleges & universities; Computation by Abstract Devices; Computer Science; Control; Datasets; Deep learning; Discrete cosine transform; Fast Fourier transformations; Hospitals; Machine learning; Mechatronics; Mental depression; Mental disorders; Mental health; Original Article; Regression analysis; Robotics; Schizophrenia; Self assessment; Spectrographs; Speech; Teenagers; Voice recognition; Window functions; Spectrograph; Depression; Audio file; Deep residual network; Recognition
ISSN	1433-5298 1614-7456
DOI	10.1007/s10015-023-00852-4

Summary:	Depressive disorder is a mental illness with a high incidence rate, caused by environmental stress or social factors. Depression affects mood and behavior, leading to problems in domains such as education, family, and the workplace, and suicide attempts occur in severe cases. However, depression is treatable when diagnosed by a psychiatrist. In Thailand, many people who are aware of their mental disorders do not seek help from psychiatric hospitals because of long waiting times and high fees. We therefore aim to create an application that lets users self-assess by collecting their voice signal data. In our experiment, the positive class is voice data recorded from depressive patients during therapy sessions in a psychiatric hospital, and the negative class is voice data of non-depressive people recorded during interview sessions with university students. Each audio file is rendered into a spectrograph, a visual representation of the power spectrum; here it shows the Mel-frequency cepstral coefficients (MFCCs) extracted from the voice as they change over time, computed with the fast Fourier transform (FFT) and discrete cosine transform (DCT) algorithms. Because some research claims that the DCT causes some spectral features to be lost, we empirically compare spectrograph sets built with and without the DCT. Moreover, some studies state that a larger window provides more detail of speech activity in the power spectrum, which affects depression-detection performance, so we use the Blackman-Harris and Blackman window functions to create different sets of spectrographs and test this idea on a Thai speech dataset. Deep learning models based on the deep residual network (ResNet) are explored for classification, with three depths examined: ResNet-34, ResNet-50, and ResNet-101.
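The feature-extraction steps described above (windowing, FFT power spectrum, Mel filtering, then an optional DCT) can be sketched as follows. This is a minimal, standard-library-only illustration under stated assumptions, not the authors' implementation: the frame length (512 samples), hop (256), 40 Mel filters, 13 kept coefficients, the HTK-style Mel formula, the orthonormal DCT-II, the 8 kHz test tone, and all function names are hypothetical choices, since the summary does not give them. The "non-DCT" variant is interpreted here as the log-Mel energies taken before the DCT step.

```python
import cmath
import math

# Hypothetical parameters: the summary does not give the paper's settings.
N_FFT, HOP, N_MELS, N_MFCC = 512, 256, 40, 13


def blackman(n):
    """Symmetric 3-term Blackman window."""
    return [0.42 - 0.5 * math.cos(2 * math.pi * i / (n - 1))
            + 0.08 * math.cos(4 * math.pi * i / (n - 1)) for i in range(n)]


def blackman_harris(n):
    """Symmetric 4-term Blackman-Harris window (lower sidelobes than Blackman)."""
    a0, a1, a2, a3 = 0.35875, 0.48829, 0.14128, 0.01168
    return [a0 - a1 * math.cos(2 * math.pi * i / (n - 1))
            + a2 * math.cos(4 * math.pi * i / (n - 1))
            - a3 * math.cos(6 * math.pi * i / (n - 1)) for i in range(n)]


WINDOWS = {"blackman": blackman, "blackman_harris": blackman_harris}


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters evenly spaced on the Mel scale from 0 Hz to sr / 2."""
    top = hz_to_mel(sr / 2)
    bins = [int((n_fft + 1) * mel_to_hz(top * i / (n_filters + 1)) / sr)
            for i in range(n_filters + 2)]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = bins[j - 1], bins[j], bins[j + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(lo, mid):  # rising edge
            row[k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):  # falling edge, 1.0 at the centre bin
            row[k] = (hi - k) / (hi - mid)
        bank.append(row)
    return bank


def dct2(v):
    """Orthonormal DCT-II, the step that turns log-Mel energies into MFCCs."""
    n = len(v)
    return [math.sqrt((1.0 if k == 0 else 2.0) / n)
            * sum(x * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                  for i, x in enumerate(v))
            for k in range(n)]


def spectrogram(signal, sr, window="blackman", apply_dct=True):
    """One vector per frame: MFCCs if apply_dct, else log-Mel energies (non-DCT)."""
    win = WINDOWS[window](N_FFT)
    bank = mel_filterbank(N_MELS, N_FFT, sr)
    frames = []
    for start in range(0, len(signal) - N_FFT + 1, HOP):
        frame = [s * w for s, w in zip(signal[start:start + N_FFT], win)]
        power = [abs(c) ** 2 / N_FFT for c in fft(frame)[:N_FFT // 2 + 1]]
        log_mel = [math.log(sum(f * p for f, p in zip(row, power)) + 1e-10)
                   for row in bank]
        frames.append(dct2(log_mel)[:N_MFCC] if apply_dct else log_mel)
    return frames


# Illustrative input only: a 0.5 s, 440 Hz tone sampled at 8 kHz.
sr = 8000
tone = [math.sin(2 * math.pi * 440 * t / sr) for t in range(sr // 2)]
mfcc = spectrogram(tone, sr, window="blackman", apply_dct=True)
log_mel = spectrogram(tone, sr, window="blackman_harris", apply_dct=False)
print(len(mfcc), len(mfcc[0]), len(log_mel[0]))  # → 14 13 40
```

Toggling `window` and `apply_dct` switches between the spectrograph variants the summary compares. The hand-written loops only keep the sketch self-contained; in practice a library such as NumPy, SciPy, or librosa would do this work.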
The experimental results show that ResNet-50 models trained on both types of spectrograph achieve an F1-score above 70%, the best performance among the approaches compared. The model trained on spectrographs extracted with the Blackman window function and without the DCT gives the best sensitivity, at 74.45%. To the best of our knowledge, our approach gives the highest F1-score compared with state-of-the-art methods.
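The two metrics reported above are standard and can be computed from a binary confusion matrix in which, as in the summary, the depressive class is positive. The sketch below is a generic illustration only: the counts are hypothetical and are not the paper's results, and the function names are illustrative.

```python
def sensitivity(tp, fn):
    """Recall of the positive (depressive) class: TP / (TP + FN)."""
    return tp / (tp + fn) if tp + fn else 0.0


def precision(tp, fp):
    """Share of positive predictions that are correct: TP / (TP + FP)."""
    return tp / (tp + fp) if tp + fp else 0.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and sensitivity."""
    p, r = precision(tp, fp), sensitivity(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0


# Hypothetical counts for illustration only, not results from the paper.
tp, fp, fn = 30, 15, 10
print(f"sensitivity={sensitivity(tp, fn):.2f} F1={f1_score(tp, fp, fn):.2f}")
# → sensitivity=0.75 F1=0.71
```

Sensitivity measures how many depressive recordings are detected, while the F1-score also penalizes false alarms through precision, which is why a model can lead on one metric but not the other.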