Pathological voice detection using optimized deep residual neural network and explainable artificial intelligence
| Published in | Multimedia tools and applications Vol. 84; no. 19; pp. 21863 - 21889 |
|---|---|
| Main Authors | , |
| Format | Journal Article |
| Language | English |
| Published | New York, Springer US, 01.06.2025, Springer Nature B.V |
| Subjects | |
| ISSN | 1573-7721, 1380-7501 |
| DOI | 10.1007/s11042-024-20348-y |
| Summary: | Voice disorders affect individuals’ vocal quality and communication abilities, posing significant challenges for both individuals and healthcare providers. Accurate and timely detection of voice disorders is crucial for early intervention and effective treatment. This study proposes a new noninvasive approach for voice disorder detection based on an optimized deep residual neural network. Input speech samples are transformed into mel-spectrogram time-frequency images and used to train the ResNet-50 transfer learning model. The spectrogram time-frequency representation captures intricate patterns and features that might indicate the presence of voice disorders by exploiting local and global characteristics. Four hyperparameters of the ResNet-50 model are optimized using the snake optimization algorithm, which delivers an optimal residual deep transfer learning (DTL) model with an enhanced voice pathology detection rate. The proposed snake-optimized ResNet-50 model is evaluated on four popular voice pathology datasets: AVPD, SVD, PdA and VOICED. The results demonstrate the efficacy of the optimized ResNet-50 framework in classifying healthy and pathological voice samples with 98.13% accuracy. Comparisons with recent machine learning and deep learning models reveal the superiority of the proposed approach in terms of F1-score, sensitivity, specificity and accuracy. Finally, gradient-weighted class activation mapping (Grad-CAM), an explainable artificial intelligence (XAI) technique, is used to visualize and interpret the decision-making process. |
|---|---|
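
The summary compares models by F1-score, sensitivity, specificity and accuracy. As a minimal sketch of how these four metrics relate for a binary healthy/pathological classifier, the snippet below derives them from confusion-matrix counts. It assumes that "pathological" is the positive class and that every denominator is nonzero. The function name `binary_metrics` and the counts are hypothetical and are not taken from the article.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute accuracy, sensitivity, specificity and F1 from confusion-matrix counts.

    Assumption: 'pathological' is the positive class, so
    tp = pathological voices correctly flagged, tn = healthy voices correctly passed.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)  # recall on pathological (positive) samples
    specificity = tn / (tn + fp)  # recall on healthy (negative) samples
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts for illustration only (not results from the article).
metrics = binary_metrics(tp=90, fp=5, tn=95, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Reporting sensitivity and specificity alongside accuracy matters when the healthy and pathological classes are imbalanced, since accuracy alone can hide poor performance on the smaller class.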