Microphone Array Speech Separation Algorithm Based on TC-ResNet

Bibliographic Details
Published in	Computers, Materials & Continua, Vol. 69, no. 2, pp. 2705-2716
Main Authors	Zhou, Lin; Xu, Yue; Wang, Tianyi; Feng, Kun; Shi, Jingang
Format	Journal Article
Language	English
Published	Henderson: Tech Science Press, 2021
Subjects	Algorithms; Arrays; Artificial neural networks; Computational efficiency; Computing costs; Deep learning; Feature extraction; Intelligibility; Neural networks; Separation; Signal processing; Signal to noise ratio; Simulation; Speech; Tensors; Training
ISSN	1546-2218, 1546-2226
DOI	10.32604/cmc.2021.017080

Summary:	Traditional separation methods handle speech separation poorly in highly reverberant, low signal-to-noise ratio (SNR) environments and therefore achieve unsatisfactory results. In this study, a convolutional neural network that combines temporal convolution with a residual network (TC-ResNet) is proposed to realize speech separation in complex acoustic environments. A simplified steered-response power phase transform, denoted GSRP-PHAT, is employed to reduce the computational cost. The extracted features are reshaped into a special tensor that serves as the system input and is processed by temporal convolution, which not only enlarges the receptive field of the convolution layer but also significantly reduces the computational cost of the network. Residual blocks are used to combine multiresolution features and to accelerate training. A modified ideal ratio mask is applied as the training target. Simulation results demonstrate that the proposed TC-ResNet-based microphone array speech separation algorithm achieves better performance in terms of distortion ratio, source-to-interference ratio, and short-time objective intelligibility in low-SNR and highly reverberant environments, particularly in untrained situations. This indicates that the proposed method generalizes to untrained conditions.
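As a point of reference for the training target mentioned in the summary, the sketch below computes the standard (unmodified) ideal ratio mask for each time-frequency bin, sqrt(S^2 / (S^2 + N^2)), where S and N are the target and interference magnitudes. This record does not state how the paper modifies the mask, so the code should be read only as the textbook baseline. The function name, the list-of-lists spectrogram layout, and the eps guard are assumptions made for illustration.

```python
import math


def ideal_ratio_mask(target_mag, interference_mag, eps=1e-12):
    """Standard ideal ratio mask, one value per time-frequency bin.

    Assumption: both inputs are magnitude spectrograms stored as
    lists of rows with identical shapes. This is the textbook mask,
    not the modified mask used in the paper.
    """
    mask = []
    for target_row, interference_row in zip(target_mag, interference_mag):
        row = []
        for s, n in zip(target_row, interference_row):
            s_pow = s * s  # target power in this bin
            n_pow = n * n  # interference power in this bin
            # eps avoids division by zero when both powers are zero
            row.append(math.sqrt(s_pow / (s_pow + n_pow + eps)))
        mask.append(row)
    return mask


# Toy 2x2 magnitude spectrograms (hypothetical values).
target = [[1.0, 0.0], [2.0, 3.0]]
interference = [[0.0, 1.0], [2.0, 4.0]]
mask = ideal_ratio_mask(target, interference)
```

Each mask value lies between 0 and 1: bins dominated by the target approach 1 and bins dominated by interference approach 0. At separation time, a network-estimated mask would typically be multiplied element-wise with the mixture magnitude spectrogram.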