Use of bimodal coherence to resolve the permutation problem in convolutive BSS

Bibliographic Details
Published in: Signal Processing, Vol. 92, No. 8, pp. 1916-1927
Main Authors: Liu, Qingju; Wang, Wenwu; Jackson, Philip
Format: Journal Article
Language: English
Published: Elsevier B.V., 01.08.2012
ISSN: 0165-1684
eISSN: 1872-7557
DOI: 10.1016/j.sigpro.2011.11.007

Summary: Recent studies show that facial information contained in visual speech can be helpful for enhancing the performance of audio-only blind source separation (BSS) algorithms. Such information is exploited through the statistical characterization of the coherence between the audio and visual speech using, e.g., a Gaussian mixture model (GMM). In this paper, we present three contributions. With the synchronized features, we propose an adapted expectation maximization (AEM) algorithm to model the audio-visual coherence in the off-line training process. To improve the accuracy of this coherence model, we use a frame selection scheme to discard nonstationary features. Then, with the coherence maximization technique, we develop a new sorting method to solve the permutation problem in the frequency domain. We test our algorithm on a multimodal speech database composed of different combinations of vowels and consonants. The experimental results show that our proposed algorithm outperforms traditional audio-only BSS, which confirms the benefit of using visual speech to assist in the separation of the audio sources.
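
For readers unfamiliar with the approach summarized above, the following minimal Python sketch illustrates the two stages it describes: training a GMM on synchronized audio-visual features, and resolving the per-bin permutation ambiguity of a frequency-domain BSS output by choosing, in each bin, the permutation whose joint audio-visual features score highest under the trained model. This is our illustration, not the authors' code: it uses standard EM via scikit-learn rather than the paper's adapted EM (AEM), omits the frame selection step, and all function names, array layouts, and parameters are assumptions.

# Illustrative sketch of permutation alignment by audio-visual
# coherence maximization. Assumed layouts:
#   av_features : (n_frames, 1 + Dv) matched audio-visual training vectors
#   audio_feats : (F, N, T) one audio feature per frame for each of the
#                 N separated outputs in each of the F frequency bins
#   visual_feats: (N, T, Dv) synchronized visual features per speaker
import itertools

import numpy as np
from sklearn.mixture import GaussianMixture


def train_coherence_gmm(av_features, n_components=8):
    """Fit a GMM to matched (coherent) audio-visual feature vectors
    from the off-line training set (standard EM, standing in for AEM)."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(av_features)
    return gmm


def align_permutations(audio_feats, visual_feats, gmm):
    """Choose one source permutation per frequency bin by maximizing
    the GMM log-likelihood of the paired audio-visual features.
    Returns a list of F permutations (tuples of source indices)."""
    F, N, T = audio_feats.shape
    perms = []
    for f in range(F):
        best, best_ll = None, -np.inf
        for perm in itertools.permutations(range(N)):
            # Score the pairing of separated output perm[n] with speaker
            # n's visual stream, summed over the N pairings; gmm.score
            # returns the mean per-frame log-likelihood.
            ll = sum(
                gmm.score(
                    np.column_stack([audio_feats[f, perm[n]][:, None],
                                     visual_feats[n]])
                )
                for n in range(N)
            )
            if ll > best_ll:
                best, best_ll = perm, ll
        perms.append(best)
    return perms

The exhaustive search costs N! evaluations per bin, which is negligible for the two or three simultaneous speakers typical in such experiments; a greedy or pairwise strategy would be needed for larger source counts.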