Generating Human-Like Behaviors Using Joint, Speech-Driven Models for Conversational Agents

Bibliographic Details
Published in: IEEE Transactions on Audio, Speech, and Language Processing, Vol. 20, No. 8, pp. 2329-2340
Main Authors: Mariooryad, Soroosh; Busso, Carlos
Format: Journal Article
Language: English
Published: Piscataway, NJ: IEEE (Institute of Electrical and Electronics Engineers), 01.10.2012
ISSN: 1558-7916, 1558-7924
DOI: 10.1109/TASL.2012.2201476

Summary: During human communication, every spoken message is intrinsically modulated within different verbal and nonverbal cues that are externalized through various aspects of speech and facial gestures. These communication channels are strongly interrelated, which suggests that generating human-like behavior requires a careful study of their relationship. Neglecting the mutual influence of different communicative channels in the modeling of natural behavior for a conversational agent may result in unrealistic behaviors that can affect the intended visual perception of the animation. This relationship exists both between audiovisual information and within different visual aspects. This paper explores the idea of using joint models to preserve the coupling not only between speech and facial expression, but also within facial gestures. As a case study, the paper focuses on building a speech-driven facial animation framework to generate natural head and eyebrow motions. We propose three dynamic Bayesian networks (DBNs), which make different assumptions about the coupling between speech, eyebrow and head motion. Synthesized animations are produced based on the MPEG-4 facial animation standard, using the audiovisual IEMOCAP database. The experimental results based on perceptual evaluations reveal that the proposed joint models (speech/eyebrow/head) outperform audiovisual models that are separately trained (speech/head and speech/eyebrow).
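To make the joint-modeling idea concrete, the following minimal Python sketch (not the authors' implementation, and not based on the paper's actual model parameters) shows one way a shared, speech-conditioned hidden state can jointly drive head and eyebrow trajectories so the two visual channels remain correlated. The number of states, the quantized speech classes, and the feature names (head pitch, head yaw, eyebrow raise) are hypothetical placeholders chosen only for illustration.

# Illustrative sketch of a "joint" speech-driven model in the spirit of the
# paper's coupled DBNs: a single hidden state, conditioned on a quantized
# speech feature, jointly emits head and eyebrow motion. All matrices and
# feature names are hypothetical placeholders, not values from the paper.
import numpy as np

rng = np.random.default_rng(0)

N_SPEECH = 3   # quantized prosody classes (e.g., low/mid/high pitch+energy)
N_STATE = 4    # shared hidden states driving both gestures

# P(state_t | state_{t-1}, speech_t): speech modulates the shared dynamics.
trans = rng.dirichlet(np.ones(N_STATE), size=(N_STATE, N_SPEECH))

# Joint emission: each shared state has a mean for [head_pitch, head_yaw,
# eyebrow_raise] plus a full covariance, so head and eyebrow motion co-vary.
emit_mean = rng.normal(size=(N_STATE, 3))
emit_cov = np.array([np.eye(3) + 0.5 * np.ones((3, 3)) for _ in range(N_STATE)])

def synthesize(speech_seq):
    """Sample head/eyebrow trajectories driven by a quantized speech sequence."""
    state = 0
    frames = []
    for s in speech_seq:
        state = rng.choice(N_STATE, p=trans[state, s])
        frames.append(rng.multivariate_normal(emit_mean[state], emit_cov[state]))
    return np.array(frames)  # shape (T, 3): head pitch, head yaw, eyebrow raise

# Example: 20 frames of synthetic speech labels -> coupled head/eyebrow motion.
speech = rng.integers(0, N_SPEECH, size=20)
motion = synthesize(speech)
print(motion.shape)

The design point this toy example is meant to convey is that the shared hidden state and the full emission covariance keep the head and eyebrow channels statistically coupled, whereas two separately trained speech/head and speech/eyebrow models would sample the channels independently and could drift apart.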