Multimodal Saliency and Fusion for Movie Summarization Based on Aural, Visual, and Textual Attention

Bibliographic Details
Published in: IEEE Transactions on Multimedia, Vol. 15, No. 7, pp. 1553–1568
Main Authors: Evangelopoulos, Georgios; Zlatintsi, Athanasia; Potamianos, Alexandros; Maragos, Petros; Rapantzikos, Konstantinos; Skoumas, Georgios; Avrithis, Yannis
Format: Journal Article
Language: English
Published: New York, NY: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.11.2013
ISSN: 1520-9210, 1941-0077
DOI: 10.1109/TMM.2013.2267205

More Information
Summary: Multimodal streams of sensory information are naturally parsed and integrated by humans using signal-level feature extraction and higher-level cognitive processes. Detection of attention-invoking audiovisual segments is formulated in this work on the basis of saliency models for the audio, visual, and textual information conveyed in a video stream. Aural or auditory saliency is assessed by cues that quantify multifrequency waveform modulations, extracted through nonlinear operators and energy tracking. Visual saliency is measured through a spatiotemporal attention model driven by intensity, color, and orientation. Textual or linguistic saliency is extracted from part-of-speech tagging of the subtitle information available with most movie distributions. The individual saliency streams, obtained from modality-dependent cues, are integrated into a multimodal saliency curve, modeling the time-varying perceptual importance of the composite video stream and signifying prevailing sensory events. The multimodal saliency representation forms the basis of a generic, bottom-up video summarization algorithm. Different fusion schemes are evaluated on a movie database of multimodal saliency annotations, with comparative results provided across modalities. The produced summaries, based on low-level features and content-independent fusion and selection, are of subjectively high aesthetic and informative quality.
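
The abstract describes a concrete pipeline: per-modality saliency curves (aural via nonlinear energy operators, visual via spatiotemporal attention, textual via part-of-speech tags) are normalized and fused into a single curve from which salient segments are selected. The following is a minimal Python sketch of two of those pieces, not the authors' implementation: the discrete Teager-Kaiser energy operator as a stand-in aural-saliency cue, and a min-max-normalized weighted sum as one simple linear fusion scheme. The weights, the normalization, and the top-10% frame selection are illustrative assumptions.

```python
import numpy as np

def teager_kaiser_energy(x: np.ndarray) -> np.ndarray:
    """Discrete Teager-Kaiser energy operator:
    Psi[x](n) = x(n)^2 - x(n-1) * x(n+1).
    A nonlinear operator tracking instantaneous signal energy;
    large values flag potentially salient audio events."""
    energy = np.zeros(len(x), dtype=float)
    energy[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    return energy

def normalize(curve: np.ndarray) -> np.ndarray:
    """Min-max normalize a saliency curve to [0, 1]."""
    lo, hi = float(curve.min()), float(curve.max())
    return (curve - lo) / (hi - lo) if hi > lo else np.zeros_like(curve)

def fuse_saliency(aural, visual, textual, weights=(1/3, 1/3, 1/3)):
    """Weighted linear fusion of three per-frame saliency curves."""
    curves = [normalize(np.asarray(c, dtype=float))
              for c in (aural, visual, textual)]
    return sum(w * c for w, c in zip(weights, curves))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 500                      # number of video frames (illustrative)
    aural = teager_kaiser_energy(rng.standard_normal(n))
    visual = rng.random(n)       # stand-ins for real visual/textual saliency
    textual = rng.random(n)
    saliency = fuse_saliency(aural, visual, textual)
    top = np.argsort(saliency)[-n // 10:]   # top 10% most salient frames
    print(f"selected {top.size} candidate summary frames")
```

Since the paper explicitly evaluates "different fusion schemes," the equal weights above should be read as a placeholder, not as the scheme the authors recommend.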