Multimodal Saliency and Fusion for Movie Summarization Based on Aural, Visual, and Textual Attention
Multimodal streams of sensory information are naturally parsed and integrated by humans using signal-level feature extraction and higher level cognitive processes. Detection of attention-invoking audiovisual segments is formulated in this work on the basis of saliency models for the audio, visual, a...
Saved in:
| Published in | IEEE transactions on multimedia Vol. 15; no. 7; pp. 1553 - 1568 |
|---|---|
| Main Authors | , , , , , , |
| Format | Journal Article |
| Language | English |
| Published |
New York, NY
IEEE
01.11.2013
Institute of Electrical and Electronics Engineers The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
| Subjects | |
| Online Access | Get full text |
| ISSN | 1520-9210 1941-0077 |
| DOI | 10.1109/TMM.2013.2267205 |
Cover
| Summary: | Multimodal streams of sensory information are naturally parsed and integrated by humans using signal-level feature extraction and higher level cognitive processes. Detection of attention-invoking audiovisual segments is formulated in this work on the basis of saliency models for the audio, visual, and textual information conveyed in a video stream. Aural or auditory saliency is assessed by cues that quantify multifrequency waveform modulations, extracted through nonlinear operators and energy tracking. Visual saliency is measured through a spatiotemporal attention model driven by intensity, color, and orientation. Textual or linguistic saliency is extracted from part-of-speech tagging on the subtitles information available with most movie distributions. The individual saliency streams, obtained from modality-depended cues, are integrated in a multimodal saliency curve, modeling the time-varying perceptual importance of the composite video stream and signifying prevailing sensory events. The multimodal saliency representation forms the basis of a generic, bottom-up video summarization algorithm. Different fusion schemes are evaluated on a movie database of multimodal saliency annotations with comparative results provided across modalities. The produced summaries, based on low-level features and content-independent fusion and selection, are of subjectively high aesthetic and informative quality. |
|---|---|
| Bibliography: | ObjectType-Article-2 SourceType-Scholarly Journals-1 ObjectType-Feature-1 content type line 14 content type line 23 |
| ISSN: | 1520-9210 1941-0077 |
| DOI: | 10.1109/TMM.2013.2267205 |