MPN: Multimodal Parallel Network for Audio-Visual Event Localization

Bibliographic Details
Published in: Proceedings (IEEE International Conference on Multimedia and Expo), pp. 1-6
Main Authors: Yu, Jiashuo; Cheng, Ying; Feng, Rui
Format: Conference Proceeding
Language: English
Published: IEEE, 05.07.2021
ISSN: 1945-788X
DOI: 10.1109/ICME51207.2021.9428373

More Information
Summary: Audio-visual event localization aims to localize an event that is both audible and visible in unconstrained videos, a central task in audio-visual scene analysis. To address this task, we propose a Multimodal Parallel Network (MPN), which can perceive global semantics and unmixed local information in parallel. Specifically, our MPN framework consists of a classification subnetwork that predicts event categories and a localization subnetwork that predicts event boundaries. The classification subnetwork is built around the Multimodal Co-attention Module (MCM) and captures global context. The localization subnetwork consists of the Multimodal Bottleneck Attention Module (MBAM), which is designed to extract fine-grained segment-level content. Extensive experiments demonstrate that our framework achieves state-of-the-art performance in both fully supervised and weakly supervised settings on the Audio-Visual Event (AVE) dataset.