VideoGrounding-DINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding

Bibliographic Details
Published in 2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), pp. 18909-18918
Main Authors Wasim, Syed Talal, Naseer, Muzammal, Khan, Salman, Yang, Ming-Hsuan, Khan, Fahad Shahbaz
Format Conference Proceeding
Language English
Published IEEE 16.06.2024
Series IEEE Conference on Computer Vision and Pattern Recognition
ISBN 9798350353013
9798350353006
ISSN 1063-6919
DOI 10.1109/CVPR52733.2024.01789

More Information
Summary: Video grounding aims to localize a spatio-temporal section in a video corresponding to an input text query. This paper addresses a critical limitation in current video grounding methodologies by introducing an Open-Vocabulary Spatio-Temporal Video Grounding task. Unlike prevalent closed-set approaches that struggle with open-vocabulary scenarios due to limited training data and pre-defined vocabularies, our model leverages pre-trained representations from foundational spatial grounding models. This empowers it to effectively bridge the semantic gap between natural language and diverse visual content, achieving strong performance in closed-set and open-vocabulary settings. Our contributions include a novel spatio-temporal video grounding model, surpassing state-of-the-art results in closed-set evaluations on multiple datasets and demonstrating superior performance in open-vocabulary scenarios. Notably, the proposed model outperforms state-of-the-art methods in closed-set settings on the VidSTG (Declarative and Interrogative) and HC-STVG (V1 and V2) datasets. Furthermore, in open-vocabulary evaluations on HC-STVG V1 and YouCook-Interactions, our model surpasses the recent best-performing models by 4.88 m_vIoU and 1.83% accuracy, demonstrating its efficacy in handling diverse linguistic and visual concepts for improved video understanding. Our code will be publicly released.
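
The abstract's central architectural idea, reusing a pre-trained spatial grounding backbone and adding temporal reasoning on top of its per-frame outputs, can be illustrated with a minimal PyTorch sketch. This is not the paper's implementation: the module names, feature shapes, and the simple additive text conditioning below are assumptions made purely for illustration, and the spatial grounding backbone is treated as a frozen per-frame feature extractor.

import torch
import torch.nn as nn


class TemporalGroundingHead(nn.Module):
    """Hypothetical temporal head placed on top of frozen per-frame grounding features."""

    def __init__(self, dim: int = 256, num_layers: int = 2, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.box_head = nn.Linear(dim, 4)        # per-frame box (cx, cy, w, h), normalized
        self.boundary_head = nn.Linear(dim, 2)   # per-frame start/end logits

    def forward(self, frame_feats: torch.Tensor, text_feat: torch.Tensor):
        # frame_feats: (B, T, dim) pooled per-frame features from the frozen
        #              spatial grounding backbone for the queried phrase (assumed)
        # text_feat:   (B, dim) pooled text embedding of the query (assumed)
        x = frame_feats + text_feat.unsqueeze(1)     # simple additive text conditioning
        x = self.temporal_encoder(x)                 # share context across frames
        boxes = self.box_head(x).sigmoid()           # (B, T, 4) per-frame boxes
        boundaries = self.boundary_head(x)           # (B, T, 2) start/end logits
        return boxes, boundaries


if __name__ == "__main__":
    B, T, dim = 2, 16, 256
    head = TemporalGroundingHead(dim)
    boxes, boundaries = head(torch.randn(B, T, dim), torch.randn(B, dim))
    print(boxes.shape, boundaries.shape)  # torch.Size([2, 16, 4]) torch.Size([2, 16, 2])

In this sketch the temporal encoder shares context across frames so that the per-frame boxes form a coherent tube, while the boundary head scores each frame as a candidate start or end of the grounded segment; the open-vocabulary capability would come entirely from the frozen spatial grounding backbone assumed to supply frame_feats and text_feat.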