VideoGrounding-DINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding

Bibliographic Details
Published in 2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), pp. 18909-18918
Main Authors Wasim, Syed Talal, Naseer, Muzammal, Khan, Salman, Yang, Ming-Hsuan, Khan, Fahad Shahbaz
Format Conference Proceeding
Language English
Published IEEE 16.06.2024
Series IEEE Conference on Computer Vision and Pattern Recognition
ISBN 9798350353013
9798350353006
ISSN 1063-6919
DOI 10.1109/CVPR52733.2024.01789

More Information
Summary: Video grounding aims to localize a spatio-temporal section in a video corresponding to an input text query. This paper addresses a critical limitation in current video grounding methodologies by introducing an Open-Vocabulary Spatio-Temporal Video Grounding task. Unlike prevalent closed-set approaches that struggle with open-vocabulary scenarios due to limited training data and pre-defined vocabularies, our model leverages pre-trained representations from foundational spatial grounding models. This empowers it to effectively bridge the semantic gap between natural language and diverse visual content, achieving strong performance in closed-set and open-vocabulary settings. Our contributions include a novel spatio-temporal video grounding model, surpassing state-of-the-art results in closed-set evaluations on multiple datasets and demonstrating superior performance in open-vocabulary scenarios. Notably, the proposed model outperforms state-of-the-art methods in closed-set settings on the VidSTG (Declarative and Interrogative) and HC-STVG (V1 and V2) datasets. Furthermore, in open-vocabulary evaluations on HC-STVG V1 and YouCook-Interactions, our model surpasses the recent best-performing models by 4.88 m_vIoU and 1.83% accuracy, demonstrating its efficacy in handling diverse linguistic and visual concepts for improved video understanding. Our code will be publicly released.
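
The abstract's central architectural idea, reusing a pre-trained spatial grounding backbone and adding temporal reasoning on top of its per-frame outputs, can be illustrated with a minimal PyTorch sketch. This is not the paper's implementation: the module names, feature shapes, and the simple additive text conditioning below are assumptions made purely for illustration, and the spatial grounding backbone is treated as a frozen per-frame feature extractor.

import torch
import torch.nn as nn


class TemporalGroundingHead(nn.Module):
    """Hypothetical temporal head placed on top of frozen per-frame grounding features."""

    def __init__(self, dim: int = 256, num_layers: int = 2, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.box_head = nn.Linear(dim, 4)        # per-frame box (cx, cy, w, h), normalized
        self.boundary_head = nn.Linear(dim, 2)   # per-frame start/end logits

    def forward(self, frame_feats: torch.Tensor, text_feat: torch.Tensor):
        # frame_feats: (B, T, dim) pooled per-frame features from the frozen
        #              spatial grounding backbone for the queried phrase (assumed)
        # text_feat:   (B, dim) pooled text embedding of the query (assumed)
        x = frame_feats + text_feat.unsqueeze(1)     # simple additive text conditioning
        x = self.temporal_encoder(x)                 # share context across frames
        boxes = self.box_head(x).sigmoid()           # (B, T, 4) per-frame boxes
        boundaries = self.boundary_head(x)           # (B, T, 2) start/end logits
        return boxes, boundaries


if __name__ == "__main__":
    B, T, dim = 2, 16, 256
    head = TemporalGroundingHead(dim)
    boxes, boundaries = head(torch.randn(B, T, dim), torch.randn(B, dim))
    print(boxes.shape, boundaries.shape)  # torch.Size([2, 16, 4]) torch.Size([2, 16, 2])

In this sketch the temporal encoder shares context across frames so that the per-frame boxes form a coherent tube, while the boundary head scores each frame as a candidate start or end of the grounded segment; the open-vocabulary capability would come entirely from the frozen spatial grounding backbone assumed to supply frame_feats and text_feat.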