Enhancing Text-Video Retrieval Performance With Low-Salient but Discriminative Objects

Bibliographic Details
Published in	IEEE Transactions on Image Processing, Vol. 34, pp. 581-593
Main Authors	Zheng, Yanwei; Huang, Bowen; Chen, Zekai; Yu, Dongxiao
Format	Journal Article
Language	English
Published	United States: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.01.2025
Subjects	Aggregates; Computational modeling; Context modeling; cross-modal attention; Encoding; Feature extraction; Indexes; low-salient but discriminative objects; Prototypes; Retrieval; Semantics; Text-video retrieval; Transformers; Visualization
Online Access	Get full text
ISSN	1057-7149, 1941-0042
DOI	10.1109/TIP.2025.3527369

Summary:	Text-video retrieval aims to establish a matching relationship between a video and its corresponding text. However, previous works have primarily focused on salient video subjects, such as humans or animals, often overlooking Low-Salient but Discriminative Objects (LSDOs) that play a critical role in understanding content. To address this limitation, we propose a novel model that enhances retrieval performance by emphasizing these overlooked elements across video and text modalities. In the video modality, our model first incorporates a feature selection module to gather video-level LSDO features, and applies cross-modal attention to assign frame-specific weights based on relevance, yielding frame-level LSDO features. In the text modality, text-level LSDO features are captured by generating multiple object prototypes in a sparse aggregation manner. Extensive experiments on benchmark datasets, including MSR-VTT, MSVD, LSMDC, and DiDeMo, demonstrate that our model achieves state-of-the-art results across various evaluation metrics.
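
The frame-weighting step mentioned in the summary can be illustrated with a generic text-conditioned attention sketch. This is not the authors' implementation: the function names, the dot-product similarity, the softmax normalization, and the `temperature` parameter are all assumptions chosen for illustration, and the paper's actual modules may differ.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def frame_weights(text_vec, frame_vecs, temperature=1.0):
    """Hypothetical helper: one weight per frame, from text-frame similarity."""
    return softmax([dot(text_vec, f) for f in frame_vecs], temperature)

def weighted_video_feature(text_vec, frame_vecs, temperature=1.0):
    """Hypothetical helper: frame features pooled by their text-relevance weights."""
    weights = frame_weights(text_vec, frame_vecs, temperature)
    dim = len(frame_vecs[0])
    return [sum(w * f[d] for w, f in zip(weights, frame_vecs)) for d in range(dim)]

# Toy example: the first frame aligns with the text, the second does not.
text = [1.0, 0.0]
frames = [[1.0, 0.0], [0.0, 1.0]]
weights = frame_weights(text, frames)
pooled = weighted_video_feature(text, frames)
```

In this toy case the frame that matches the text receives the larger weight, so it dominates the pooled feature. A lower `temperature` would sharpen that preference.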