Using Multimodal Contrastive Knowledge Distillation for Video-Text Retrieval

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, Vol. 33, No. 10, p. 1
Main Authors: Ma, Wentao; Chen, Qingchao; Zhou, Tongqing; Zhao, Shan; Cai, Zhiping
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.10.2023
ISSN: 1051-8215, 1558-2205
DOI: 10.1109/TCSVT.2023.3257193

More Information
Summary: Cross-modal retrieval aims to enable a flexible bi-directional retrieval experience across different modalities (e.g., searching for videos with texts). Many existing efforts learn a common semantic embedding space in which items of different modalities can be directly compared: the global representations of positive video-text pairs are pulled close while those of negative pairs are pushed apart via a pair-wise ranking loss. However, such a vanilla loss unfortunately yields ambiguous feature embeddings for texts of different videos, causing inaccurate cross-modal matching and unreliable retrievals. Toward this end, we propose a multimodal contrastive knowledge distillation method for instance video-text retrieval, called MCKD, which adaptively uses the general knowledge of a self-supervised model (the teacher) to calibrate mixed boundaries. Specifically, the teacher model is tailored for a robust (less ambiguous) visual-text joint semantic space by maximizing the mutual information of co-occurring modalities during multimodal contrastive learning. This robust and structural inter-instance knowledge is then distilled, with the help of an explicit discrimination loss, to a student model for improved matching performance. Extensive experiments on four public benchmark video-text datasets (MSR-VTT, TGIF, VATEX, and Youtube2Text) demonstrate that MCKD achieves up to 8.8%, 6.4%, 5.9%, and 5.3% improvement in text-to-video performance on the R@1 metric, compared with 14 SoTA baselines.
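The two ingredients the summary describes, namely a teacher trained to maximize mutual information between co-occurring video and text representations via multimodal contrastive learning, and a distillation step that transfers the teacher's inter-instance structure to a student, can be illustrated with a minimal sketch. The function names, embedding sizes, temperatures, KL-based transfer term, and loss weighting below are illustrative assumptions for a generic InfoNCE-plus-relational-distillation setup, not the exact objectives defined in the MCKD paper.

```python
# Minimal sketch (PyTorch) of the ideas summarized above, under the stated assumptions:
#  1) an InfoNCE-style contrastive loss that maximizes mutual information
#     between co-occurring video/text embeddings (teacher-style objective), and
#  2) a relational distillation loss that transfers a teacher's
#     cross-modal similarity structure to a student.
import torch
import torch.nn.functional as F


def info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                      # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)  # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))


def relation_distill(student_v, student_t, teacher_v, teacher_t, tau=0.1):
    """Match the student's cross-modal similarity distribution to the teacher's."""
    s_sim = F.normalize(student_v, dim=-1) @ F.normalize(student_t, dim=-1).T
    t_sim = F.normalize(teacher_v, dim=-1) @ F.normalize(teacher_t, dim=-1).T
    return F.kl_div(F.log_softmax(s_sim / tau, dim=-1),
                    F.softmax(t_sim / tau, dim=-1),
                    reduction="batchmean")


# Example usage with random tensors standing in for encoder outputs.
B, D = 8, 256
sv, st = torch.randn(B, D), torch.randn(B, D)            # student embeddings
with torch.no_grad():
    tv, tt = torch.randn(B, D), torch.randn(B, D)         # frozen teacher embeddings
loss = info_nce(sv, st) + 1.0 * relation_distill(sv, st, tv, tt)
```

In this sketch the distillation term operates on whole similarity matrices rather than individual pairs, which is one common way to transfer inter-instance (relational) structure; the paper's explicit discrimination loss may differ in form.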