Fine-grained Textual Inversion Network for Zero-Shot Composed Image Retrieval
Format: Journal Article
Language: English
Published: 24.03.2025
DOI: 10.48550/arxiv.2503.19296
Summary:
Composed Image Retrieval (CIR) allows users to search for target images with a multimodal query comprising a reference image and a modification text that describes the user's desired modification to the reference image. Nevertheless, due to the expensive labor cost of training data annotation, recent researchers have shifted to the challenging task of zero-shot CIR (ZS-CIR), which aims to fulfill CIR without annotated triplets. Pioneering ZS-CIR studies convert the CIR task into a standard text-to-image retrieval task by pre-training a textual inversion network that maps a given image into a single pseudo-word token. Despite their significant progress, such coarse-grained textual inversion may be insufficient to capture the full content of the image accurately. To overcome this issue, in this work, we propose a novel Fine-grained Textual Inversion Network for ZS-CIR, named FTI4CIR. In particular, FTI4CIR comprises two main components: fine-grained pseudo-word token mapping and tri-wise caption-based semantic regularization. The former maps the image into a subject-oriented pseudo-word token and several attribute-oriented pseudo-word tokens to comprehensively express the image in textual form, while the latter jointly aligns the fine-grained pseudo-word tokens to the real-word token embedding space based on a BLIP-generated image caption template. Extensive experiments conducted on three benchmark datasets demonstrate the superiority of our proposed method.
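The following is a minimal, hedged sketch of the fine-grained textual-inversion idea summarized above, not the authors' FTI4CIR code: the module names, feature dimensions, number of attribute tokens, and prompt template are all illustrative assumptions, and the frozen image encoder is replaced by a placeholder tensor.

```python
# Hedged sketch of the textual-inversion idea described in the abstract,
# NOT the authors' FTI4CIR implementation. Names, sizes, and the prompt
# composition scheme below are illustrative assumptions.
import torch
import torch.nn as nn

class FineGrainedTokenMapper(nn.Module):
    """Maps a global image feature to one subject-oriented pseudo-word token
    and K attribute-oriented pseudo-word tokens (assumed simple MLP heads)."""
    def __init__(self, img_dim: int = 768, tok_dim: int = 512, num_attr: int = 3):
        super().__init__()
        self.subject_head = nn.Sequential(nn.Linear(img_dim, tok_dim), nn.GELU(),
                                          nn.Linear(tok_dim, tok_dim))
        self.attr_head = nn.Sequential(nn.Linear(img_dim, tok_dim), nn.GELU(),
                                       nn.Linear(tok_dim, num_attr * tok_dim))
        self.num_attr, self.tok_dim = num_attr, tok_dim

    def forward(self, img_feat: torch.Tensor):
        subj = self.subject_head(img_feat)                       # (B, tok_dim)
        attrs = self.attr_head(img_feat).view(-1, self.num_attr, self.tok_dim)
        return subj, attrs                                       # pseudo-word embeddings

# Usage: the pseudo-word tokens would be spliced into a caption-style prompt
# such as "a photo of <S*> that is <A1*> <A2*> <A3*>, <modification text>"
# and fed to a frozen text encoder for standard text-to-image retrieval.
mapper = FineGrainedTokenMapper()
image_feature = torch.randn(1, 768)   # placeholder for a frozen image encoder feature
subject_tok, attribute_toks = mapper(image_feature)
print(subject_tok.shape, attribute_toks.shape)  # (1, 512) and (1, 3, 512)
```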