CLARTEMIS: CLIP-Based Attention-Based Retrieval with Text-Explicit Matching and Implicit Similarity
| Published in | International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP) (Online), pp. 1-5 |
|---|---|
| Format | Conference Proceeding |
| Language | English |
| Published | IEEE, 14.12.2024 |
| ISBN | 9798331519230 |
| ISSN | 2576-8964 |
| DOI | 10.1109/ICCWAMTIP64812.2024.10873653 |
| Summary: | This study tackles the challenge of image retrieval using text feedback, a crucial task in e-commerce. To address this, we introduce an improved framework called CLARTEMIS, built upon the foundation of ARTEMIS [4]. Our proposed approach incorporates a pre-trained vision-language model to align image and text features within a unified semantic space. This integration not only streamlines multimodal representation learning but also enhances the consistency of the feature space for correlation modeling. The CLARTEMIS framework utilizes a robust joint embedding strategy, effectively aligning reference images, modification texts, and target images. Comprehensive experiments on a benchmark dataset validate the effectiveness of our method, demonstrating notable improvements in retrieval accuracy and underscoring the potential of pre-trained models in advancing multimodal retrieval systems. |
|---|---|
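The record gives only a high-level description of CLARTEMIS: CLIP-style image and text encoders project the reference image, the modification text, and candidate target images into one semantic space, and an ARTEMIS-style scorer combines an explicit text-target matching term with a text-conditioned reference-target similarity term. The sketch below illustrates that scoring scheme under stated assumptions; it is not the authors' implementation, and the module names, gating layers, and score combination are illustrative guesses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ClartemisScorer(nn.Module):
    """Illustrative sketch of ARTEMIS-style scoring in a shared CLIP space.

    Assumptions (not from the paper):
      - `image_encoder` / `text_encoder` are pre-trained CLIP-like modules
        that map their inputs to (batch, dim) embeddings in the same space.
      - Text-conditioned sigmoid gates stand in for the attention mechanisms.
    """

    def __init__(self, image_encoder, text_encoder, dim=512):
        super().__init__()
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        # Gates that re-weight embedding dimensions based on the modification text.
        self.em_gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.is_gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, ref_images, mod_texts, tgt_images):
        # Embed all three inputs into the unified semantic space.
        r = F.normalize(self.image_encoder(ref_images), dim=-1)  # reference image
        t = F.normalize(self.text_encoder(mod_texts), dim=-1)    # modification text
        g = F.normalize(self.image_encoder(tgt_images), dim=-1)  # target image

        # Explicit Matching: does the (text-attended) target match the text?
        em_score = F.cosine_similarity(self.em_gate(t) * g, t, dim=-1)

        # Implicit Similarity: reference vs. target, compared only along
        # the dimensions the text marks as relevant.
        w = self.is_gate(t)
        is_score = F.cosine_similarity(w * r, w * g, dim=-1)

        # Final retrieval score combines both contributions.
        return em_score + is_score
```

At retrieval time such a scorer would be evaluated between one (reference image, modification text) query and every candidate target image, with candidates ranked by the combined score; the hyperparameters and exact fusion used in CLARTEMIS are not specified in this record.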