CLARTEMIS: CLIP-Based Attention-Based Retrieval with Text-Explicit Matching and Implicit Similarity

Bibliographic Details
Published in: International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP) (Online), pp. 01-05
Main Authors: Tran, Hoang-Anh; Le, Hoanh-Su; Nguyen, Phuc
Format: Conference Proceeding
Language: English
Published: IEEE, 14.12.2024
ISBN: 9798331519230
ISSN: 2576-8964
DOI: 10.1109/ICCWAMTIP64812.2024.10873653


Summary: This study tackles the challenge of image retrieval using text feedback, a crucial task in e-commerce. To address this, we introduce an improved framework called CLARTEMIS, built upon the foundation of ARTEMIS [4]. Our proposed approach incorporates a pre-trained vision-language model to align image and text features within a unified semantic space. This integration not only streamlines multimodal representation learning but also enhances the consistency of the feature space for correlation modeling. The CLARTEMIS framework utilizes a robust joint embedding strategy, effectively aligning reference images, modification texts, and target images. Comprehensive experiments on a benchmark dataset validate the effectiveness of our method, demonstrating notable improvements in retrieval accuracy and underscoring the potential of pre-trained models in advancing multimodal retrieval systems.
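The retrieval setting described above (score candidate target images against a reference image plus a modification text embedded in one shared space) can be illustrated with a minimal sketch. This is not the paper's scoring function: CLARTEMIS/ARTEMIS use attention-based explicit-matching and implicit-similarity terms, whereas this sketch assumes pre-computed CLIP-style embeddings and a simple additive fusion baseline; all function names and the toy vectors are hypothetical.

```python
import numpy as np

def l2_normalize(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Project vectors onto the unit sphere so dot product = cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def rank_targets(ref_img_emb: np.ndarray,
                 mod_txt_emb: np.ndarray,
                 target_embs: np.ndarray):
    """Rank candidate targets for (reference image, modification text).

    Assumes all embeddings already live in one joint image-text space
    (e.g. produced by a pre-trained vision-language encoder). Fusion by
    addition is a common baseline, not the paper's attention mechanism.
    """
    query = l2_normalize(ref_img_emb + mod_txt_emb)      # composed query
    targets = l2_normalize(target_embs)                  # (n, d) candidates
    scores = targets @ query                             # cosine similarities
    order = np.argsort(-scores)                          # best match first
    return order, scores

# Toy 2-D example: the modification text "rotates" the query toward target 0.
ref = np.array([1.0, 0.0])
mod = np.array([0.0, 1.0])
candidates = np.array([[1.0, 1.0],    # matches image + text jointly
                       [1.0, 0.0],    # matches the reference image only
                       [0.0, -1.0]])  # contradicts the modification
order, scores = rank_targets(ref, mod, candidates)
print(order[0])  # → 0
```

In a real pipeline the three embedding sets would come from the frozen or fine-tuned vision-language encoders, and ranking would run over the whole gallery rather than three toy vectors.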