Dynamic cache-updating contrastive language-image pre-training model for few-shot remote sensing image classification


Bibliographic Details
Published in: Journal of Applied Remote Sensing, Vol. 19, No. 2, p. 026502
Main Authors: Xu, Cheng; Ou, Zhengyu; Han, Zandong
Format: Journal Article
Language: English
Published: Society of Photo-Optical Instrumentation Engineers, 01.04.2025
ISSN: 1931-3195
DOI: 10.1117/1.JRS.19.026502


More Information
Summary: Remote sensing images often contain sensitive information related to national security and military secrets, making it difficult to acquire sufficient data samples. Although contrastive language-image pre-training (CLIP) supports zero-shot recognition, its pre-training on general images limits its effectiveness for remote sensing classification. The Tip-Adapter method improves adaptation by introducing a knowledge cache, but its performance is limited when the cache size is small. To address these challenges, we propose a dynamic cache-updating CLIP model that enhances classification accuracy by iteratively selecting Top-K pseudo-labels from Tip-Adapter's predictions and updating the knowledge cache. An adaptive weight adjustment module is also introduced to balance performance across categories, ensuring that classes with lower accuracy receive more focus, thus mitigating accuracy drops due to high inter-class similarity. We investigate the impact of different Top-K values and compare the use of soft versus hard labels for pseudo-labeling. Results show that hard labels provide clearer assignments and lead to better performance. Although increasing the Top-K value improves accuracy by expanding the knowledge cache, excessive Top-K values reduce label credibility, necessitating a balance between cache size and label quality. On the EuroSAT dataset, with Tip-Adapter as the base model, updating the top-2 hard labels for 15 iterations increased one-shot accuracy from 54.89% to 73.25% and two-shot accuracy from 59.41% to 74.78%. Similarly, with Tip-Adapter-F as the base model, one-shot accuracy improved from 65.06% to 79.12%, and two-shot accuracy from 65.11% to 80.78%. Similar improvements were observed on the UCMerced (UCM) and NWPU-RESISC45 datasets, further demonstrating the effectiveness and generalizability of the proposed method. Our code will be released at https://github.com/xu-c22/DCU-CLIP.
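The cache-update loop the abstract describes can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' released implementation: the Tip-Adapter scoring form (an `exp(-beta * (1 - affinity))` kernel over cached features, blended with CLIP logits via `alpha`) follows the published Tip-Adapter formulation, while the function names, hyperparameter values, and the per-class Top-K selection details are assumptions for illustration.

```python
import numpy as np

def tip_adapter_logits(feats, cache_keys, cache_vals, clip_logits,
                       alpha=1.0, beta=5.5):
    """Tip-Adapter-style scoring: blend CLIP zero-shot logits with a
    similarity-weighted vote over the cached few-shot features.
    feats: (N, D) L2-normalized query features; cache_keys: (M, D);
    cache_vals: (M, C) one-hot labels; clip_logits: (N, C)."""
    affinity = feats @ cache_keys.T                       # (N, M) cosine sims
    cache_logits = np.exp(-beta * (1.0 - affinity)) @ cache_vals
    return clip_logits + alpha * cache_logits

def update_cache(feats, cache_keys, cache_vals, clip_logits,
                 num_classes, top_k=2):
    """One dynamic-update iteration: predict on an unlabeled pool, then
    append the Top-K most confident samples per class to the cache with
    hard one-hot pseudo-labels (dedup across iterations omitted here)."""
    logits = tip_adapter_logits(feats, cache_keys, cache_vals, clip_logits)
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    preds = probs.argmax(axis=1)
    conf = probs.max(axis=1)
    new_keys, new_vals = [cache_keys], [cache_vals]
    for c in range(num_classes):
        idx = np.where(preds == c)[0]
        if idx.size == 0:
            continue                                      # no confident hits
        top = idx[np.argsort(-conf[idx])[:top_k]]         # most confident K
        one_hot = np.zeros((top.size, num_classes))
        one_hot[:, c] = 1.0                               # hard pseudo-label
        new_keys.append(feats[top])
        new_vals.append(one_hot)
    return np.vstack(new_keys), np.vstack(new_vals)
```

Iterating `update_cache` (the paper reports 15 iterations with top-2 hard labels on EuroSAT) grows the knowledge cache beyond the original few-shot samples, which is the mechanism the abstract credits for the one- and two-shot accuracy gains; too large a `top_k` would admit low-confidence pseudo-labels, matching the cache-size/label-quality trade-off the authors describe.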