Dynamic cache-updating contrastive language-image pre-training model for few-shot remote sensing image classification
| Published in | Journal of Applied Remote Sensing, Vol. 19, No. 2, p. 026502 |
|---|---|
| Main Authors | |
| Format | Journal Article |
| Language | English |
| Published | Society of Photo-Optical Instrumentation Engineers, 01.04.2025 |
| ISSN | 1931-3195 |
| DOI | 10.1117/1.JRS.19.026502 |
| Summary: | Remote sensing images often contain sensitive information related to national security and military secrets, making it difficult to acquire sufficient data samples. Although contrastive language-image pre-training (CLIP) supports zero-shot recognition, its pre-training on general images limits its effectiveness for remote sensing classification. The Tip-Adapter method improves adaptation by introducing a knowledge cache, but its performance is limited when the cache size is small. To address these challenges, we propose a dynamic cache-updating CLIP model that enhances classification accuracy by iteratively selecting Top-K pseudo-labels from Tip-Adapter’s predictions and updating the knowledge cache. An adaptive weight adjustment module is also introduced to balance performance across categories, ensuring that classes with lower accuracy receive more focus and thus mitigating accuracy drops caused by high inter-class similarity. We investigate the impact of different Top-K values and compare the use of soft versus hard labels for pseudo-labeling. Results show that hard labels provide clearer assignments and lead to better performance. Although increasing the Top-K value improves accuracy by expanding the knowledge cache, excessive Top-K values reduce label credibility, necessitating a balance between cache size and label quality. On the EuroSAT dataset, with Tip-Adapter as the base model, updating the top-2 hard labels for 15 iterations increased one-shot accuracy from 54.89% to 73.25% and two-shot accuracy from 59.41% to 74.78%. Similarly, with Tip-Adapter-F as the base model, one-shot accuracy improved from 65.06% to 79.12%, and two-shot accuracy from 65.11% to 80.78%. Similar improvements were observed on the UCMerced (UCM) and NWPU-RESISC45 datasets, further demonstrating the effectiveness and generalizability of the proposed method. Our code will be released at https://github.com/xu-c22/DCU-CLIP. |
|---|---|
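The cache-update step the summary describes, selecting the Top-K most confident predictions per class, converting them to hard (one-hot) pseudo-labels, and appending them to the knowledge cache, could be sketched as below. This is a minimal NumPy illustration under stated assumptions: the function name `update_cache`, the array layout (rows are image features, logits are class scores), and the per-class Top-K selection are assumptions for exposition, not the authors' released implementation.

```python
import numpy as np

def update_cache(cache_keys, cache_values, feats, logits, top_k=2):
    """One iteration of dynamic cache updating (illustrative sketch).

    cache_keys:   (M, D) cached image features
    cache_values: (M, C) cached one-hot labels
    feats:        (N, D) features of unlabeled images
    logits:       (N, C) class scores predicted for those images
    Appends the top_k most confident pseudo-labeled samples of each
    class to the cache, using hard (one-hot) labels.
    """
    num_classes = logits.shape[1]
    preds = logits.argmax(axis=1)   # hard pseudo-label per image
    conf = logits.max(axis=1)       # confidence of that label
    new_keys, new_vals = [], []
    for c in range(num_classes):
        idx = np.where(preds == c)[0]
        if idx.size == 0:
            continue
        # keep only the top_k most confident samples predicted as class c
        top = idx[np.argsort(conf[idx])[::-1][:top_k]]
        for i in top:
            onehot = np.zeros(num_classes)
            onehot[c] = 1.0         # hard label, not the soft logits
            new_keys.append(feats[i])
            new_vals.append(onehot)
    if new_keys:
        cache_keys = np.vstack([cache_keys, np.stack(new_keys)])
        cache_values = np.vstack([cache_values, np.stack(new_vals)])
    return cache_keys, cache_values
```

Repeating this selection for several iterations (15 in the reported experiments) grows the cache beyond the original few-shot samples; the trade-off noted in the summary is that a larger `top_k` enlarges the cache but admits less credible pseudo-labels.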