ESVT: Event-Based Streaming Vision Transformer for Challenging Object Detection
| Published in | IEEE Transactions on Geoscience and Remote Sensing Vol. 63; pp. 1 - 13 |
|---|---|
| Format | Journal Article |
| Language | English |
| Published | New York: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 2025 |
| ISSN | 0196-2892 1558-0644 |
| DOI | 10.1109/TGRS.2025.3527474 |
| Summary: | Object detection is a crucial task in the field of remote sensing. Currently, frame-based algorithms have demonstrated impressive performance. However, event cameras have not yet been applied to remote sensing object detection. Meanwhile, three issues remain to be addressed: 1) remote sensing targets are often disrupted by complex backgrounds, resulting in poor detection performance, especially in extremely challenging environments (e.g., low-light, motion-blur, and occlusion scenarios); 2) mainstream deep neural networks primarily employ discrete random-sampling training strategies, which limits their ability to leverage continuous temporal information; and 3) the distribution-shift problem arising from uneven data in streaming training poses challenges for temporal object detection. In this work, we provide the remote sensing event-based object detection dataset (RSEOD), the first remote sensing dataset captured with event cameras that covers a variety of intractable scenarios, offering a novel perspective for object detection in challenging conditions. Additionally, we propose an event-based streaming training strategy that exploits asynchronous event streams to address detection failures caused by prolonged occlusion and defocus. Moreover, we introduce a reversible normalization criterion (RevNorm) to eliminate non-stationary information in temporal data, and we propose a streaming bidirectional feature pyramid network (SBFPN) that propagates features recursively along the temporal dimension. Extensive experiments on the RSEOD dataset demonstrate that our method achieves 38.1% mAP@0.5:0.95 and 55.8% mAP@0.5, outperforming all other state-of-the-art object detection approaches (e.g., YOLOv8, YOLOv10, YOLOv11, DINO, RTDETR, RTDETRv2, SODFormer). The dataset and code are released at https://github.com/Jushl/ESVT . |
|---|---|
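
The abstract does not spell out how RevNorm works; the description ("eliminate non-stationary information in temporal data" with a reversible criterion) resembles the widely used reversible-instance-normalization idea, so the following is only a minimal sketch under that assumption. The class name `RevNorm`, the `[batch, time, channels]` layout, `eps`, and the `norm`/`denorm` modes are illustrative choices, not the authors' implementation (see their repository for the real code).

```python
# Minimal sketch of a reversible normalization layer in the spirit of RevNorm,
# assuming a RevIN-style formulation: remove per-window statistics before the
# network and invert the transform afterwards, so non-stationary shifts in the
# streaming input do not propagate into training. Hypothetical, for illustration.
import torch
import torch.nn as nn


class RevNorm(nn.Module):
    """Reversible per-window normalization with a learnable affine transform."""

    def __init__(self, num_channels: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(num_channels))
        self.beta = nn.Parameter(torch.zeros(num_channels))
        self.mean = None
        self.std = None

    def forward(self, x: torch.Tensor, mode: str) -> torch.Tensor:
        # x: [batch, time, channels]
        if mode == "norm":
            # Record the statistics of the current streaming window ...
            self.mean = x.mean(dim=1, keepdim=True).detach()
            self.std = torch.sqrt(
                x.var(dim=1, keepdim=True, unbiased=False) + self.eps
            ).detach()
            # ... remove them, then apply the learnable affine transform.
            return (x - self.mean) / self.std * self.gamma + self.beta
        elif mode == "denorm":
            # Invert the transform so outputs return to the original scale
            # (gamma is initialized to ones, so the division is well defined).
            return (x - self.beta) / self.gamma * self.std + self.mean
        raise ValueError("mode must be 'norm' or 'denorm'")


if __name__ == "__main__":
    rev = RevNorm(num_channels=8)
    window = torch.randn(2, 16, 8) * 3.0 + 5.0   # non-stationary toy window
    normed = rev(window, mode="norm")            # what a backbone would see
    restored = rev(normed, mode="denorm")        # reverse on the output side
    print(torch.allclose(window, restored, atol=1e-4))  # True up to eps
```

In this reading, the "reversible" property is what lets a streaming detector train on windows with drifting statistics while still producing predictions in the original data scale; whether the paper applies the inverse to features or to regression outputs is not stated in the abstract.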