ESVT: Event-Based Streaming Vision Transformer for Challenging Object Detection


Bibliographic Details
Published in: IEEE Transactions on Geoscience and Remote Sensing, Vol. 63, pp. 1-13
Main Authors: Jing, Shilong; Guo, Guangsha; Xu, Xianda; Zhao, Yuchen; Wang, Hechong; Lv, Hengyi; Feng, Yang; Zhang, Yisa
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2025
ISSN: 0196-2892, 1558-0644
DOI: 10.1109/TGRS.2025.3527474

Summary: Object detection is a crucial task in the field of remote sensing. Frame-based algorithms currently demonstrate impressive performance; however, research applying event cameras to remote sensing has not yet been conducted. Meanwhile, three issues remain to be addressed: 1) remote sensing targets are often disrupted by complex backgrounds, resulting in poor detection performance, especially in extremely challenging environments (e.g., low-light, motion blur, and occlusion scenarios); 2) mainstream deep neural networks primarily employ discrete random-sampling training strategies, which limits their ability to leverage continuous temporal information; and 3) the distribution shift arising from uneven data in streaming training poses challenges for temporal object detection. In this work, we provide the remote sensing event-based object detection dataset (RSEOD), the first remote sensing dataset captured with event cameras that covers a variety of intractable scenarios, offering a novel perspective on object detection in challenging conditions. Additionally, we propose an event-based streaming training strategy that leverages asynchronous event streams to address detection failures caused by prolonged occlusion and defocus. Moreover, we introduce a reversible normalization criterion (RevNorm) to eliminate non-stationary information in temporal data and propose a streaming bidirectional feature pyramid network (SBFPN) that propagates features recursively along the temporal dimension. Extensive experiments on the RSEOD dataset demonstrate that our method achieves 38.1% mAP@0.5:0.95 and 55.8% mAP@0.5, outperforming other state-of-the-art object detection approaches (e.g., YOLOv8, YOLOv10, YOLOv11, DINO, RTDETR, RTDETRv2, SODFormer). The dataset and code are released at https://github.com/Jushl/ESVT .
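
To illustrate the reversible-normalization idea summarized above, the minimal sketch below removes per-sequence mean and standard deviation along the temporal dimension before detection and restores them afterward. It assumes a RevIN-style formulation; the module name RevNormSketch, the tensor layout (batch, time, features), and all parameter choices are illustrative assumptions and are not taken from the paper's exact RevNorm or the released ESVT code (see https://github.com/Jushl/ESVT for the authors' implementation).

import torch
import torch.nn as nn

class RevNormSketch(nn.Module):
    """Hypothetical reversible normalization over the temporal dimension.

    normalize() removes per-sequence statistics (the non-stationary component);
    denormalize() restores them so outputs can be mapped back to the original scale.
    """

    def __init__(self, num_features: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        # Learnable affine parameters, one per feature channel (an assumption).
        self.gamma = nn.Parameter(torch.ones(num_features))
        self.beta = nn.Parameter(torch.zeros(num_features))
        self.mean = None
        self.std = None

    def normalize(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, features); statistics are computed along the time axis.
        self.mean = x.mean(dim=1, keepdim=True).detach()
        self.std = torch.sqrt(x.var(dim=1, keepdim=True, unbiased=False) + self.eps).detach()
        return (x - self.mean) / self.std * self.gamma + self.beta

    def denormalize(self, y: torch.Tensor) -> torch.Tensor:
        # Invert the affine transform and add the removed statistics back.
        return (y - self.beta) / (self.gamma + self.eps) * self.std + self.mean

if __name__ == "__main__":
    layer = RevNormSketch(num_features=8)
    x = torch.randn(2, 16, 8)           # (batch, time, features)
    x_norm = layer.normalize(x)          # stationarized input for a temporal detector
    x_back = layer.denormalize(x_norm)   # round-trip reconstruction check
    print(torch.allclose(x, x_back, atol=1e-4))

In a streaming detector, the normalize step would precede the temporal module (e.g., the recursive feature propagation described for SBFPN) and the denormalize step would follow it, so the network sees stationarized features while predictions remain on the original scale.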