ESVT: Event-Based Streaming Vision Transformer for Challenging Object Detection


Bibliographic Details
Published in: IEEE Transactions on Geoscience and Remote Sensing, Vol. 63, pp. 1-13
Main Authors: Jing, Shilong; Guo, Guangsha; Xu, Xianda; Zhao, Yuchen; Wang, Hechong; Lv, Hengyi; Feng, Yang; Zhang, Yisa
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2025
ISSN: 0196-2892, 1558-0644
DOI: 10.1109/TGRS.2025.3527474

Summary: Object detection is a crucial task in the field of remote sensing. Frame-based algorithms currently demonstrate impressive performance; however, research applying event cameras to remote sensing has not yet been conducted. Meanwhile, three issues remain to be addressed: 1) remote sensing targets are often disrupted by complex backgrounds, resulting in poor detection performance, especially in extremely challenging environments (e.g., low-light, motion blur, and occlusion scenarios); 2) mainstream deep neural networks primarily employ discrete random-sampling training strategies, which limits their ability to leverage continuous temporal information; and 3) the distribution shift arising from uneven data in streaming training poses challenges for temporal object detection. In this work, we provide the remote sensing event-based object detection dataset (RSEOD), the first remote sensing dataset captured with event cameras that covers a variety of intractable scenarios, offering a novel perspective on object detection in challenging conditions. Additionally, we propose an event-based streaming training strategy that leverages asynchronous event streams to address detection failures caused by prolonged occlusion and defocus. Moreover, we introduce a reversible normalization criterion (RevNorm) to eliminate non-stationary information in temporal data and propose a streaming bidirectional feature pyramid network (SBFPN) that propagates features recursively along the temporal dimension. Extensive experiments on the RSEOD dataset demonstrate that our method achieves 38.1% mAP@0.5:0.95 and 55.8% mAP@0.5, outperforming other state-of-the-art object detection approaches (e.g., YOLOv8, YOLOv10, YOLOv11, DINO, RTDETR, RTDETRv2, SODFormer). The dataset and code are released at https://github.com/Jushl/ESVT .
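
To illustrate the reversible-normalization idea summarized above, the minimal sketch below removes per-sequence mean and standard deviation along the temporal dimension before detection and restores them afterward. It assumes a RevIN-style formulation; the module name RevNormSketch, the tensor layout (batch, time, features), and all parameter choices are illustrative assumptions and are not taken from the paper's exact RevNorm or the released ESVT code (see https://github.com/Jushl/ESVT for the authors' implementation).

import torch
import torch.nn as nn

class RevNormSketch(nn.Module):
    """Hypothetical reversible normalization over the temporal dimension.

    normalize() removes per-sequence statistics (the non-stationary component);
    denormalize() restores them so outputs can be mapped back to the original scale.
    """

    def __init__(self, num_features: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        # Learnable affine parameters, one per feature channel (an assumption).
        self.gamma = nn.Parameter(torch.ones(num_features))
        self.beta = nn.Parameter(torch.zeros(num_features))
        self.mean = None
        self.std = None

    def normalize(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, features); statistics are computed along the time axis.
        self.mean = x.mean(dim=1, keepdim=True).detach()
        self.std = torch.sqrt(x.var(dim=1, keepdim=True, unbiased=False) + self.eps).detach()
        return (x - self.mean) / self.std * self.gamma + self.beta

    def denormalize(self, y: torch.Tensor) -> torch.Tensor:
        # Invert the affine transform and add the removed statistics back.
        return (y - self.beta) / (self.gamma + self.eps) * self.std + self.mean

if __name__ == "__main__":
    layer = RevNormSketch(num_features=8)
    x = torch.randn(2, 16, 8)           # (batch, time, features)
    x_norm = layer.normalize(x)          # stationarized input for a temporal detector
    x_back = layer.denormalize(x_norm)   # round-trip reconstruction check
    print(torch.allclose(x, x_back, atol=1e-4))

In a streaming detector, the normalize step would precede the temporal module (e.g., the recursive feature propagation described for SBFPN) and the denormalize step would follow it, so the network sees stationarized features while predictions remain on the original scale.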