Spatial Focusing and Progressive Decoupling Detector for High-Aspect-Ratio Rotated Objects

Bibliographic Details
Published in	IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, pp. 1-17
Main Authors	Liu, Zhe; He, Guiqing; Liu, Letian; Dong, Liheng; Jiang, Xiaoyue
Format	Journal Article
Language	English
Published	IEEE 15.10.2025
Subjects	Computer architecture; Computer vision; Convolution; Detectors; Feature extraction; hierarchical decoupling network; Kernel; Object detection; Oriented object detection; Remote sensing; Robustness; Transformers; vision transformer
ISSN	1939-1404 2151-1535
DOI	10.1109/JSTARS.2025.3622338

Summary:	In recent years, remote sensing object detection has advanced significantly through in-depth exploration of convolutional neural network (CNN) and vision transformer (ViT) architectures. However, detecting rotated objects with high aspect ratios remains challenging. Current detection frameworks inadequately address the anisotropic feature distribution caused by such objects: feature information is highly concentrated along one spatial dimension and sparse along the other, and the parameters representing the bounding box exhibit significant feature differences. To address this issue, we propose a Spatial Focusing and Progressive Decoupling Detector (SFPD-Det), which consists of three components: the Spatially Crosswise Convolution Module (SCCM), the Hierarchical Decoupling Network (HDN), and Dynamic Progressive Activation Masks (DPMs). The SCCM captures diverse spatial features with long-range dependencies by combining square convolutions with multi-branch orthogonal large strip convolutions, enhancing the model's adaptability to objects with varying aspect ratios. The HDN is composed of stacked ViT blocks and uses separate network branches to predict the position, angle, and size of bounding boxes in a cascaded manner. Furthermore, by combining the predicted parameters, we propose DPMs that embed the mask information of potential object boundary regions into the HDN and progressively guide the self-attention to enhance critical features within the region of interest, thereby achieving precise bounding box regression. Extensive experiments on four benchmark remote sensing datasets (DOTA, DIOR-R, HRSC2016, and UCAS-AOD) demonstrate that SFPD-Det achieves superior performance compared with state-of-the-art detectors.
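The summary describes masks derived from the predicted position, angle, and size of a box. The record gives no construction details, so the following is only a minimal sketch of the underlying geometry: a plain point-in-rotated-rectangle test that turns those three kinds of parameters into a binary region mask. The function name `rotated_box_mask`, its arguments, and the pixel-centre convention are assumptions made for illustration. It is not the authors' DPM, which additionally feeds such information into self-attention.

```python
import math


def rotated_box_mask(height, width, cx, cy, w, h, theta):
    """Return a height x width grid of 0/1 values (hypothetical helper).

    A cell is 1 when its pixel centre lies inside the rotated box with
    centre (cx, cy), side length w along the box's own first axis, side
    length h along its second axis, and rotation theta (radians) of the
    first axis relative to the image x-axis.
    """
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    mask = []
    for row in range(height):
        mask_row = []
        for col in range(width):
            # Offset of the pixel centre from the box centre.
            dx = (col + 0.5) - cx
            dy = (row + 0.5) - cy
            # Project the offset onto the box's own axes.
            u = dx * cos_t + dy * sin_t
            v = -dx * sin_t + dy * cos_t
            inside = abs(u) <= w / 2 and abs(v) <= h / 2
            mask_row.append(1 if inside else 0)
        mask.append(mask_row)
    return mask


if __name__ == "__main__":
    # A 4:1 box on an 8 x 8 grid, first axis-aligned, then rotated by 45 degrees.
    for angle in (0.0, math.pi / 4):
        for mask_row in rotated_box_mask(8, 8, 4.0, 4.0, 8.0, 2.0, angle):
            print("".join(str(value) for value in mask_row))
        print()
```

For the axis-aligned 4:1 box, only two of the eight rows are marked. This is the kind of anisotropy the summary refers to: the object's extent is concentrated along one spatial dimension and sparse along the other.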