GOFENet: A Hybrid Transformer–CNN Network Integrating GEOBIA-Based Object Priors for Semantic Segmentation of Remote Sensing Images

Bibliographic Details
Published in: Remote Sensing (Basel, Switzerland), Vol. 17, No. 15, p. 2652
Main Authors: He, Tao; Chen, Jianyu; Pan, Delu
Format: Journal Article
Language: English
Published: Basel: MDPI AG, 31.07.2025
ISSN: 2072-4292
DOI: 10.3390/rs17152652

Summary: Geographic object-based image analysis (GEOBIA) has demonstrated substantial utility in remote sensing tasks. However, its integration with deep learning remains largely confined to image-level classification, primarily because the irregular shapes and fragmented boundaries of segmented objects limit its applicability in semantic segmentation. While convolutional neural networks (CNNs) excel at local feature extraction, they inherently struggle to capture long-range dependencies. In contrast, Transformer-based models are well suited to global context modeling but often lack fine-grained local detail. To overcome these limitations, we propose GOFENet (Geo-Object Feature Enhanced Network), a hybrid semantic segmentation architecture that fuses object-level priors into deep feature representations. GOFENet employs a dual-encoder design combining CNN and Swin Transformer architectures, enabling multi-scale feature fusion through skip connections to preserve both local and global semantics. An auxiliary branch with cascaded atrous convolutions injects segmented-object information into the learning process. Furthermore, we develop a cross-channel selection module (CSM) for refined channel-wise attention, a feature enhancement module (FEM) to merge global and local representations, and a shallow-deep feature fusion module (SDFM) to integrate pixel- and object-level cues across scales. Experimental results on the GID and LoveDA datasets demonstrate that GOFENet achieves superior segmentation performance, reaching 66.02% and 51.92% mIoU, respectively. The model delineates large-scale land cover features well, producing sharper object boundaries and reducing classification noise while preserving the integrity and discriminability of land cover categories.
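
The architectural ideas summarized above (a local CNN branch, a global context branch, an auxiliary cascaded atrous-convolution branch fed with GEOBIA object priors, and shallow-deep feature fusion) can be illustrated with a minimal PyTorch sketch. All module names, channel widths, and the toy encoders below are assumptions made for illustration; in particular, the Swin Transformer encoder is replaced by a crude pooled-convolution stand-in and the object prior is a rasterized segment-ID map. This is not the authors' implementation.

# Illustrative sketch only: dual-branch features plus an object-prior branch,
# fused by a simple shallow-deep fusion head. Names and sizes are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CascadedAtrousBranch(nn.Module):
    """Auxiliary branch: cascaded dilated convs over an object-prior map."""
    def __init__(self, in_ch=1, out_ch=64, dilations=(1, 2, 4)):
        super().__init__()
        layers, ch = [], in_ch
        for d in dilations:
            layers += [nn.Conv2d(ch, out_ch, 3, padding=d, dilation=d),
                       nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
            ch = out_ch
        self.body = nn.Sequential(*layers)

    def forward(self, obj_map):
        return self.body(obj_map)


class ShallowDeepFusion(nn.Module):
    """Fuse a shallow (high-resolution) and a deep (low-resolution) feature map."""
    def __init__(self, shallow_ch, deep_ch, out_ch):
        super().__init__()
        self.proj = nn.Conv2d(shallow_ch + deep_ch, out_ch, 1)

    def forward(self, shallow, deep):
        deep_up = F.interpolate(deep, size=shallow.shape[-2:],
                                mode="bilinear", align_corners=False)
        return self.proj(torch.cat([shallow, deep_up], dim=1))


class DualEncoderSegNet(nn.Module):
    """Toy dual-encoder segmentation net (local CNN + global context branch)."""
    def __init__(self, num_classes=6):
        super().__init__()
        # Local branch: small CNN encoder (stand-in for the paper's CNN encoder).
        self.local = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        # Global branch: strided conv + pooling as a crude stand-in for Swin.
        self.global_ctx = nn.Sequential(
            nn.Conv2d(3, 128, 7, stride=4, padding=3), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(16), nn.Conv2d(128, 128, 1))
        self.obj_branch = CascadedAtrousBranch(in_ch=1, out_ch=64)
        self.fuse = ShallowDeepFusion(shallow_ch=128 + 64, deep_ch=128, out_ch=128)
        self.head = nn.Conv2d(128, num_classes, 1)

    def forward(self, image, obj_map):
        local = self.local(image)                       # 1/4-resolution local features
        glob = self.global_ctx(image)                   # coarse global context
        prior = self.obj_branch(obj_map)                # object-prior features
        prior = F.interpolate(prior, size=local.shape[-2:],
                              mode="bilinear", align_corners=False)
        fused = self.fuse(torch.cat([local, prior], dim=1), glob)
        logits = self.head(fused)
        return F.interpolate(logits, size=image.shape[-2:],
                             mode="bilinear", align_corners=False)


if __name__ == "__main__":
    net = DualEncoderSegNet(num_classes=6)
    img = torch.randn(1, 3, 256, 256)                        # RGB tile
    objs = torch.randint(0, 50, (1, 1, 256, 256)).float()    # rasterized segment IDs
    print(net(img, objs).shape)                              # torch.Size([1, 6, 256, 256])
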