A three-stream fusion network for 3D skeleton-based action recognition

Bibliographic Details
Published in: Multimedia Systems, Vol. 31, No. 3, p. 176
Main Authors: Fang, Ming; Liu, Qi; Ren, Jianping; Li, Jie; Du, Xinning; Liu, Shuhua
Format: Journal Article
Language: English
Published: Berlin/Heidelberg: Springer Berlin Heidelberg (Springer Nature B.V.), 01.04.2025
ISSN: 0942-4962, 1432-1882
DOI: 10.1007/s00530-025-01762-0

More Information
Summary: Skeleton-based action recognition has advanced rapidly in recent years; however, the spatial-temporal features between joints have not been systematically explored. In this study, we propose a three-stream fusion network (3sFN) that aims to fully learn action features by mining multiangle and multiscale spatial-temporal information between joints. The proposed model first uses a Spatial-Temporal Graph Convolutional Network (ST-GCN) and Res2NeXt-50 to obtain motion spatial-temporal information from the spatial-temporal graph, the SkeleMotion image, and the tree-structure-reference-joints image (TSRJI), all of which are derived from human skeleton data. The spatial-temporal features learned independently by these three streams are then fine-tuned and fused to exploit their complementarity and diversity. 3sFN explicitly models the temporal and spatial information of an action separately, addressing a shortcoming of ST-GCN, which extracts only a single combined spatial-temporal feature. The model is evaluated on the NTU RGB+D 60 and NTU RGB+D 120 datasets, reaching accuracies of 97.43% on Cross-Subject and 99.19% on Cross-View for the former, and 96.35% on Cross-Subject and 97.34% on Cross-Setup for the latter. The experimental results show that the proposed model achieves state-of-the-art action recognition performance.
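To make the fusion idea concrete, below is a minimal, hypothetical PyTorch sketch of a three-stream model with score-level fusion. It is not the authors' published implementation: the per-stream backbones (in the paper, ST-GCN on the spatial-temporal graph and Res2NeXt-50 on the SkeleMotion and TSRJI images) are replaced here by small placeholder classifiers over pre-extracted features, and the learnable softmax fusion weights are an assumed fusion rule, chosen only to illustrate how independently learned stream scores can be combined.

# Hypothetical sketch of a three-stream score-fusion classifier (not the authors' code).
import torch
import torch.nn as nn

class StreamHead(nn.Module):
    """Placeholder for one stream's backbone + classifier (stands in for ST-GCN or Res2NeXt-50)."""
    def __init__(self, in_dim: int, num_classes: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class ThreeStreamFusion(nn.Module):
    """Combines per-stream class scores with learnable softmax-normalized weights."""
    def __init__(self, dims, num_classes: int):
        super().__init__()
        self.streams = nn.ModuleList([StreamHead(d, num_classes) for d in dims])
        # One learnable fusion weight per stream (assumed fusion rule, not from the paper).
        self.weights = nn.Parameter(torch.ones(len(dims)))

    def forward(self, inputs):
        # inputs: list of three feature tensors, one per stream, each of shape (batch, dim).
        scores = torch.stack([head(x) for head, x in zip(self.streams, inputs)], dim=0)
        w = torch.softmax(self.weights, dim=0).view(-1, 1, 1)
        return (w * scores).sum(dim=0)  # fused class scores, shape (batch, num_classes)

if __name__ == "__main__":
    model = ThreeStreamFusion(dims=(128, 128, 128), num_classes=60)  # 60 classes as in NTU RGB+D 60
    graph_feat = torch.randn(4, 128)        # stand-in for spatial-temporal graph stream features
    skelemotion_feat = torch.randn(4, 128)  # stand-in for SkeleMotion stream features
    tsrji_feat = torch.randn(4, 128)        # stand-in for TSRJI stream features
    logits = model([graph_feat, skelemotion_feat, tsrji_feat])
    print(logits.shape)  # torch.Size([4, 60])

Score-level fusion is one common way to combine independently trained streams; the paper's exact fine-tuning and fusion procedure may differ from this sketch.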