A three-stream fusion network for 3D skeleton-based action recognition

Bibliographic Details
Published in: Multimedia Systems, Vol. 31, No. 3, p. 176
Main Authors: Fang, Ming; Liu, Qi; Ren, Jianping; Li, Jie; Du, Xinning; Liu, Shuhua
Format: Journal Article
Language: English
Published: Berlin/Heidelberg: Springer Berlin Heidelberg (Springer Nature B.V.), 01.04.2025
ISSN: 0942-4962, 1432-1882
DOI: 10.1007/s00530-025-01762-0

More Information
Summary: Skeleton-based action recognition has advanced rapidly in recent years; however, the spatial-temporal features between joints have not been systematically explored. In this study, we propose a three-stream fusion network (3sFN) that aims to fully learn action features by mining multiangle and multiscale spatial-temporal information between joints. The proposed model first uses a Spatial-Temporal Graph Convolutional Network (ST-GCN) and Res2NeXt-50 to obtain motion spatial-temporal information from the spatial-temporal graph, the SkeleMotion image, and the tree-structure-reference-joints image (TSRJI), all of which are derived from human skeleton data. The spatial-temporal features learned independently by these three streams are then fine-tuned and fused to exploit their complementarity and diversity. 3sFN explicitly models the temporal and spatial information of an action separately, addressing a shortcoming of ST-GCN, which extracts only a single combined spatial-temporal feature. The model is evaluated on the NTU RGB+D 60 and NTU RGB+D 120 datasets, reaching accuracies of 97.43% on Cross-Subject and 99.19% on Cross-View for the former, and 96.35% on Cross-Subject and 97.34% on Cross-Setup for the latter. The experimental results show that the proposed model achieves state-of-the-art action recognition performance.
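To make the fusion idea concrete, below is a minimal, hypothetical PyTorch sketch of a three-stream model with score-level fusion. It is not the authors' published implementation: the per-stream backbones (in the paper, ST-GCN on the spatial-temporal graph and Res2NeXt-50 on the SkeleMotion and TSRJI images) are replaced here by small placeholder classifiers over pre-extracted features, and the learnable softmax fusion weights are an assumed fusion rule, chosen only to illustrate how independently learned stream scores can be combined.

# Hypothetical sketch of a three-stream score-fusion classifier (not the authors' code).
import torch
import torch.nn as nn

class StreamHead(nn.Module):
    """Placeholder for one stream's backbone + classifier (stands in for ST-GCN or Res2NeXt-50)."""
    def __init__(self, in_dim: int, num_classes: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class ThreeStreamFusion(nn.Module):
    """Combines per-stream class scores with learnable softmax-normalized weights."""
    def __init__(self, dims, num_classes: int):
        super().__init__()
        self.streams = nn.ModuleList([StreamHead(d, num_classes) for d in dims])
        # One learnable fusion weight per stream (assumed fusion rule, not from the paper).
        self.weights = nn.Parameter(torch.ones(len(dims)))

    def forward(self, inputs):
        # inputs: list of three feature tensors, one per stream, each of shape (batch, dim).
        scores = torch.stack([head(x) for head, x in zip(self.streams, inputs)], dim=0)
        w = torch.softmax(self.weights, dim=0).view(-1, 1, 1)
        return (w * scores).sum(dim=0)  # fused class scores, shape (batch, num_classes)

if __name__ == "__main__":
    model = ThreeStreamFusion(dims=(128, 128, 128), num_classes=60)  # 60 classes as in NTU RGB+D 60
    graph_feat = torch.randn(4, 128)        # stand-in for spatial-temporal graph stream features
    skelemotion_feat = torch.randn(4, 128)  # stand-in for SkeleMotion stream features
    tsrji_feat = torch.randn(4, 128)        # stand-in for TSRJI stream features
    logits = model([graph_feat, skelemotion_feat, tsrji_feat])
    print(logits.shape)  # torch.Size([4, 60])

Score-level fusion is one common way to combine independently trained streams; the paper's exact fine-tuning and fusion procedure may differ from this sketch.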