Dual-Excitation Spatial-Temporal Graph Convolution Network for Skeleton-Based Action Recognition

Bibliographic Details
Published in: IEEE Sensors Journal, Vol. 24, No. 6, pp. 8184-8196
Main Authors: Lu, Jian; Huang, Tingting; Zhao, Bo; Chen, Xiaogai; Zhou, Jian; Zhang, Kaibing
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 15.03.2024
ISSN: 1530-437X, 1558-1748
DOI: 10.1109/JSEN.2024.3354922

More Information
Summary: The crucial issue in current methods for skeleton-based action recognition is how to comprehensively capture the evolving features of global context information and temporal dynamics, and how to extract discriminative representations from skeleton joints and body parts. To address these issues, this article proposes a dual-excitation spatial-temporal graph convolutional method. The method adopts a pyramid aggregation structure formed through group convolution, resulting in a pyramid channel-split graph convolution module. The objective is to integrate context information at different scales by splitting channels, facilitating the interaction of information of different dimensions between channels and establishing dependencies between channels. Subsequently, a motion excitation module is introduced, which activates motion-sensitive channels by grouping feature channels and computing feature differences along the temporal dimension. This forces the model to focus on discriminative features that reflect motion changes. Additionally, a dual attention mechanism is proposed to highlight key joints and body parts within the overall skeleton action sequence, leading to a more interpretable representation of diverse action sequences. On the NTU RGB+D 60 dataset, the cross-subject (X-Sub) and cross-view (X-View) accuracies reach 91.6% and 96.9%, respectively. On the NTU RGB+D 120 dataset, the X-Sub and cross-setup (X-Set) accuracies are 87.5% and 88.5%, respectively, outperforming other methods and highlighting the effectiveness of the proposed approach.
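As a rough illustration of the motion-excitation idea described in the summary (a minimal sketch of the generic pattern, not the authors' exact module), the snippet below shows how feature differences along the temporal dimension of a skeleton feature tensor of shape (batch, channels, frames, joints) can be turned into channel-wise gating weights. The class name MotionExcitation, the reduction ratio, and the layer layout are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionExcitation(nn.Module):
    # Hypothetical sketch: squeeze channels, difference features between
    # consecutive frames, and gate motion-sensitive channels with the result.
    def __init__(self, channels, reduction=16):
        super().__init__()
        mid = max(channels // reduction, 1)
        self.squeeze = nn.Conv2d(channels, mid, kernel_size=1)   # reduce channel width
        self.expand = nn.Conv2d(mid, channels, kernel_size=1)    # restore channel width
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x: (N, C, T, V) = batch, channels, frames, joints
        feat = self.squeeze(x)                                   # (N, C/r, T, V)
        # temporal feature difference: frame t+1 minus frame t
        diff = feat[:, :, 1:, :] - feat[:, :, :-1, :]            # (N, C/r, T-1, V)
        diff = F.pad(diff, (0, 0, 0, 1))                         # pad last frame to keep T
        attn = self.sigmoid(self.expand(diff.mean(dim=-1, keepdim=True)))  # (N, C, T, 1)
        return x + x * attn                                      # excite motion-sensitive channels

# Minimal usage example on a random skeleton feature map.
if __name__ == "__main__":
    x = torch.randn(2, 64, 32, 25)    # 2 clips, 64 channels, 32 frames, 25 joints
    out = MotionExcitation(64)(x)
    print(out.shape)                  # torch.Size([2, 64, 32, 25])
```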