Improving skeleton-based action recognition with interactive object information

Bibliographic Details
Published in	International journal of multimedia information retrieval Vol. 14; no. 1; p. 3
Main Authors	Wen, Hao; Lu, Ziqian; Shen, Fengli; Lu, Zhe-Ming; Cui, Jialin
Format	Journal Article
Language	English
Published	London: Springer London, 01.03.2025; Springer Nature B.V.
Subjects	Activity recognition; Artificial neural networks; Computer Science; Data augmentation; Data Mining and Knowledge Discovery; Database Management; Datasets; Digitization; Factories; Graphs; Image Processing and Computer Vision; Information Storage and Retrieval; Information Systems Applications (incl. Internet); Methods; Moving object recognition; Multimedia Information Systems; Nodes; Performance evaluation; Regular Paper; Action Recognition; Human-Object Interaction; Smart Factory; Graph Convolutional Networks
ISSN	2192-6611 2192-662X
DOI	10.1007/s13735-024-00351-7

Summary:	Human skeleton information is central to skeleton-based action recognition because it provides a simple and efficient description of human pose. However, existing skeleton-based methods focus on the skeleton and ignore the objects that humans interact with, which leads to poor performance on actions that involve object interactions. We propose a new action recognition framework that introduces object nodes to supply the missing interactive object information. We also propose Spatial Temporal Variable Graph Convolutional Networks (ST-VGCN) to effectively model the Variable Graph (VG) containing object nodes. Specifically, to validate the role of interactive object information, we use a simple self-training approach to establish a new dataset, JXGC 24, and an extended dataset, NTU RGB+D+Object 60, including more than 2 million additional object nodes. We also design a Variable Graph construction method that accommodates a variable number of nodes in the graph structure. Additionally, we are the first to explore the overfitting issue introduced by incorporating additional object information, and we propose a VG-based data augmentation method, called Random Node Attack, to address it. Finally, regarding the network structure, we introduce two fusion modules, CAF and WNPool, along with a novel Node Balance Loss, to improve overall performance by effectively fusing and balancing skeleton and object node information. Our method surpasses the previous state of the art on multiple skeleton-based action recognition benchmarks. On NTU RGB+D 60, its accuracy is 96.7% on the cross-subject split and 99.2% on the cross-view split. Project page: https://github.com/moonlight52137/ST-VGCN
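Illustration:	The summary describes a graph that holds skeleton joints plus a variable number of object nodes. The sketch below shows one generic way such a graph could be represented, and is not taken from the paper or its repository. It assumes that object nodes are appended after the joints, that each object is linked to a fixed set of contact joints (for example the hands), and that the graph is padded to a maximum object count with a mask marking real nodes. The function name `build_variable_graph` and all of its parameters are hypothetical.

```python
from typing import List, Tuple


def build_variable_graph(
    num_joints: int,
    bones: List[Tuple[int, int]],
    num_objects: int,
    contact_joints: List[int],
    max_objects: int,
) -> Tuple[List[List[int]], List[int]]:
    """Return (adjacency, node_mask) for a skeleton plus object nodes.

    The graph is padded to num_joints + max_objects nodes so that samples
    with different object counts share one shape; node_mask marks real nodes.
    """
    if num_objects > max_objects:
        raise ValueError("num_objects exceeds max_objects")

    total = num_joints + max_objects
    real = num_joints + num_objects
    adj = [[0] * total for _ in range(total)]
    mask = [1] * real + [0] * (total - real)

    # Self-loops on real nodes only; padded nodes stay fully disconnected.
    for i in range(real):
        adj[i][i] = 1

    # Undirected skeleton bones.
    for a, b in bones:
        adj[a][b] = adj[b][a] = 1

    # Assumption: every object node connects to the given contact joints.
    for k in range(num_objects):
        obj = num_joints + k
        for j in contact_joints:
            adj[obj][j] = adj[j][obj] = 1

    return adj, mask


# Toy example: 4 joints in a chain, joint 3 plays the role of a hand,
# one detected object, room for up to two objects.
bones = [(0, 1), (1, 2), (2, 3)]
adj, mask = build_variable_graph(
    4, bones, num_objects=1, contact_joints=[3], max_objects=2
)
```

In this toy example the graph has six node slots: four joints, one real object node attached to joint 3, and one padded slot that the mask marks as unused. A downstream graph convolution could multiply node features by the mask so that padded slots contribute nothing.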