B2C-AFM: Bi-directional Co-Temporal and Cross-Spatial Attention Fusion Model for Human Action Recognition

Bibliographic Details
Published in	IEEE Transactions on Image Processing, Vol. 32, p. 1
Main Authors	Guo, Fangtai, Jin, Tianlei, Zhu, Shiqiang, Xi, Xiangming, Wang, Wen, Meng, Qiwei, Song, Wei, Zhu, Jiakai
Format	Journal Article
Language	English
Published	New York: IEEE, 01.01.2023; The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects	Ablation; B2C-AFM; Color imagery; fusion model; homogeneous modalities; Human action recognition; Human activity recognition; limb flow fields; Optical flow (image analysis)
ISSN	1057-7149, 1941-0042
DOI	10.1109/TIP.2023.3308750

Summary:	Human action recognition is a driving engine of many human-computer interaction applications. Most current research focuses on improving model generalization by integrating multiple homogeneous modalities, including RGB images, human poses, and optical flows. Furthermore, contextual interactions and out-of-context sign languages have been shown to depend on the scene category and on the human per se. Attempts to integrate appearance features and human poses have shown positive results. However, because human poses suffer from spatial errors and temporal ambiguities, existing methods are subject to poor scalability, limited robustness, and sub-optimal models. In this paper, inspired by the assumption that different modalities may maintain temporal consistency and spatial complementarity, we present a novel Bi-directional Co-temporal and Cross-spatial Attention Fusion Model (B2C-AFM). Our model is characterized by an asynchronous fusion strategy for multi-modal features along the temporal and spatial dimensions. In addition, novel explicit motion-oriented pose representations called Limb Flow Fields (Lff) are explored to alleviate the temporal ambiguity of human poses. Experiments on publicly available datasets validate our contributions. Extensive ablation studies show that B2C-AFM achieves robust performance across seen and unseen human actions. The code is publicly available.