CNN-based and DTW features for human activity recognition on depth maps

Bibliographic Details
Published in	Neural computing & applications Vol. 33; no. 21; pp. 14551 - 14563
Main Authors	Trelinski, Jacek; Kwolek, Bogdan
Format	Journal Article
Language	English
Published	London: Springer London; Springer Nature B.V., 01.11.2021
Subjects	Accuracy; Algorithms; Artificial Intelligence; Artificial neural networks; Classifiers; Computational Biology/Bioinformatics; Computational Science and Engineering; Computer Science; Data Mining and Knowledge Discovery; Datasets; Digital cameras; Feature extraction; Feature recognition; Human activity recognition; Image Processing and Computer Vision; Mathematical analysis; Moving object recognition; Multivariate analysis; Neural networks; Original Article; Probability and Statistics in Computer Science; Probability distribution; Sensors; Statistical analysis; Multivariate time-series; Depth-based human action recognition; Convolutional neural networks; Ensembles
ISSN	0941-0643, 1433-3058
DOI	10.1007/s00521-021-06097-1

More Information
Summary:	We present a new algorithm for human action recognition on raw depth maps. First, for each class we train a separate one-against-all convolutional neural network (CNN) to extract class-specific features representing the person's shape. Each class-specific multivariate time-series is then processed by a Siamese multichannel 1D CNN or a multichannel 1D CNN to determine features representing actions. Next, we calculate statistical features over the nonzero pixels that represent the person's shape in each depth map. From the multivariate time-series of these statistical features we derive Dynamic Time Warping (DTW) features, which are based on the DTW distances between all training time-series. Finally, each class-specific feature vector is concatenated with the DTW feature vector. For each action category we train a multiclass classifier that predicts a probability distribution over class labels. From the pool of such classifiers we select the subset whose ensemble achieves the best classification accuracy. Action recognition is performed by a soft voting ensemble that averages the distributions calculated by the selected classifiers, i.e., those with the largest discriminative power. We demonstrate experimentally that on the MSR-Action3D and UTD-MHAD datasets the proposed algorithm attains promising results and outperforms several state-of-the-art depth-based algorithms.
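
Two steps named in the summary, DTW features built from distances to the training time-series and soft voting over classifier outputs, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (dtw_distance, dtw_features, soft_vote), the Euclidean per-frame cost, and the unconstrained warping path are assumptions made only for illustration, and the record does not specify these details.

```python
import numpy as np

def dtw_distance(a, b):
    """Textbook DTW distance between two multivariate series.

    a, b: arrays of shape (T_a, d) and (T_b, d). The Euclidean
    per-frame cost is an assumption, not taken from the paper.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessors.
            cost[i, j] = d + min(cost[i - 1, j],
                                 cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return cost[n, m]

def dtw_features(series, training_series):
    """Describe one series by its DTW distance to every training series."""
    return np.array([dtw_distance(series, t) for t in training_series])

def soft_vote(prob_list):
    """Average class-probability vectors and return (label, mean vector)."""
    mean = np.asarray(prob_list, dtype=float).mean(axis=0)
    return int(np.argmax(mean)), mean

# Toy data: three 1-D series standing in for per-frame statistics.
a = np.array([[0.0], [1.0], [2.0]])
b = np.array([[0.0], [1.0], [1.0], [2.0]])  # 'a' with one repeated frame
c = a + 1.0
features = dtw_features(a, [a, b, c])
label, mean = soft_vote([[0.6, 0.4], [0.2, 0.8]])
```

In this toy example the series b only repeats one frame of a, so warping aligns them at zero cost, while the shifted series c stays at a positive distance. The pure-Python double loop is quadratic in the series lengths and is meant for readability rather than speed.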