Multilevel Depth and Image Fusion for Human Activity Detection

Bibliographic Details
Published in	IEEE Transactions on Cybernetics, Vol. 43, no. 5, pp. 1383-1394
Main Authors	Bingbing Ni, Yong Pei, Pierre Moulin, Shuicheng Yan
Format	Journal Article
Language	English
Published	United States: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 01.10.2013
Subjects	Accuracy; Actigraphy - instrumentation; Actigraphy - methods; Action recognition and localization; Algorithms; Artificial Intelligence; Computer Peripherals; Computer Simulation; Computer Systems; Context modeling; depth sensor; Feature extraction; Gray-scale; Human motion; Humans; Image detection; Image Enhancement - instrumentation; Image Enhancement - methods; Image processing; Image recognition; Imaging, Three-Dimensional - methods; Joints; Pattern Recognition, Automated - methods; Position (location); Recognition; spatial and temporal context; Studies; Subtraction Technique; Transducers; Video; Video Games; Visual; Visualization; Whole Body Imaging - instrumentation; Whole Body Imaging - methods
ISSN	2168-2267, 2168-2275
DOI	10.1109/TCYB.2013.2276433

More Information
Summary:	Recognizing complex human activities usually requires detecting and modeling individual visual features and the interactions between them. Current methods rely only on visual features extracted from 2-D images, which often leads to unreliable detection of salient visual features and inaccurate modeling of the interaction context between individual features. In this paper, we show that these problems can be addressed by combining data from a conventional camera and a depth sensor (e.g., Microsoft Kinect). We propose a novel complex activity recognition and localization framework that fuses information from both grayscale and depth image channels at multiple levels of the video processing pipeline. At the individual visual feature detection level, depth-based filters are applied to the detected human/object rectangles to remove false detections. At the next level, interaction modeling, 3-D spatial and temporal contexts among human subjects or objects are extracted by integrating information from both grayscale and depth images. Depth information is also used to distinguish different types of indoor scenes. Finally, a latent structural model is developed to integrate the information from multiple levels of video processing for activity detection. Extensive experiments on two activity recognition benchmarks (one with depth information) and a challenging grayscale + depth human activity database that contains complex human-human, human-object, and human-surroundings interactions demonstrate the effectiveness of the proposed multilevel grayscale + depth fusion scheme, which achieves higher recognition and localization accuracy than previous methods.
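
Illustration (not from the paper): the summary does not say how the depth-based filters work, so the following is only a minimal sketch of one plausible filter of this kind. It assumes a pinhole camera with a known focal length in pixels, a depth map in metres with 0 marking missing readings, and a plausible standing-human height range of 1.0 to 2.2 m. The function names, parameters, and thresholds are all hypothetical.

```python
from statistics import median


def physical_height_m(box, depth_map, focal_px):
    """Estimate the real-world height (metres) of a detection rectangle.

    box is (x, y, w, h) in pixels; depth_map is a list of rows holding
    depth in metres, with 0 marking a missing reading (assumed convention).
    """
    x, y, w, h = box
    depths = [d for row in depth_map[y:y + h] for d in row[x:x + w] if d > 0]
    if not depths:
        return None  # no valid depth inside the rectangle
    z = median(depths)  # median tolerates a minority of background pixels
    return h * z / focal_px  # pinhole model: size = pixels * depth / focal


def filter_detections(boxes, depth_map, focal_px, min_h=1.0, max_h=2.2):
    """Keep only rectangles whose estimated height is plausible for a person.

    min_h and max_h are illustrative thresholds, not values from the paper.
    """
    kept = []
    for box in boxes:
        height = physical_height_m(box, depth_map, focal_px)
        if height is not None and min_h <= height <= max_h:
            kept.append(box)
    return kept
```

Calling `filter_detections(boxes, depth_map, focal_px)` on the rectangles returned by a 2-D detector discards candidates whose size is inconsistent with their measured distance, which is the general idea of using depth to remove false detections.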