Semantic Decomposition and Recognition of Long and Complex Manipulation Action Sequences

Bibliographic Details
Published in	International Journal of Computer Vision, Vol. 122, no. 1, pp. 84-115
Main Authors	Aksoy, Eren Erdal; Orhan, Adil; Wörgötter, Florentin
Format	Journal Article
Language	English
Published	New York: Springer US, 01.03.2017 (Springer; Springer Nature B.V.)
Subjects	Artificial Intelligence; Chains; Computer Imaging; Computer Science; Computer vision; Decomposition; Employee motivation; Human body; Image Processing and Computer Vision; Image processing systems; Invariants; Motion detectors; Object recognition; Pattern Recognition; Pattern Recognition and Graphics; Recognition; Robustness; Semantics; Studies; Tasks; Taxonomy; Vision; Vision systems; United States > US; Action recognition; Semantic decomposition; Temporal segmentation; Semantic event chain; Manipulation action
ISSN	0920-5691, 1573-1405
DOI	10.1007/s11263-016-0956-8

Summary:	Understanding continuous human actions is a non-trivial but important problem in computer vision. Although there exists a large corpus of work in the recognition of action sequences, most approaches suffer from problems relating to vast variations in motions, action combinations, and scene contexts. In this paper, we introduce a novel method for semantic segmentation and recognition of long and complex manipulation action tasks, such as “preparing a breakfast” or “making a sandwich”. We represent manipulations with our recently introduced “Semantic Event Chain” (SEC) concept, which captures the underlying spatiotemporal structure of an action invariant to motion, velocity, and scene context. Solely based on the spatiotemporal interactions between manipulated objects and hands in the extracted SEC, the framework automatically parses individual manipulation streams performed either sequentially or concurrently. Using event chains, our method further extracts basic primitive elements of each parsed manipulation. Without requiring any prior object knowledge, the proposed framework can also extract object-like scene entities that exhibit the same role in semantically similar manipulations. We conduct extensive experiments on various recent datasets to validate the robustness of the framework.