Fine-Grained Video Captioning via Graph-based Multi-Granularity Interaction Learning

Bibliographic Details
Published in: IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 44, No. 2, pp. 666-683
Main Authors: Yan, Yichao; Zhuang, Ning; Ni, Bingbing; Zhang, Jian; Xu, Minghao; Zhang, Qiang; Zhang, Zheng; Cheng, Shuo; Tian, Qi; Xu, Yi; Yang, Xiaokang; Zhang, Wenjun
Format: Journal Article
Language: English
Published: United States: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.02.2022
ISSN: 0162-8828, 1939-3539, 2160-9292
DOI: 10.1109/TPAMI.2019.2946823

Summary: Learning to generate detailed, continuous linguistic descriptions of multi-subject interactive videos has direct applications in team-sports auto-narrative. In contrast to traditional video captioning, this task is more challenging because it requires simultaneously modeling fine-grained individual actions, uncovering the spatio-temporal dependency structures of frequent group interactions, and then accurately mapping these complex interaction details into long, detailed commentary. To explicitly address these challenges, we propose a novel framework, Graph-based Learning for Multi-Granularity Interaction Representation (GLMGIR), for the fine-grained team-sports auto-narrative task. A multi-granular interaction modeling module extracts interactive actions among subjects in a progressive way, encoding both intra- and inter-team interactions. Building on these multi-granular representations, a multi-granular attention module produces action/event descriptions at multiple spatio-temporal resolutions. The two modules are integrated seamlessly and work collaboratively to generate the final narrative. To facilitate reproducible research, we also collect a new video dataset from YouTube.com, the Sports Video Narrative dataset (SVN), containing 6K team-sports videos (i.e., NBA basketball games) with 10K ground-truth narratives (i.e., sentences). Furthermore, because existing metrics such as METEOR (used in the coarse-grained video captioning task) do not cope well with fine-grained sports narrative, we develop a novel evaluation metric named Fine-grained Captioning Evaluation (FCE), which measures how accurately the generated linguistic description reflects fine-grained action details as well as the overall spatio-temporal interactional structure. Extensive experiments on our SVN dataset demonstrate the effectiveness of the proposed framework for fine-grained team-sports video auto-narrative.
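To make the intra-/inter-team encoding idea concrete, below is a minimal PyTorch sketch of two-stage graph-based interaction encoding: message passing among players within each team, followed by message passing between team-level summaries. This is an illustrative reconstruction under assumed inputs (per-player features plus a two-team assignment), not the authors' GLMGIR implementation; all names (InteractionEncoder, feat_dim, hid_dim) are hypothetical.

# Minimal sketch (not the authors' code) of graph-based multi-granularity
# interaction encoding, assuming per-player features and a two-team split.
import torch
import torch.nn as nn

class InteractionEncoder(nn.Module):
    """Toy two-stage graph encoder: intra-team message passing first,
    then inter-team message passing over team-level summaries."""
    def __init__(self, feat_dim, hid_dim):
        super().__init__()
        self.intra = nn.Linear(feat_dim, hid_dim)  # per-node message inside a team
        self.inter = nn.Linear(hid_dim, hid_dim)   # message between team summaries
        self.act = nn.ReLU()

    def forward(self, players, team_ids):
        # players: (N, feat_dim) per-player features for one clip
        # team_ids: (N,) 0/1 team assignment
        team_vecs = []
        for t in (0, 1):
            members = players[team_ids == t]      # (n_t, feat_dim)
            msgs = self.act(self.intra(members))  # intra-team messages
            team_vecs.append(msgs.mean(dim=0))    # team-level summary (coarser granularity)
        h = torch.stack(team_vecs)                # (2, hid_dim)
        # fully connected inter-team graph: each team receives the other's summary
        inter = self.act(self.inter(h[[1, 0]]))   # row swap = cross-team edges
        return h + inter                          # multi-granular team representations

enc = InteractionEncoder(feat_dim=16, hid_dim=32)
players = torch.randn(10, 16)             # 10 detected players
team_ids = torch.tensor([0] * 5 + [1] * 5)  # two teams of five
print(enc(players, team_ids).shape)       # torch.Size([2, 32])

A caption decoder could then attend over the per-player features, the team summaries, and the cross-team representation, which is the "multiple spatio-temporal resolutions" idea described in the abstract.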
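Similarly, the following hypothetical score illustrates the flavor of fine-grained captioning evaluation: it mixes generic n-gram overlap (the kind of signal METEOR-style metrics reward) with a recall rate over an assumed action-keyword vocabulary. The paper's actual FCE metric is defined differently; this sketch only shows why surface overlap alone can miss action details.

# Hypothetical, simplified fine-grained captioning score (not the paper's FCE).
def ngram_overlap(cand, ref, n=2):
    # fraction of candidate n-grams that also appear in the reference
    grams = lambda toks: {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    c, r = grams(cand.split()), grams(ref.split())
    return len(c & r) / max(len(c), 1)

ACTIONS = {"dunk", "assist", "block", "three-pointer", "layup"}  # assumed vocabulary

def action_recall(cand, ref):
    # fraction of reference action keywords recovered by the candidate
    ref_actions = ACTIONS & set(ref.lower().split())
    if not ref_actions:
        return 1.0
    return len(ref_actions & set(cand.lower().split())) / len(ref_actions)

def fine_grained_score(cand, ref, alpha=0.5):
    # weighted mix of surface overlap and action-detail recall
    return alpha * ngram_overlap(cand, ref) + (1 - alpha) * action_recall(cand, ref)

ref = "Curry drains a three-pointer off an assist from Green"
cand = "Curry hits a three-pointer after an assist by Green"
print(round(fine_grained_score(cand, ref), 3))  # high action recall despite low bigram overlap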