Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition
Main Authors | |
---|---|
Format | Journal Article |
Language | English |
Published | 13.07.2023 |
DOI | 10.48550/arxiv.2307.06947 |
Summary: Recent video recognition models utilize Transformer models for long-range spatio-temporal context modeling. Video transformer designs are based on self-attention, which can model global context but at a high computational cost. In comparison, convolutional designs for videos offer an efficient alternative but lack long-range dependency modeling. Towards achieving the best of both designs, this work proposes Video-FocalNet, an effective and efficient architecture for video recognition that models both local and global contexts. Video-FocalNet is based on a spatio-temporal focal modulation architecture that reverses the interaction and aggregation steps of self-attention for better efficiency. Further, the aggregation and interaction steps are both implemented using efficient convolution and element-wise multiplication operations that are computationally less expensive than their self-attention counterparts on video representations. We extensively explore the design space of focal modulation-based spatio-temporal context modeling and demonstrate our parallel spatial and temporal encoding design to be the optimal choice. Video-FocalNets perform favorably against state-of-the-art transformer-based models for video recognition on five large-scale datasets (Kinetics-400, Kinetics-600, SS-v2, Diving-48, and ActivityNet-1.3) at a lower computational cost. Our code/models are released at https://github.com/TalalWasim/Video-FocalNets.
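
The summary describes the core operation: context is first aggregated with efficient convolutions (spatially within each frame and temporally across frames, in parallel branches), and the aggregated context then interacts with a query through element-wise multiplication rather than attention. The sketch below illustrates this idea in PyTorch. The class name, the number of focal levels, the gating scheme, and the choice to sum the spatial and temporal modulators are illustrative assumptions drawn only from the abstract, not the authors' exact implementation; see the linked repository for the official code.

```python
# Minimal sketch of a spatio-temporal focal modulation block (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatioTemporalFocalModulation(nn.Module):
    def __init__(self, dim, focal_levels=2, kernel_size=3, temporal_kernel=3):
        super().__init__()
        self.focal_levels = focal_levels
        # One projection yields the query, the context, and per-level gates.
        self.proj_in = nn.Linear(dim, 2 * dim + (focal_levels + 1))
        # Spatial aggregation: depth-wise 2D convolutions, one per focal level.
        self.spatial_convs = nn.ModuleList(
            [nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
             for _ in range(focal_levels)])
        # Temporal aggregation: depth-wise 1D convolutions along the frame axis.
        self.temporal_convs = nn.ModuleList(
            [nn.Conv1d(dim, dim, temporal_kernel, padding=temporal_kernel // 2, groups=dim)
             for _ in range(focal_levels)])
        self.h_spatial = nn.Conv2d(dim, dim, 1)   # modulator projection (spatial)
        self.h_temporal = nn.Conv1d(dim, dim, 1)  # modulator projection (temporal)
        self.proj_out = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, T, H, W, C) video tokens.
        B, T, H, W, C = x.shape
        q, ctx, gates = torch.split(self.proj_in(x), [C, C, self.focal_levels + 1], dim=-1)

        # Spatial branch: hierarchical aggregation within each frame.
        ctx_s = ctx.permute(0, 1, 4, 2, 3).reshape(B * T, C, H, W)
        gates_s = gates.permute(0, 1, 4, 2, 3).reshape(B * T, -1, H, W)
        agg_s, level = 0, ctx_s
        for l, conv in enumerate(self.spatial_convs):
            level = F.gelu(conv(level))
            agg_s = agg_s + level * gates_s[:, l:l + 1]
        agg_s = agg_s + level.mean(dim=(2, 3), keepdim=True) * gates_s[:, -1:]
        mod_s = self.h_spatial(agg_s).reshape(B, T, C, H, W).permute(0, 1, 3, 4, 2)

        # Temporal branch: hierarchical aggregation across frames per location.
        ctx_t = ctx.permute(0, 2, 3, 4, 1).reshape(B * H * W, C, T)
        gates_t = gates.permute(0, 2, 3, 4, 1).reshape(B * H * W, -1, T)
        agg_t, level = 0, ctx_t
        for l, conv in enumerate(self.temporal_convs):
            level = F.gelu(conv(level))
            agg_t = agg_t + level * gates_t[:, l:l + 1]
        agg_t = agg_t + level.mean(dim=2, keepdim=True) * gates_t[:, -1:]
        mod_t = self.h_temporal(agg_t).reshape(B, H, W, C, T).permute(0, 4, 1, 2, 3)

        # Interaction: element-wise modulation of the query by both contexts
        # (summing the two modulators is an assumption of this sketch).
        return self.proj_out(q * (mod_s + mod_t))


# Usage: a batch of 2 clips, 8 frames of 14x14 tokens, 96 channels.
x = torch.randn(2, 8, 14, 14, 96)
out = SpatioTemporalFocalModulation(96)(x)
print(out.shape)  # torch.Size([2, 8, 14, 14, 96])
```

Because aggregation uses only depth-wise convolutions and interaction uses only element-wise products, the cost grows linearly with the number of tokens, in contrast to the quadratic cost of self-attention over the same spatio-temporal grid.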