GroupMamba: Efficient Group-Based Visual State Space Model
Format | Journal Article |
Language | English |
Published | 18.07.2024 |
DOI | 10.48550/arxiv.2407.13772 |
Summary: State-space models (SSMs) have recently shown promise in capturing long-range dependencies with subquadratic computational complexity, making them attractive for various applications. However, purely SSM-based models face critical challenges related to stability and achieving state-of-the-art performance in computer vision tasks. Our paper addresses the challenges of scaling SSM-based models for computer vision, particularly the instability and inefficiency of large model sizes. We introduce a parameter-efficient Modulated Group Mamba layer that divides the input channels into four groups and applies our proposed SSM-based efficient Visual Single Selective Scanning (VSSS) block independently to each group, with each VSSS block scanning in one of the four spatial directions. The Modulated Group Mamba layer also wraps the four VSSS blocks in a channel modulation operator to improve cross-channel communication. Furthermore, we introduce a distillation-based training objective to stabilize the training of large models, leading to consistent performance gains. Our comprehensive experiments demonstrate the merits of the proposed contributions, leading to superior performance over existing methods for image classification on ImageNet-1K, object detection and instance segmentation on MS-COCO, and semantic segmentation on ADE20K. Our tiny variant with 23M parameters achieves state-of-the-art performance with a classification top-1 accuracy of 83.3% on ImageNet-1K, while being 26% more parameter-efficient than the best existing Mamba design of the same model size. Code and models are available at: https://github.com/Amshaker/GroupMamba.
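
The grouping and direction assignment described in the summary can be illustrated with a short PyTorch sketch. The names `VSSSBlock` and `ModulatedGroupMambaLayer`, the linear stand-in for the selective scan, and the sigmoid gate used for channel modulation are assumptions for illustration only; the linked repository is the authoritative implementation.

```python
# Minimal sketch, assuming a (B, H, W, C) feature map: split channels into
# 4 groups, scan each group in one of the 4 spatial directions, then apply a
# channel-modulation gate to restore cross-channel communication.
import torch
import torch.nn as nn


class VSSSBlock(nn.Module):
    """Stand-in for the SSM-based Visual Single Selective Scanning block.

    A real implementation runs a Mamba-style selective scan over the token
    sequence; a linear mixer is used here only to keep the sketch runnable.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.mixer = nn.Linear(dim, dim)

    def forward(self, x):  # x: (B, L, C_group)
        return self.mixer(x)


class ModulatedGroupMambaLayer(nn.Module):
    """Split channels into 4 groups, scan each group along one spatial
    direction, then gate the concatenated output over channels."""

    def __init__(self, dim: int):
        super().__init__()
        assert dim % 4 == 0, "channel dimension must split into 4 equal groups"
        self.group_dim = dim // 4
        self.blocks = nn.ModuleList([VSSSBlock(self.group_dim) for _ in range(4)])
        # Channel modulation gate (an assumption; the paper's operator may differ).
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.proj = nn.Linear(dim, dim)

    @staticmethod
    def _directional_views(x):
        """Return the four flattened scan orders for a (B, H, W, C) tensor."""
        B, H, W, C = x.shape
        lr = x.reshape(B, H * W, C)                       # left-to-right, row-major
        rl = torch.flip(lr, dims=[1])                     # right-to-left
        tb = x.permute(0, 2, 1, 3).reshape(B, H * W, C)   # top-to-bottom, column-major
        bt = torch.flip(tb, dims=[1])                     # bottom-to-top
        return [lr, rl, tb, bt]

    def forward(self, x):  # x: (B, H, W, C)
        B, H, W, C = x.shape
        outs = []
        for g, (feat, block) in enumerate(zip(torch.chunk(x, 4, dim=-1), self.blocks)):
            seq = self._directional_views(feat)[g]        # one direction per group
            out = block(seq)
            if g in (1, 3):                               # undo the sequence flip
                out = torch.flip(out, dims=[1])
            if g in (2, 3):                               # undo the column-major ordering
                out = out.reshape(B, W, H, -1).permute(0, 2, 1, 3).reshape(B, H * W, -1)
            outs.append(out)
        y = torch.cat(outs, dim=-1)                       # (B, H*W, C)
        gate = self.gate(y.mean(dim=1, keepdim=True))     # channel modulation
        y = self.proj(y * gate)
        return y.reshape(B, H, W, C)


# Usage example with hypothetical sizes.
layer = ModulatedGroupMambaLayer(dim=64)
feats = torch.randn(2, 14, 14, 64)   # (batch, height, width, channels)
out = layer(feats)                   # same shape as the input
```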
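
The distillation-based training objective mentioned in the summary can be sketched in its standard form: a weighted combination of cross-entropy on ground-truth labels and a KL-divergence term against a pretrained teacher's temperature-softened logits. The function name, temperature, and weighting below are illustrative assumptions and may differ from the paper's exact formulation.

```python
# Hedged sketch of a distillation-style objective for stabilizing training
# of a larger student model against a pretrained teacher.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, targets,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Blend hard-label cross-entropy with soft-label distillation."""
    # Supervised term on the ground-truth labels.
    ce = F.cross_entropy(student_logits, targets)
    # Soft-target term: match the teacher's temperature-scaled distribution.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return (1.0 - alpha) * ce + alpha * kd
```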