Visual tracking by matching points using diffusion model

Bibliographic Details
Published in: Alexandria Engineering Journal, Vol. 127, pp. 787-803
Main Authors: Alansari, Mohamad; Ganapathi, Iyyakutti Iyappan; Alansari, Sara; Al Marzouqi, Hasan; Javed, Sajid
Format: Journal Article
Language: English
Published: Elsevier B.V., 01.08.2025
ISSN: 1110-0168, 2090-2670
DOI: 10.1016/j.aej.2025.06.042

Summary: Existing Siamese and Transformer-based trackers commonly approach Visual Object Tracking (VOT) as a one-shot detection problem, in which the target object is located in a single forward evaluation. While effective, this approach lacks self-correction mechanisms, making these trackers prone to drifting toward visually similar distractors. To overcome these limitations, we reframe VOT as a spatio-temporal region prediction and segmentation task and propose Stable-SAM2, a novel two-stage tracking framework. In the first stage, we optimize the text embeddings in Stable Diffusion to enforce consistent attention to the target's spatio-temporal regions by maximizing the cross-attention responses at the query location across frames. These optimized embeddings are used to generate spatio-temporal attention maps that highlight the target object across video frames. In the second stage, the predicted regions are fed into the Segment Anything Model 2 (SAM2), which refines them into accurate per-frame segmentation masks. These masks are then converted into bounding boxes for VOT. We evaluate Stable-SAM2 on six widely recognized and diverse benchmarks: LaSOT, LaSOT-ext, TrackingNet, TNL2K, OTB99-Lang, and GOT-10k. Extensive experiments demonstrate that Stable-SAM2 delivers performance that is superior or competitive relative to supervised state-of-the-art (SOTA) trackers, all without relying on complex VOT-specific training paradigms or large-scale training datasets. The source code of Stable-SAM2 is publicly available at: https://github.com/HamadYA/Stable-SAM2.
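
For illustration only, the following is a minimal, self-contained PyTorch sketch of the two stages the summary describes. It assumes a toy dot-product cross-attention as a stand-in for Stable Diffusion's UNet cross-attention, and a plain mask-to-box conversion in place of SAM2's output handling; all function and variable names here (cross_attention_map, optimize_embedding, mask_to_box) are hypothetical and not taken from the authors' released code.

import torch

def cross_attention_map(text_emb, frame_feats):
    # Toy stand-in for Stable Diffusion's UNet cross-attention:
    # one attention map per frame, from a text embedding (C,)
    # to spatial frame features (T, C, H, W).
    T, C, H, W = frame_feats.shape
    keys = frame_feats.flatten(2)                       # (T, C, H*W)
    scores = torch.einsum("c,tcs->ts", text_emb, keys) / C ** 0.5
    return scores.softmax(dim=-1).view(T, H, W)         # (T, H, W)

def optimize_embedding(frame_feats, query_xy, steps=200, lr=1e-2):
    # Stage 1 (sketch): adjust the text embedding so that the
    # cross-attention response at the query location is maximized
    # in every frame, as the summary describes.
    C = frame_feats.shape[1]
    text_emb = torch.randn(C, requires_grad=True)
    opt = torch.optim.Adam([text_emb], lr=lr)
    x, y = query_xy
    for _ in range(steps):
        attn = cross_attention_map(text_emb, frame_feats)
        loss = -attn[:, y, x].log().mean()   # push attention mass to (x, y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return text_emb.detach()

def mask_to_box(mask):
    # Stage 2 post-processing: convert a binary segmentation mask
    # (e.g., from SAM2) into an axis-aligned box (x0, y0, x1, y1).
    ys, xs = torch.nonzero(mask, as_tuple=True)
    return xs.min().item(), ys.min().item(), xs.max().item(), ys.max().item()

# Usage with random features standing in for a 4-frame clip:
feats = torch.randn(4, 64, 32, 32)
emb = optimize_embedding(feats, query_xy=(16, 16))
maps = cross_attention_map(emb, feats)
box = mask_to_box(maps[0] > maps[0].mean())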