Visual tracking by matching points using diffusion model

Bibliographic Details
Published in: Alexandria Engineering Journal, Vol. 127, pp. 787-803
Main Authors: Alansari, Mohamad; Ganapathi, Iyyakutti Iyappan; Alansari, Sara; Al Marzouqi, Hasan; Javed, Sajid
Format: Journal Article
Language: English
Published: Elsevier B.V., 01.08.2025
ISSN: 1110-0168, 2090-2670
DOI: 10.1016/j.aej.2025.06.042

Summary: Existing Siamese and Transformer-based trackers commonly approach Visual Object Tracking (VOT) as a one-shot detection problem, in which the target object is located in a single forward evaluation. While effective, this approach lacks self-correction mechanisms, making these trackers prone to drifting toward visually similar distractors. To overcome these limitations, we reframe VOT as a spatio-temporal region prediction and segmentation task and propose Stable-SAM2, a novel two-stage tracking framework. In the first stage, we optimize the text embeddings in Stable Diffusion to enforce consistent attention to the target's spatio-temporal regions by maximizing the cross-attention responses at the query location across frames. These optimized embeddings are used to generate spatio-temporal attention maps that highlight the target object across video frames. In the second stage, the predicted regions are fed into the Segment Anything Model 2 (SAM2), which refines them into accurate per-frame segmentation masks. These masks are then converted into bounding boxes for VOT. We evaluate Stable-SAM2 on six widely recognized and diverse benchmarks: LaSOT, LaSOT-ext, TrackingNet, TNL2K, OTB99-Lang, and GOT-10k. Extensive experiments demonstrate that Stable-SAM2 delivers performance that is superior or competitive relative to supervised state-of-the-art (SOTA) trackers, all without relying on complex VOT-specific training paradigms or large-scale training datasets. The source code of Stable-SAM2 is publicly available at: https://github.com/HamadYA/Stable-SAM2.
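
For illustration only, the following is a minimal, self-contained PyTorch sketch of the two stages the summary describes. It assumes a toy dot-product cross-attention as a stand-in for Stable Diffusion's UNet cross-attention, and a plain mask-to-box conversion in place of SAM2's output handling; all function and variable names here (cross_attention_map, optimize_embedding, mask_to_box) are hypothetical and not taken from the authors' released code.

import torch

def cross_attention_map(text_emb, frame_feats):
    # Toy stand-in for Stable Diffusion's UNet cross-attention:
    # one attention map per frame, from a text embedding (C,)
    # to spatial frame features (T, C, H, W).
    T, C, H, W = frame_feats.shape
    keys = frame_feats.flatten(2)                       # (T, C, H*W)
    scores = torch.einsum("c,tcs->ts", text_emb, keys) / C ** 0.5
    return scores.softmax(dim=-1).view(T, H, W)         # (T, H, W)

def optimize_embedding(frame_feats, query_xy, steps=200, lr=1e-2):
    # Stage 1 (sketch): adjust the text embedding so that the
    # cross-attention response at the query location is maximized
    # in every frame, as the summary describes.
    C = frame_feats.shape[1]
    text_emb = torch.randn(C, requires_grad=True)
    opt = torch.optim.Adam([text_emb], lr=lr)
    x, y = query_xy
    for _ in range(steps):
        attn = cross_attention_map(text_emb, frame_feats)
        loss = -attn[:, y, x].log().mean()   # push attention mass to (x, y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return text_emb.detach()

def mask_to_box(mask):
    # Stage 2 post-processing: convert a binary segmentation mask
    # (e.g., from SAM2) into an axis-aligned box (x0, y0, x1, y1).
    ys, xs = torch.nonzero(mask, as_tuple=True)
    return xs.min().item(), ys.min().item(), xs.max().item(), ys.max().item()

# Usage with random features standing in for a 4-frame clip:
feats = torch.randn(4, 64, 32, 32)
emb = optimize_embedding(feats, query_xy=(16, 16))
maps = cross_attention_map(emb, feats)
box = mask_to_box(maps[0] > maps[0].mean())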