ELTrack: Events-Language Description for Visual Object Tracking

Bibliographic Details
Published in: IEEE Access, Vol. 13, pp. 31351-31367
Main Authors Alansari, Mohamad, Alnuaimi, Khaled, Alansari, Sara, Javed, Sajid
Format Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2025
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2025.3540445


Summary: The integration of Natural Language (NL) descriptions into Visual Object Tracking (VOT) has shown promise in enhancing the performance of RGB-based tracking by providing richer, contextually aware information that helps to address issues such as appearance variations, model drift, and ambiguous target representation. However, the growing complexity of VOT tasks, particularly in scenarios involving fast-moving objects and challenging lighting conditions, necessitates more robust and adaptable tracking frameworks. Traditional visual trackers, which rely solely on RGB data, often struggle with these challenges. Event cameras offer a promising solution: because of their very low latency, they capture changes in a scene as they happen, making them highly effective where conventional visible-light cameras fail, such as in low-light environments or when tracking rapid motion. Despite progress in event-based and NL tracking, the fusion of events and NL remains underexplored due to the lack of large-scale NL-described datasets and event-based benchmarks. To address these gaps, we present ELTrack, a novel multi-modal NL-based VOT framework that, to the best of our knowledge, is the first to integrate event data with NL descriptions in VOT. ELTrack synthesizes event data, filters out noise, and applies imprinting and a step decay function to produce a novel event image representation called Pseudo-Frames. Additionally, we generate NL descriptions using a Visual-Language (VL) image-captioning module featuring BLIP-2 and GPT-4. These modalities are seamlessly integrated via a superimpose fusion module to enhance tracking performance. ELTrack is a generic pipeline that can be integrated with any existing state-of-the-art (SoTA) tracker. Extensive experiments demonstrate that ELTrack achieves significantly better performance across a variety of publicly available VOT datasets.
The source code of ELTrack is publicly available at: https://github.com/HamadYA/ELTrack-Correlating-Events-and-Language-for-Visual-Tracking.
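The abstract's Pseudo-Frame step (imprinting events into an image with a step decay function, plus noise filtering) can be illustrated with a minimal sketch. The paper's exact formulation is not given here, so the function below is an assumption-based approximation: the `pseudo_frame` name, the bin-based step decay, and the count-based noise filter are all hypothetical choices, not the authors' implementation.

```python
import numpy as np

def pseudo_frame(events, shape, t_ref, bin_width=5e-3, decay=0.5, min_events=1):
    """Accumulate sparse events into one 2-D Pseudo-Frame-style image.

    Each event (x, y, t, polarity) is imprinted at its pixel with a weight
    that decays in discrete steps as its timestamp moves away from the
    reference time t_ref (a step decay rather than a continuous one).
    Hypothetical sketch: parameters and decay shape are assumptions.
    """
    frame = np.zeros(shape, dtype=np.float32)
    counts = np.zeros(shape, dtype=np.int32)
    for x, y, t, p in events:
        # Step decay: weight drops by `decay` for each full temporal bin
        # separating the event from the reference time.
        step = int(abs(t_ref - t) // bin_width)
        frame[int(y), int(x)] += p * (decay ** step)
        counts[int(y), int(x)] += 1
    # Simple noise filter: suppress pixels supported by too few events.
    frame[counts < min_events] = 0.0
    # Normalize to [-1, 1] so the result can be used as an image channel.
    m = np.abs(frame).max()
    return frame / m if m > 0 else frame
```

A recent event near `t_ref` keeps full weight, while an event two bins away contributes only `decay**2` of its polarity, so fast motion leaves a fading trail rather than a uniform smear.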