ELTrack: Events-Language Description for Visual Object Tracking

Bibliographic Details
Published in: IEEE Access, Vol. 13, pp. 31351-31367
Main Authors Alansari, Mohamad, Alnuaimi, Khaled, Alansari, Sara, Javed, Sajid
Format Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2025
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2025.3540445


Summary: The integration of Natural Language (NL) descriptions into Visual Object Tracking (VOT) has shown promise in enhancing the performance of RGB-based tracking by providing richer, contextually aware information that helps to address issues such as appearance variations, model drift, and ambiguous target representation. However, the growing complexity of VOT tasks, particularly in scenarios involving fast-moving objects and challenging lighting conditions, necessitates more robust and adaptable tracking frameworks. Traditional visual trackers, which rely solely on RGB data, often struggle with these challenges. Event cameras offer a promising solution: because of their very low latency, they capture changes in a scene as they happen, making them highly effective where conventional visible-light cameras fail, such as in low-light environments or when tracking rapid motion. Despite progress in event-based and NL tracking, the fusion of events and NL remains underexplored due to the lack of large-scale NL-described datasets and event-based benchmarks. To address these gaps, we present ELTrack, a novel multi-modal NL-based VOT framework that, to the best of our knowledge, is the first to integrate event data with NL descriptions in VOT. ELTrack synthesizes event data, filters out noise, and applies imprinting and a step decay function to produce a novel event image representation called Pseudo-Frames. Additionally, we generate NL descriptions using a Visual-Language (VL) image-captioning module featuring BLIP-2 and GPT-4. These modalities are seamlessly integrated via a superimpose fusion module to enhance tracking performance. ELTrack is a generic pipeline that can be integrated with any existing state-of-the-art (SoTA) tracker. Extensive experiments demonstrate that ELTrack achieves significantly better performance across a variety of publicly available VOT datasets.
The source code of ELTrack is publicly available at: https://github.com/HamadYA/ELTrack-Correlating-Events-and-Language-for-Visual-Tracking.
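The abstract's Pseudo-Frame step (imprinting events into an image with a step decay function, plus noise filtering) can be illustrated with a minimal sketch. The paper's exact formulation is not given here, so the function below is an assumption-based approximation: the `pseudo_frame` name, the bin-based step decay, and the count-based noise filter are all hypothetical choices, not the authors' implementation.

```python
import numpy as np

def pseudo_frame(events, shape, t_ref, bin_width=5e-3, decay=0.5, min_events=1):
    """Accumulate sparse events into one 2-D Pseudo-Frame-style image.

    Each event (x, y, t, polarity) is imprinted at its pixel with a weight
    that decays in discrete steps as its timestamp moves away from the
    reference time t_ref (a step decay rather than a continuous one).
    Hypothetical sketch: parameters and decay shape are assumptions.
    """
    frame = np.zeros(shape, dtype=np.float32)
    counts = np.zeros(shape, dtype=np.int32)
    for x, y, t, p in events:
        # Step decay: weight drops by `decay` for each full temporal bin
        # separating the event from the reference time.
        step = int(abs(t_ref - t) // bin_width)
        frame[int(y), int(x)] += p * (decay ** step)
        counts[int(y), int(x)] += 1
    # Simple noise filter: suppress pixels supported by too few events.
    frame[counts < min_events] = 0.0
    # Normalize to [-1, 1] so the result can be used as an image channel.
    m = np.abs(frame).max()
    return frame / m if m > 0 else frame
```

A recent event near `t_ref` keeps full weight, while an event two bins away contributes only `decay**2` of its polarity, so fast motion leaves a fading trail rather than a uniform smear.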