IEEE Access • Vol 13
AITtrack: Attention-Based Image-Text Alignment for Visual Tracking
January 2025 • Basit Alawode, Sajid Javed
Vision-Language Models (VLMs) have recently advanced Visual Object Tracking (VOT) performance. In a VLM, a vision encoder produces visual representations, and a text encoder produces textual embeddings from natural language descriptions. By aligning the visual and textual representations, VLMs achieve robust performance in complex and diverse tracking scenarios, efficiently handling dynamic target appearances such as motion blur, occlusion, fast motion, and similar object dis…