Title: Towards Universal Modal Tracking with Online Dense Temporal Token Learning

Jul 27, 2025

Yaozong Zheng, Bineng Zhong, Qihua Liang, Shengping Zhang, Guorong Li, Xianxian Li, Rongrong Ji

Abstract: We propose a universal video-level modality-aware tracking model with online dense temporal token learning (called {\modaltracker}). It is designed to support various tracking tasks, including RGB, RGB+Thermal, RGB+Depth, and RGB+Event, using the same model architecture and parameters. Specifically, our model is designed with three core goals. Video-level Sampling: we expand the model's inputs to the video-sequence level, aiming to see a richer video context from a near-global perspective. Video-level Association: we introduce two simple yet effective online dense temporal token association mechanisms that propagate the appearance and motion trajectory information of the target in a video-stream manner. Modality Scalable: we propose two novel gated perceivers that adaptively learn cross-modal representations via a gated attention mechanism, and subsequently compress them into the same set of model parameters through one-shot training for multi-task inference. This new solution brings the following benefits: (i) the purified token sequences can serve as temporal prompts for inference on subsequent video frames, whereby previous information is leveraged to guide future inference; (ii) unlike multi-modal trackers that require independent training, our one-shot training scheme not only alleviates the training burden but also improves model representation. Extensive experiments on visible and multi-modal benchmarks show that our {\modaltracker} achieves new state-of-the-art performance. The code will be available at https://github.com/GXNU-ZhongLab/ODTrack.
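The temporal-prompt idea in benefit (i) can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a single token vector, single-head dot-product attention with no learned weights, and a fixed blending factor `mix`. The function names (`attend`, `propagate_token`) and the `mix` parameter are hypothetical and chosen only for illustration. The token summarizes each frame's features and is then carried forward as the query for the next frame.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, features):
    """Single-head scaled dot-product attention of one query over frame features."""
    scale = math.sqrt(len(query))
    scores = [sum(q * f for q, f in zip(query, feat)) / scale for feat in features]
    weights = softmax(scores)
    return [
        sum(w * feat[d] for w, feat in zip(weights, features))
        for d in range(len(query))
    ]


def propagate_token(frames, init_token, mix=0.5):
    """Carry one temporal token through a sequence of frames.

    frames: list of frames, each a list of feature vectors (hypothetical input).
    mix: assumed fixed blend between the old token and the new frame context;
         a real model would learn this update rather than fix it.
    Returns the token after each frame.
    """
    token = list(init_token)
    history = []
    for features in frames:
        context = attend(token, features)  # summarize the current frame
        token = [(1 - mix) * t + mix * c for t, c in zip(token, context)]
        history.append(token)  # this token prompts the next frame
    return history


frames = [
    [[1.0, 0.0], [0.0, 1.0]],  # frame 1: two feature vectors
    [[1.0, 0.0], [1.0, 0.0]],  # frame 2: two identical feature vectors
]
history = propagate_token(frames, init_token=[0.0, 0.0])
```

Because the token is updated rather than reset at every frame, information from earlier frames influences how later frames are summarized, which is the sense in which past tokens act as prompts for future inference.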

* arXiv admin note: text overlap with arXiv:2401.01686
