Topic: Multiple Object Tracking
What is Multiple Object Tracking? Multiple object tracking (MOT) is the task of detecting multiple objects in a video sequence and maintaining a consistent identity for each of them across frames.
Papers and Code
Apr 23, 2025
Abstract:Multiobject tracking provides situational awareness that enables new applications for modern convenience, applied ocean sciences, public safety, and homeland security. In many multiobject tracking applications, including radar and sonar tracking, after coherent prefiltering of the received signal, measurement data is typically structured in cells, where each cell represents, e.g., a different range and bearing value. While conventional detect-then-track (DTT) multiobject tracking approaches convert the cell-structured data into so-called point measurements during a detection phase in order to reduce the amount of data, track-before-detect (TBD) methods process the cell-structured data directly, avoiding a potential information loss. However, many TBD tracking methods are computationally intensive and suffer reduced tracking accuracy when objects interact, i.e., when they come into close proximity. We counteract these difficulties by introducing the concept of probabilistic object-to-cell contributions. Like many conventional DTT methods, our approach uses a probabilistic association of objects with data cells, and a new object contribution model with corresponding object contribution probabilities to further associate cell contributions to objects that occupy the same data cell. Furthermore, to keep computational complexity and filter runtimes low, we use an efficient Poisson multi-Bernoulli filtering approach combined with belief propagation for fast probabilistic data association. We demonstrate numerically that our method achieves significantly increased tracking performance compared to state-of-the-art TBD tracking approaches, with performance differences particularly pronounced when multiple objects interact.
* 13 pages, submitted to IEEE Transactions on Signal Processing
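As a rough illustration of the object-to-cell association idea, the sketch below computes soft association probabilities between object state estimates and measurement cells from a Gaussian likelihood. It is a toy stand-in, not the paper's Poisson multi-Bernoulli / belief propagation filter; the Gaussian contribution model and all names are assumptions.

```python
import numpy as np

# Toy sketch of probabilistic object-to-cell association (not the paper's
# PMB/belief-propagation filter; the Gaussian cell likelihood is an
# illustrative assumption).

def association_probabilities(object_means, cell_centers, sigma=1.0):
    """Soft association: p(object i contributes to cell j), normalized per object."""
    # Squared distances between every object estimate and every cell center.
    d2 = ((object_means[:, None, :] - cell_centers[None, :, :]) ** 2).sum(-1)
    lik = np.exp(-0.5 * d2 / sigma**2)            # Gaussian contribution likelihood
    return lik / lik.sum(axis=1, keepdims=True)   # normalize over cells

# Two objects in close proximity sharing nearby range-bearing cells.
objects = np.array([[2.0, 1.0], [2.5, 1.2]])
cells = np.array([[2.0, 1.0], [2.5, 1.0], [3.0, 1.5]])
print(association_probabilities(objects, cells))
```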

Apr 24, 2025
Abstract:Accurate 3D trajectory data is crucial for advancing autonomous driving. Yet, traditional datasets are usually captured by fixed sensors mounted on a car and are susceptible to occlusion. Additionally, such an approach can precisely reconstruct the dynamic environment only in the close vicinity of the measurement vehicle, while neglecting objects that are further away. In this paper, we introduce the DeepScenario Open 3D Dataset (DSC3D), a high-quality, occlusion-free dataset of 6-degrees-of-freedom bounding box trajectories acquired through a novel monocular camera drone tracking pipeline. Our dataset includes more than 175,000 trajectories of 14 types of traffic participants and significantly exceeds existing datasets in diversity and scale, containing many unprecedented scenarios such as complex vehicle-pedestrian interactions on highly populated urban streets and comprehensive parking maneuvers from entry to exit. The DSC3D dataset was captured at five different locations in Europe and the United States: a parking lot, a crowded inner city, a steep urban intersection, a federal highway, and a suburban intersection. Our 3D trajectory dataset aims to enhance autonomous driving systems by providing detailed 3D representations of the environment, which could lead to improved obstacle interactions and safety. We demonstrate its utility across multiple applications, including motion prediction, motion planning, scenario mining, and generative reactive traffic agents. Our interactive online visualization platform and the complete dataset are publicly available at app.deepscenario.com, facilitating research in motion prediction, behavior modeling, and safety validation.
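For a concrete picture of what a 6-degrees-of-freedom bounding-box trajectory contains, here is a minimal, hypothetical record layout; the field names are illustrative and not the actual DSC3D schema.

```python
from dataclasses import dataclass, field

# Hypothetical record layout for a 6-DoF bounding-box trajectory; the field
# names are illustrative assumptions, not the actual DSC3D schema.

@dataclass
class BoxState:
    t: float                           # timestamp in seconds
    xyz: tuple[float, float, float]    # position (m)
    rpy: tuple[float, float, float]    # roll, pitch, yaw (rad) -- the 6 DoF
    lwh: tuple[float, float, float]    # box length, width, height (m)

@dataclass
class Trajectory:
    track_id: int
    category: str                      # one of the 14 traffic-participant types
    states: list[BoxState] = field(default_factory=list)

traj = Trajectory(track_id=0, category="pedestrian")
traj.states.append(BoxState(0.0, (1.0, 2.0, 0.0), (0.0, 0.0, 1.57), (0.5, 0.5, 1.8)))
```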

Apr 12, 2025
Abstract:Tracking multiple objects in a continuous video stream is crucial for many computer vision tasks. It involves detecting and associating objects with their respective identities across successive frames. Despite significant progress made in multiple object tracking (MOT), recent studies have revealed the vulnerability of existing MOT methods to adversarial attacks. However, all of these are digital attacks that inject pixel-level noise into input images and are therefore ineffective in physical scenarios. To fill this gap, we propose PapMOT, which can generate physical adversarial patches against MOT for both digital and physical scenarios. Besides attacking the detection mechanism, PapMOT also optimizes a printable patch that can be detected as a new target to mislead the identity association process. Moreover, we introduce a patch enhancement strategy to further degrade the temporal consistency of tracking results across video frames, resulting in more aggressive attacks. We further develop new evaluation metrics to assess the robustness of MOT against such attacks. Extensive evaluations on multiple datasets demonstrate that our PapMOT can successfully attack various architectures of MOT trackers in digital scenarios. We also validate the effectiveness of PapMOT for physical attacks by deploying printed adversarial patches in the real world.
* Accepted by ECCV 2024
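The patch-optimization idea can be sketched with a generic adversarial loop like the one below; `tracker_loss` is a placeholder for PapMOT's actual detection and identity-association objectives, and the pasting step ignores printability and perspective, which a physical attack must model.

```python
import torch

# Generic adversarial-patch optimization loop (a sketch; `tracker_loss` is a
# placeholder for PapMOT's actual detection/association objectives).

def apply_patch(frames, patch, y=50, x=50):
    # Paste the patch into every frame at a fixed location (no printability
    # or homography modelling here, unlike a real physical attack).
    out = frames.clone()
    out[:, :, y:y + patch.shape[1], x:x + patch.shape[2]] = patch
    return out

def tracker_loss(frames):
    # Placeholder: a real attack would run the MOT model and score how well
    # detections/identities are disrupted. Here we just return a dummy scalar.
    return frames.mean()

patch = torch.rand(3, 64, 64, requires_grad=True)
opt = torch.optim.Adam([patch], lr=0.01)
video = torch.rand(8, 3, 256, 256)   # 8 dummy frames

for _ in range(100):
    loss = tracker_loss(apply_patch(video, patch))
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        patch.clamp_(0, 1)           # keep the patch a valid image
```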

Apr 19, 2025
Abstract:Thermal infrared (TIR) images typically lack detailed features and have low contrast, making it challenging for conventional feature extraction models to capture discriminative target characteristics. As a result, trackers are often affected by interference from visually similar objects and are susceptible to tracking drift. To address these challenges, we propose a novel saliency-guided Siamese network tracker based on key fine-grained feature information. First, we introduce a fine-grained feature parallel learning convolutional block with a dual-stream architecture and convolutional kernels of varying sizes. This design captures essential global features from shallow layers, enhances feature diversity, and minimizes the loss of fine-grained information typically encountered in residual connections. In addition, we propose a multi-layer fine-grained feature fusion module that uses bilinear matrix multiplication to effectively integrate features across both deep and shallow layers. Next, we introduce a Siamese residual refinement block that corrects saliency map prediction errors using residual learning. Combined with deep supervision, this mechanism progressively refines predictions, applying supervision at each recursive step to ensure consistent improvements in accuracy. Finally, we present a saliency loss function to constrain the saliency predictions, directing the network to focus on highly discriminative fine-grained features. Extensive experimental results demonstrate that the proposed tracker achieves the highest precision and success rates on the PTB-TIR and LSOTB-TIR benchmarks. It also achieves a top accuracy of 0.78 on the VOT-TIR 2015 benchmark and 0.75 on the VOT-TIR 2017 benchmark.
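The bilinear fusion step can be illustrated as below: correlate a shallow and a deep feature map via matrix multiplication to obtain second-order statistics. Channel sizes, the upsampling, and the final normalization are assumptions, not the tracker's exact design.

```python
import torch
import torch.nn.functional as F

# Sketch of multi-layer bilinear feature fusion: correlate a shallow and a
# deep feature map via matrix multiplication. Shapes and the final pooling
# are illustrative assumptions.

def bilinear_fuse(shallow, deep):
    b, c1, h, w = shallow.shape
    deep = F.interpolate(deep, size=(h, w), mode="bilinear", align_corners=False)
    c2 = deep.shape[1]
    x = shallow.reshape(b, c1, h * w)            # (B, C1, HW)
    y = deep.reshape(b, c2, h * w)               # (B, C2, HW)
    fused = torch.bmm(x, y.transpose(1, 2))      # (B, C1, C2) second-order stats
    fused = fused / (h * w)
    return F.normalize(fused.flatten(1), dim=1)  # L2-normalized descriptor

s = torch.rand(2, 64, 32, 32)     # shallow, fine-grained features
d = torch.rand(2, 256, 16, 16)    # deep, semantic features
print(bilinear_fuse(s, d).shape)  # torch.Size([2, 16384])
```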

Apr 12, 2025
Abstract:Tracking multiple objects based on textual queries is a challenging task that requires linking language understanding with object association across frames. Previous works typically train the whole process end-to-end or integrate an additional referring text module into a multi-object tracker, but they both require supervised training and potentially struggle with generalization to open-set queries. In this work, we introduce ReferGPT, a novel zero-shot referring multi-object tracking framework. We provide a multi-modal large language model (MLLM) with spatial knowledge enabling it to generate 3D-aware captions. This enhances its descriptive capabilities and supports a more flexible referring vocabulary without training. We also propose a robust query-matching strategy, leveraging CLIP-based semantic encoding and fuzzy matching to associate MLLM-generated captions with user queries. Extensive experiments on Refer-KITTI, Refer-KITTIv2 and Refer-KITTI+ demonstrate that ReferGPT achieves competitive performance against trained methods, showcasing its robustness and zero-shot capabilities in autonomous driving. The code is available at https://github.com/Tzoulio/ReferGPT
* Accepted at CVPR 2025 Workshop on Distillation of Foundation Models for Autonomous Driving
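The query-matching strategy can be sketched as a weighted blend of semantic and fuzzy string similarity; in the sketch below, `encode_text` is a stand-in for a CLIP text encoder and the equal weighting is an assumed configuration, not ReferGPT's tuned setup.

```python
from difflib import SequenceMatcher
import numpy as np

# Sketch of the query-matching idea: combine semantic similarity between a
# caption and a user query with a fuzzy string score.

def encode_text(text: str) -> np.ndarray:
    # Placeholder embedding: hash characters into a fixed vector. A real
    # system would call a CLIP text encoder here.
    v = np.zeros(64)
    for i, ch in enumerate(text.lower()):
        v[(i + ord(ch)) % 64] += 1.0
    return v / (np.linalg.norm(v) + 1e-8)

def match_score(caption: str, query: str) -> float:
    semantic = float(encode_text(caption) @ encode_text(query))  # cosine (unit vectors)
    fuzzy = SequenceMatcher(None, caption.lower(), query.lower()).ratio()
    return 0.5 * semantic + 0.5 * fuzzy             # assumed equal weighting

captions = ["a silver car turning left", "a pedestrian crossing the road"]
query = "the car that is turning"
best = max(captions, key=lambda c: match_score(c, query))
print(best)
```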

Apr 09, 2025
Abstract:This paper deals with the multi-object detection and tracking problem, within the scope of Open Radio Access Network (RAN), for collision avoidance in vehicular scenarios. To this end, a set of distributed intelligent agents collocated with cameras is considered. The fusion of detected objects is done at an edge service, considering Open RAN connectivity. The edge service then predicts the objects' trajectories for collision avoidance. Compared to related work, a more realistic Open RAN network is implemented and multiple cameras are used.
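A minimal version of the trajectory-prediction step at the edge service might look like the following: a constant-velocity forecast plus a pairwise closest-approach test. The motion model and the 2 m safety radius are assumptions, not the paper's predictor.

```python
import numpy as np

# Minimal sketch of the edge-service step: predict object trajectories with
# a constant-velocity model and flag pairs whose predicted closest approach
# falls below a safety radius.

def predict(pos, vel, horizon=3.0, dt=0.1):
    steps = np.arange(0.0, horizon, dt)
    return pos[None, :] + steps[:, None] * vel[None, :]   # (T, 2) waypoints

def collision_risk(p1, v1, p2, v2, radius=2.0):
    t1, t2 = predict(p1, v1), predict(p2, v2)
    dmin = np.linalg.norm(t1 - t2, axis=1).min()          # closest approach
    return dmin < radius, dmin

# Two vehicles approaching the same intersection point.
risk, dmin = collision_risk(np.array([0.0, 0.0]), np.array([5.0, 0.0]),
                            np.array([10.0, -10.0]), np.array([0.0, 5.0]))
print(risk, round(dmin, 2))
```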

Apr 16, 2025
Abstract:Distributed multiple-input multiple-output (MIMO), also known as cell-free massive MIMO, emerges as a promising technology for sixth-generation (6G) systems to support uniform coverage and reliable communication. For the design and optimization of such systems, measurement-based investigations of real-world distributed MIMO channels are essential. In this paper, we present an indoor channel measurement campaign, featuring eight distributed antenna arrays with 128 elements in total. Multi-link channels are measured at 50 positions along a 12-meter user route. A clustering algorithm enabled by interacting objects is proposed to identify clusters in the measured channels. The algorithm jointly clusters the multipath components for all links, effectively capturing the dynamic contributions of common clusters to different links. In addition, a Kalman filter-based tracking framework is introduced for cluster prediction, tracking, and updating along the user movement. Using the clustering and tracking results, cluster-level characterization of the measured channels is performed. First, the number of clusters and their visibility at both link ends are analyzed. Next, a maximum-likelihood estimator is utilized to determine the entire cluster visibility region length. Finally, key cluster-level properties, including the common cluster ratio, cluster power, shadowing, spread, among others, are statistically investigated. The results provide valuable insights into cluster behavior in typical multi-link channels, necessary for accurate modeling of distributed MIMO channels.
* This paper has been submitted to IEEE Transactions on Wireless Communications. 13 pages, 13 figures, 2 tables
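The cluster tracking component can be illustrated with a textbook constant-velocity Kalman filter over a scalar cluster parameter (e.g., a delay centroid along the user route); the matrices and noise levels below are generic assumptions, not the paper's tuned framework.

```python
import numpy as np

# Textbook constant-velocity Kalman filter as a stand-in for the paper's
# cluster prediction/tracking/update framework.

dt = 1.0                                   # one measurement position per step
F = np.array([[1, dt], [0, 1]])            # state: [cluster parameter, drift rate]
H = np.array([[1.0, 0.0]])                 # we only observe the parameter itself
Q = 0.01 * np.eye(2)                       # process noise (assumed)
R = np.array([[0.5]])                      # measurement noise (assumed)

x = np.array([[0.0], [0.0]])
P = np.eye(2)

measurements = [0.1, 0.35, 0.7, 1.1, 1.4]  # e.g., cluster delay along the route
for z in measurements:
    x, P = F @ x, F @ P @ F.T + Q          # predict along user movement
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)         # Kalman gain
    x = x + K @ (np.array([[z]]) - H @ x)  # update with the new measurement
    P = (np.eye(2) - K @ H) @ P
print(x.ravel())                           # tracked parameter and its drift
```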

Apr 13, 2025
Abstract:In multi-target tracking and detection tasks, the system must continuously track multiple targets, such as vehicles and pedestrians. To achieve this, it must continuously acquire and process image frames containing these targets, so the algorithm can update each target's position and state in every frame. Accurately associating a detected target with the corresponding target in the previous or next frame to form a stable trajectory is a complex problem. Therefore, a multi-object tracking and detection method for intelligent driving vehicles based on YOLOv5 and point cloud 3D projection is proposed. The Retinex algorithm is used to enhance the image of the environment in front of the vehicle and remove lighting interference, and an intelligent detection model is built on the YOLOv5 network structure. The enhanced image is fed into the model, and multiple targets in front of the vehicle are identified through feature extraction and target localization. By combining point cloud 3D projection technology, the correlation between position changes of adjacent frames in the projection coordinate system can be inferred. By sequentially projecting the multi-target recognition results of consecutive frames into the 3D laser point cloud environment, the motion trajectories of all targets in front of the vehicle can be tracked effectively. Experimental results show that applying this method to multi-target tracking and detection in front of intelligent driving vehicles yields a MOTA (Multiple Object Tracking Accuracy) value greater than 30, demonstrating superior tracking and detection performance.
* In Chinese
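The projection step that links 2D detections to the laser point cloud can be sketched with a pinhole model: project the 3D points into the image and keep those falling inside a YOLOv5 box. The intrinsic matrix and (x1, y1, x2, y2) box format below are assumptions.

```python
import numpy as np

# Sketch of the point-cloud-to-image association step: project 3D points
# through a pinhole camera and keep those landing inside a detected 2D box.

K = np.array([[700.0, 0.0, 320.0],    # assumed camera intrinsics
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])

def points_in_box(points_cam, box):
    """points_cam: (N, 3) in camera coordinates; box: (x1, y1, x2, y2)."""
    front = points_cam[points_cam[:, 2] > 0]        # keep points ahead of camera
    uv = (K @ front.T).T
    uv = uv[:, :2] / uv[:, 2:3]                     # perspective divide
    x1, y1, x2, y2 = box
    mask = (uv[:, 0] >= x1) & (uv[:, 0] <= x2) & (uv[:, 1] >= y1) & (uv[:, 1] <= y2)
    return front[mask]                              # 3D points backing this detection

cloud = np.random.uniform([-5, -2, 2], [5, 2, 30], size=(1000, 3))
box = (280, 200, 360, 280)                          # a YOLOv5 detection, pixel coords
obj_points = points_in_box(cloud, box)
print(obj_points.shape)
```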

Apr 16, 2025
Abstract:Recent advances in sign language research have benefited from CNN-based backbones, which are primarily transferred from traditional computer vision tasks (e.g., object identification, image recognition). However, these CNN-based backbones usually excel at extracting features like contours and texture, but may struggle to capture sign-related features. In fact, sign language tasks require focusing on sign-related regions, including the collaboration between different regions (e.g., the left hand region and right hand region) and the effective content in a single region. To capture such region-related features, we introduce MixSignGraph, which represents sign sequences as a group of mixed graphs and designs the following three graph modules for feature extraction: a Local Sign Graph (LSG) module, a Temporal Sign Graph (TSG) module, and a Hierarchical Sign Graph (HSG) module. Specifically, the LSG module learns the correlation of intra-frame cross-region features within one frame, i.e., focusing on spatial features. The TSG module tracks the interaction of inter-frame cross-region features among adjacent frames, i.e., focusing on temporal features. The HSG module aggregates the same-region features from different-granularity feature maps of a frame, i.e., focusing on hierarchical features. In addition, to further improve the performance of sign language tasks without gloss annotations, we propose a simple yet counter-intuitive Text-driven CTC Pre-training (TCP) method, which generates pseudo gloss labels from text labels for model pre-training. Extensive experiments conducted on five public sign language datasets demonstrate the superior performance of the proposed model. Notably, our model surpasses the SOTA models on multiple sign language tasks across several datasets, without relying on any additional cues.
* 17 pages, 9 figures, submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI). This is a regular paper submission
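The LSG module's cross-region idea can be caricatured as a small graph layer: per-region features of one frame are nodes, mixed through a learned adjacency. Region count, dimensions, and the residual update below are illustrative assumptions, not the paper's exact module.

```python
import torch
import torch.nn as nn

# Toy sketch of the LSG-style idea: per-region features of one frame as
# graph nodes, mixed through a learned, softmax-normalized adjacency.

class LocalSignGraph(nn.Module):
    def __init__(self, num_regions=7, dim=256):
        super().__init__()
        self.adj = nn.Parameter(torch.zeros(num_regions, num_regions))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                         # x: (B, R, D) region features
        A = torch.softmax(self.adj, dim=-1)       # learned cross-region weights
        mixed = A @ x                             # aggregate features across regions
        return x + torch.relu(self.proj(mixed))   # residual update per node

regions = torch.rand(2, 7, 256)            # e.g., hands, face, body crops
print(LocalSignGraph()(regions).shape)     # torch.Size([2, 7, 256])
```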

Apr 06, 2025
Abstract:Lobula plate/lobula columnar, type 2 (LPLC2) visual projection neurons in the fly's visual system possess highly looming-selective properties, making them ideal for developing artificial collision detection systems. The four dendritic branches of individual LPLC2 neurons, each tuned to specific directional motion, enhance the robustness of looming detection by utilizing radial motion opponency. Existing models of LPLC2 neurons either concentrate on individual cells to detect centroid-focused expansion or utilize population-voting strategies to obtain global collision information. However, their potential for addressing multi-target collision scenarios remains largely untapped. In this study, we propose a numerical model for LPLC2 populations, leveraging a bottom-up attention mechanism driven by motion-sensitive neural pathways to generate attention fields (AFs). This integration of AFs with highly nonlinear LPLC2 responses enables precise and continuous detection of multiple looming objects emanating from any region of the visual field. We began by conducting comparative experiments to evaluate the proposed model against two related models, highlighting its unique characteristics. Next, we tested its ability to detect multiple targets in dynamic natural scenarios. Finally, we validated the model using real-world video data collected by aerial robots. Experimental results demonstrate that the proposed model excels in detecting, distinguishing, and tracking multiple looming targets with remarkable speed and accuracy. This advanced ability to detect and localize looming objects, especially in complex and dynamic environments, holds great promise for overcoming collision-detection challenges in mobile intelligent machines.
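The radial motion opponency behind LPLC2 selectivity can be caricatured in a few lines: respond at a point only when optical flow in all four directions around it moves outward, mirroring the four direction-tuned dendritic branches. This is a toy cue, not the paper's population or attention-field model.

```python
import numpy as np

# Caricature of LPLC2-style radial motion opponency: an object is "looming"
# at a point if the optical flow in all four directions around it points
# outward (expansion).

def looming_score(flow, cy, cx, r=5):
    """flow: (H, W, 2) array of (dy, dx) optical-flow vectors."""
    up    = -flow[cy - r, cx, 0]   # flow above the center should point up
    down  =  flow[cy + r, cx, 0]   # flow below should point down
    left  = -flow[cy, cx - r, 1]   # flow to the left should point left
    right =  flow[cy, cx + r, 1]   # flow to the right should point right
    branches = np.array([up, down, left, right])   # four "dendritic branches"
    # Opponency: respond only if every branch sees outward motion.
    return branches.min() if (branches > 0).all() else 0.0

# Synthetic expanding flow field centered at (32, 32).
yy, xx = np.mgrid[0:64, 0:64].astype(float)
flow = np.stack([(yy - 32) * 0.1, (xx - 32) * 0.1], axis=-1)
print(looming_score(flow, 32, 32))   # positive: expansion detected
print(looming_score(flow, 10, 10))   # 0.0: no centered expansion there
```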
