Abstract:Transformer-based multi-object tracking (MOT) methods have captured the attention of many researchers in recent years. However, these models often suffer from slow inference speeds due to their structure or other issues. To address this problem, we revisited the Joint Detection and Tracking (JDT) method by looking back at past approaches. By integrating the original JDT approach with some advanced theories, this paper employs an efficient method of information transfer between frames on the DETR, constructing a fast and novel JDT-type MOT framework: FastTrackTr. Thanks to the superiority of this information transfer method, our approach not only reduces the number of queries required during tracking but also avoids the excessive introduction of network structures, ensuring model simplicity. Experimental results indicate that our method has the potential to achieve real-time tracking and exhibits competitive tracking accuracy across multiple datasets.
Abstract:Multi-object tracking is advancing through two dominant paradigms: traditional tracking by detection and newly emerging tracking by query. In this work, we fuse them together and propose the tracking-by-detection-and-query paradigm, which is achieved by a Learnable Associator. Specifically, the basic information interaction module and the content-position alignment module are proposed for thorough information Interaction among object queries. Tracking results are directly Decoded from these queries. Hence, we name the method as LAID. Compared to tracking-by-query models, LAID achieves competitive tracking accuracy with notably higher training efficiency. With regard to tracking-by-detection methods, experimental results on DanceTrack show that LAID significantly surpasses the state-of-the-art heuristic method by 3.9% on HOTA metric and 6.1% on IDF1 metric. On SportsMOT, LAID also achieves the best score on HOTA metric. By holding low training cost, strong tracking capabilities, and an elegant end-to-end approach all at once, LAID presents a forward-looking direction for the field.
Abstract:Analyzing student actions is an important and challenging task in educational research. Existing efforts have been hampered by the lack of accessible datasets to capture the nuanced action dynamics in classrooms. In this paper, we present a new multi-label student action video (SAV) dataset for complex classroom scenes. The dataset consists of 4,324 carefully trimmed video clips from 758 different classrooms, each labeled with 15 different actions displayed by students in classrooms. Compared to existing behavioral datasets, our dataset stands out by providing a wide range of real classroom scenarios, high-quality video data, and unique challenges, including subtle movement differences, dense object engagement, significant scale differences, varied shooting angles, and visual occlusion. The increased complexity of the dataset brings new opportunities and challenges for benchmarking action detection. Innovatively, we also propose a new baseline method, a visual transformer for enhancing attention to key local details in small and dense object regions. Our method achieves excellent performance with mean Average Precision (mAP) of 67.9\% and 27.4\% on SAV and AVA, respectively. This paper not only provides the dataset but also calls for further research into AI-driven educational tools that may transform teaching methodologies and learning outcomes. The code and dataset will be released at https://github.com/Ritatanz/SAV.
Abstract:The goal of image cropping is to identify visually appealing crops within an image. Conventional methods rely on specialized architectures trained on specific datasets, which struggle to be adapted to new requirements. Recent breakthroughs in large vision-language models (VLMs) have enabled visual in-context learning without explicit training. However, effective strategies for vision downstream tasks with VLMs remain largely unclear and underexplored. In this paper, we propose an effective approach to leverage VLMs for better image cropping. First, we propose an efficient prompt retrieval mechanism for image cropping to automate the selection of in-context examples. Second, we introduce an iterative refinement strategy to iteratively enhance the predicted crops. The proposed framework, named Cropper, is applicable to a wide range of cropping tasks, including free-form cropping, subject-aware cropping, and aspect ratio-aware cropping. Extensive experiments and a user study demonstrate that Cropper significantly outperforms state-of-the-art methods across several benchmarks.
Abstract:Recognizing and disentangling visual attributes from objects is a foundation to many computer vision applications. While large vision language representations like CLIP had largely resolved the task of zero-shot object recognition, zero-shot visual attribute recognition remains a challenge because CLIP's contrastively-learned vision-language representation cannot effectively capture object-attribute dependencies. In this paper, we target this weakness and propose a sentence generation-based retrieval formulation for attribute recognition that is novel in 1) explicitly modeling a to-be-measured and retrieved object-attribute relation as a conditional probability graph, which converts the recognition problem into a dependency-sensitive language-modeling problem, and 2) applying a large pretrained Vision-Language Model (VLM) on this reformulation and naturally distilling its knowledge of image-object-attribute relations to use towards attribute recognition. Specifically, for each attribute to be recognized on an image, we measure the visual-conditioned probability of generating a short sentence encoding the attribute's relation to objects on the image. Unlike contrastive retrieval, which measures likelihood by globally aligning elements of the sentence to the image, generative retrieval is sensitive to the order and dependency of objects and attributes in the sentence. We demonstrate through experiments that generative retrieval consistently outperforms contrastive retrieval on two visual reasoning datasets, Visual Attribute in the Wild (VAW), and our newly-proposed Visual Genome Attribute Ranking (VGARank).
Abstract:Integrated sensing and communication (ISAC) emerges as an essential technique for overcoming spectrum congestion. However, the performance of traditional ISAC systems with fixed-position-antennas (FPA) is limited due to insufficient spatial degree of freedom (DoF) exploration. Recently, fluid antenna (FA) with reconfigurable antenna position is developed to enhance the sensing and communication performance by reshaping the channel. This paper investigates an FA-enhanced ISAC system where a base station is equipped with multiple FAs to communicate with multiple single-antenna users and with FPAs to sense a point target. In this paper, we consider both perfect and imperfect channel state information (CSI) of the communication channel and sensing channel. In two cases, we focus on the maximization of the sensing signal-to-noise (SNR) by optimizing the positions of FAs and the dual-functional beamforming under the constraints of the FA moving region, the minimum FA distance and the minimum signal-to-interference-plus-noise (SINR) per user. Specifically, for the ideal case of perfect CSI, an iterative alternating optimization (AO) algorithm is proposed to tackle the formulated problem where the dual-functional beamforming and the FA positions are obtained via semidefinite relaxation (SDR) and successive convex approximation (SCA) techniques. Then, for the imperfect CSI case, we propose an AO-based iterative algorithm where $\mathcal{S}-$Procedure and SCA are applied to obtain the dual-functional beamforming and the FA positions. Furthermore, we analytically and numerically prove the convergence of the proposed algorithms. Numerical results demonstrate the notable gains of the proposed algorithms in the respective cases.
Abstract:Diffusion models generate new samples by progressively decreasing the noise from the initially provided random distribution. This inference procedure generally utilizes a trained neural network numerous times to obtain the final output, creating significant latency and energy consumption on digital electronic hardware such as GPUs. In this study, we demonstrate that the propagation of a light beam through a semi-transparent medium can be programmed to implement a denoising diffusion model on image samples. This framework projects noisy image patterns through passive diffractive optical layers, which collectively only transmit the predicted noise term in the image. The optical transparent layers, which are trained with an online training approach, backpropagating the error to the analytical model of the system, are passive and kept the same across different steps of denoising. Hence this method enables high-speed image generation with minimal power consumption, benefiting from the bandwidth and energy efficiency of optical information processing.
Abstract:This paper investigates a fluid antenna (FA) enhanced integrated sensing and communication (ISAC) system consisting of a base station (BS), multiple single-antenna communication users, and one point target, where the BS is equipped with FAs to enhance both the communication and sensing performance. First, we formulate a problem that maximizes the radar signal-to-noise ratio (SNR) by jointly optimizing the FAs' positions and transmit beamforming matrix. Then, to tackle this highly non-convex problem, we present efficient algorithms by using alternating optimization (AO), successive convex approximation (SCA), and semi-definite relaxation (SDR). Numerical results demonstrate the convergence behavior and effectiveness of the proposed algorithm.
Abstract:Monocular vision-based 3D object detection is crucial in various sectors, yet existing methods face significant challenges in terms of accuracy and computational efficiency. Building on the successful strategies in 2D detection and depth estimation, we propose MonoDETRNext, which seeks to optimally balance precision and processing speed. Our methodology includes the development of an efficient hybrid visual encoder, enhancement of depth prediction mechanisms, and introduction of an innovative query generation strategy, augmented by an advanced depth predictor. Building on MonoDETR, MonoDETRNext introduces two variants: MonoDETRNext-F, which emphasizes speed, and MonoDETRNext-A, which focuses on precision. We posit that MonoDETRNext establishes a new benchmark in monocular 3D object detection and opens avenues for future research. We conducted an exhaustive evaluation demonstrating the model's superior performance against existing solutions. Notably, MonoDETRNext-A demonstrated a 4.60% improvement in the AP3D metric on the KITTI test benchmark over MonoDETR, while MonoDETRNext-F showed a 2.21% increase. Additionally, the computational efficiency of MonoDETRNext-F slightly exceeds that of its predecessor.
Abstract:The advances in the Neural Radiance Fields (NeRF) research offer extensive applications in diverse domains, but protecting their copyrights has not yet been researched in depth. Recently, NeRF watermarking has been considered one of the pivotal solutions for safely deploying NeRF-based 3D representations. However, existing methods are designed to apply only to implicit or explicit NeRF representations. In this work, we introduce an innovative watermarking method that can be employed in both representations of NeRF. This is achieved by fine-tuning NeRF to embed binary messages in the rendering process. In detail, we propose utilizing the discrete wavelet transform in the NeRF space for watermarking. Furthermore, we adopt a deferred back-propagation technique and introduce a combination with the patch-wise loss to improve rendering quality and bit accuracy with minimum trade-offs. We evaluate our method in three different aspects: capacity, invisibility, and robustness of the embedded watermarks in the 2D-rendered images. Our method achieves state-of-the-art performance with faster training speed over the compared state-of-the-art methods.