Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yanning Zhang

3D Single-object Tracking in Point Clouds with High Temporal Variation

Aug 04, 2024

Qiao Wu, Kun Sun, Pei An, Mathieu Salzmann, Yanning Zhang, Jiaqi Yang

Figure 1 for 3D Single-object Tracking in Point Clouds with High Temporal Variation

Figure 2 for 3D Single-object Tracking in Point Clouds with High Temporal Variation

Figure 3 for 3D Single-object Tracking in Point Clouds with High Temporal Variation

Figure 4 for 3D Single-object Tracking in Point Clouds with High Temporal Variation

Abstract:The high temporal variation of the point clouds is the key challenge of 3D single-object tracking (3D SOT). Existing approaches rely on the assumption that the shape variation of the point clouds and the motion of the objects across neighboring frames are smooth, failing to cope with high temporal variation data. In this paper, we present a novel framework for 3D SOT in point clouds with high temporal variation, called HVTrack. HVTrack proposes three novel components to tackle the challenges in the high temporal variation scenario: 1) A Relative-Pose-Aware Memory module to handle temporal point cloud shape variations; 2) a Base-Expansion Feature Cross-Attention module to deal with similar object distractions in expanded search areas; 3) a Contextual Point Guided Self-Attention module for suppressing heavy background noise. We construct a dataset with high temporal variation (KITTI-HV) by setting different frame intervals for sampling in the KITTI dataset. On the KITTI-HV with 5 frame intervals, our HVTrack surpasses the state-of-the-art tracker CXTracker by 11.3%/15.7% in Success/Precision.

* Accepted by ECCV24

Via

Access Paper or Ask Questions

Task-Adapter: Task-specific Adaptation of Image Models for Few-shot Action Recognition

Aug 01, 2024

Congqi Cao, Yueran Zhang, Yating Yu, Qinyi Lv, Lingtong Min, Yanning Zhang

Figure 1 for Task-Adapter: Task-specific Adaptation of Image Models for Few-shot Action Recognition

Figure 2 for Task-Adapter: Task-specific Adaptation of Image Models for Few-shot Action Recognition

Figure 3 for Task-Adapter: Task-specific Adaptation of Image Models for Few-shot Action Recognition

Figure 4 for Task-Adapter: Task-specific Adaptation of Image Models for Few-shot Action Recognition

Abstract:Existing works in few-shot action recognition mostly fine-tune a pre-trained image model and design sophisticated temporal alignment modules at feature level. However, simply fully fine-tuning the pre-trained model could cause overfitting due to the scarcity of video samples. Additionally, we argue that the exploration of task-specific information is insufficient when relying solely on well extracted abstract features. In this work, we propose a simple but effective task-specific adaptation method (Task-Adapter) for few-shot action recognition. By introducing the proposed Task-Adapter into the last several layers of the backbone and keeping the parameters of the original pre-trained model frozen, we mitigate the overfitting problem caused by full fine-tuning and advance the task-specific mechanism into the process of feature extraction. In each Task-Adapter, we reuse the frozen self-attention layer to perform task-specific self-attention across different videos within the given task to capture both distinctive information among classes and shared information within classes, which facilitates task-specific adaptation and enhances subsequent metric measurement between the query feature and support prototypes. Experimental results consistently demonstrate the effectiveness of our proposed Task-Adapter on four standard few-shot action recognition datasets. Especially on temporal challenging SSv2 dataset, our method outperforms the state-of-the-art methods by a large margin.

* Accepted by ACM MM2024

Via

Access Paper or Ask Questions

A Plug-and-Play Method for Rare Human-Object Interactions Detection by Bridging Domain Gap

Jul 31, 2024

Lijun Zhang, Wei Suo, Peng Wang, Yanning Zhang

Figure 1 for A Plug-and-Play Method for Rare Human-Object Interactions Detection by Bridging Domain Gap

Figure 2 for A Plug-and-Play Method for Rare Human-Object Interactions Detection by Bridging Domain Gap

Figure 3 for A Plug-and-Play Method for Rare Human-Object Interactions Detection by Bridging Domain Gap

Figure 4 for A Plug-and-Play Method for Rare Human-Object Interactions Detection by Bridging Domain Gap

Abstract:Human-object interactions (HOI) detection aims at capturing human-object pairs in images and corresponding actions. It is an important step toward high-level visual reasoning and scene understanding. However, due to the natural bias from the real world, existing methods mostly struggle with rare human-object pairs and lead to sub-optimal results. Recently, with the development of the generative model, a straightforward approach is to construct a more balanced dataset based on a group of supplementary samples. Unfortunately, there is a significant domain gap between the generated data and the original data, and simply merging the generated images into the original dataset cannot significantly boost the performance. To alleviate the above problem, we present a novel model-agnostic framework called \textbf{C}ontext-\textbf{E}nhanced \textbf{F}eature \textbf{A}lignment (CEFA) module, which can effectively align the generated data with the original data at the feature level and bridge the domain gap. Specifically, CEFA consists of a feature alignment module and a context enhancement module. On one hand, considering the crucial role of human-object pairs information in HOI tasks, the feature alignment module aligns the human-object pairs by aggregating instance information. On the other hand, to mitigate the issue of losing important context information caused by the traditional discriminator-style alignment method, we employ a context-enhanced image reconstruction module to improve the model's learning ability of contextual cues. Extensive experiments have shown that our method can serve as a plug-and-play module to improve the detection performance of HOI models on rare categories\footnote{https://github.com/LijunZhang01/CEFA}.

Via

Access Paper or Ask Questions

Visual Prompt Selection for In-Context Learning Segmentation

Jul 14, 2024

Wei Suo, Lanqing Lai, Mengyang Sun, Hanwang Zhang, Peng Wang, Yanning Zhang

Abstract:As a fundamental and extensively studied task in computer vision, image segmentation aims to locate and identify different semantic concepts at the pixel level. Recently, inspired by In-Context Learning (ICL), several generalist segmentation frameworks have been proposed, providing a promising paradigm for segmenting specific objects. However, existing works mostly ignore the value of visual prompts or simply apply similarity sorting to select contextual examples. In this paper, we focus on rethinking and improving the example selection strategy. By comprehensive comparisons, we first demonstrate that ICL-based segmentation models are sensitive to different contexts. Furthermore, empirical evidence indicates that the diversity of contextual prompts plays a crucial role in guiding segmentation. Based on the above insights, we propose a new stepwise context search method. Different from previous works, we construct a small yet rich candidate pool and adaptively search the well-matched contexts. More importantly, this method effectively reduces the annotation cost by compacting the search space. Extensive experiments show that our method is an effective strategy for selecting examples and enhancing segmentation performance.

* Accept by ECCV2024

Via

Access Paper or Ask Questions

Indoor 3D Reconstruction with an Unknown Camera-Projector Pair

Jul 02, 2024

Zhaoshuai Qi, Yifeng Hao, Rui Hu, Wenyou Chang, Jiaqi Yang, Yanning Zhang

Figure 1 for Indoor 3D Reconstruction with an Unknown Camera-Projector Pair

Figure 2 for Indoor 3D Reconstruction with an Unknown Camera-Projector Pair

Figure 3 for Indoor 3D Reconstruction with an Unknown Camera-Projector Pair

Figure 4 for Indoor 3D Reconstruction with an Unknown Camera-Projector Pair

Abstract:Structured light-based method with a camera-projector pair (CPP) plays a vital role in indoor 3D reconstruction, especially for scenes with weak textures. Previous methods usually assume known intrinsics, which are pre-calibrated from known objects, or self-calibrated from multi-view observations. It is still challenging to reliably recover CPP intrinsics from only two views without any known objects. In this paper, we provide a simple yet reliable solution. We demonstrate that, for the first time, sufficient constraints on CPP intrinsics can be derived from an unknown cuboid corner (C2), e.g. a room's corner, which is a common structure in indoor scenes. In addition, with only known camera principal point, the complex multi-variable estimation of all CPP intrinsics can be simplified to a simple univariable optimization problem, leading to reliable calibration and thus direct 3D reconstruction with unknown CPP. Extensive results have demonstrated the superiority of the proposed method over both traditional and learning-based counterparts. Furthermore, the proposed method also demonstrates impressive potential to solve similar tasks without active lighting, such as sparse-view structure from motion.

Via

Access Paper or Ask Questions

Visual Prompt Tuning in Null Space for Continual Learning

Jun 11, 2024

Yue Lu, Shizhou Zhang, De Cheng, Yinghui Xing, Nannan Wang, Peng Wang, Yanning Zhang

Figure 1 for Visual Prompt Tuning in Null Space for Continual Learning

Figure 2 for Visual Prompt Tuning in Null Space for Continual Learning

Figure 3 for Visual Prompt Tuning in Null Space for Continual Learning

Figure 4 for Visual Prompt Tuning in Null Space for Continual Learning

Abstract:Existing prompt-tuning methods have demonstrated impressive performances in continual learning (CL), by selecting and updating relevant prompts in the vision-transformer models. On the contrary, this paper aims to learn each task by tuning the prompts in the direction orthogonal to the subspace spanned by previous tasks' features, so as to ensure no interference on tasks that have been learned to overcome catastrophic forgetting in CL. However, different from the orthogonal projection in the traditional CNN architecture, the prompt gradient orthogonal projection in the ViT architecture shows completely different and greater challenges, i.e., 1) the high-order and non-linear self-attention operation; 2) the drift of prompt distribution brought by the LayerNorm in the transformer block. Theoretically, we have finally deduced two consistency conditions to achieve the prompt gradient orthogonal projection, which provide a theoretical guarantee of eliminating interference on previously learned knowledge via the self-attention mechanism in visual prompt tuning. In practice, an effective null-space-based approximation solution has been proposed to implement the prompt gradient orthogonal projection. Extensive experimental results demonstrate the effectiveness of anti-forgetting on four class-incremental benchmarks with diverse pre-trained baseline models, and our approach achieves superior performances to state-of-the-art methods. Our code is available at https://github.com/zugexiaodui/VPTinNSforCL.

* 20 pages, 10 figures

Via

Access Paper or Ask Questions

Multi-Granularity Language-Guided Multi-Object Tracking

Jun 07, 2024

Yuhao Li, Muzammal Naseer, Jiale Cao, Yu Zhu, Jinqiu Sun, Yanning Zhang, Fahad Shahbaz Khan

Figure 1 for Multi-Granularity Language-Guided Multi-Object Tracking

Figure 2 for Multi-Granularity Language-Guided Multi-Object Tracking

Figure 3 for Multi-Granularity Language-Guided Multi-Object Tracking

Figure 4 for Multi-Granularity Language-Guided Multi-Object Tracking

Abstract:Most existing multi-object tracking methods typically learn visual tracking features via maximizing dis-similarities of different instances and minimizing similarities of the same instance. While such a feature learning scheme achieves promising performance, learning discriminative features solely based on visual information is challenging especially in case of environmental interference such as occlusion, blur and domain variance. In this work, we argue that multi-modal language-driven features provide complementary information to classical visual features, thereby aiding in improving the robustness to such environmental interference. To this end, we propose a new multi-object tracking framework, named LG-MOT, that explicitly leverages language information at different levels of granularity (scene-and instance-level) and combines it with standard visual features to obtain discriminative representations. To develop LG-MOT, we annotate existing MOT datasets with scene-and instance-level language descriptions. We then encode both instance-and scene-level language information into high-dimensional embeddings, which are utilized to guide the visual features during training. At inference, our LG-MOT uses the standard visual features without relying on annotated language descriptions. Extensive experiments on three benchmarks, MOT17, DanceTrack and SportsMOT, reveal the merits of the proposed contributions leading to state-of-the-art performance. On the DanceTrack test set, our LG-MOT achieves an absolute gain of 2.2\% in terms of target object association (IDF1 score), compared to the baseline using only visual features. Further, our LG-MOT exhibits strong cross-domain generalizability. The dataset and code will be available at ~\url{https://github.com/WesLee88524/LG-MOT}.

Via

Access Paper or Ask Questions

RGB-T Object Detection via Group Shuffled Multi-receptive Attention and Multi-modal Supervision

May 29, 2024

Jinzhong Wang, Xuetao Tian, Shun Dai, Tao Zhuo, Haorui Zeng, Hongjuan Liu, Jiaqi Liu, Xiuwei Zhang, Yanning Zhang

Figure 1 for RGB-T Object Detection via Group Shuffled Multi-receptive Attention and Multi-modal Supervision

Figure 2 for RGB-T Object Detection via Group Shuffled Multi-receptive Attention and Multi-modal Supervision

Figure 3 for RGB-T Object Detection via Group Shuffled Multi-receptive Attention and Multi-modal Supervision

Figure 4 for RGB-T Object Detection via Group Shuffled Multi-receptive Attention and Multi-modal Supervision

Abstract:Multispectral object detection, utilizing both visible (RGB) and thermal infrared (T) modals, has garnered significant attention for its robust performance across diverse weather and lighting conditions. However, effectively exploiting the complementarity between RGB-T modals while maintaining efficiency remains a critical challenge. In this paper, a very simple Group Shuffled Multi-receptive Attention (GSMA) module is proposed to extract and combine multi-scale RGB and thermal features. Then, the extracted multi-modal features are directly integrated with a multi-level path aggregation neck, which significantly improves the fusion effect and efficiency. Meanwhile, multi-modal object detection often adopts union annotations for both modals. This kind of supervision is not sufficient and unfair, since objects observed in one modal may not be seen in the other modal. To solve this issue, Multi-modal Supervision (MS) is proposed to sufficiently supervise RGB-T object detection. Comprehensive experiments on two challenging benchmarks, KAIST and DroneVehicle, demonstrate the proposed model achieves the state-of-the-art accuracy while maintaining competitive efficiency.

Via

Access Paper or Ask Questions

C3L: Content Correlated Vision-Language Instruction Tuning Data Generation via Contrastive Learning

May 21, 2024

Ji Ma, Wei Suo, Peng Wang, Yanning Zhang

Abstract:Vision-Language Instruction Tuning (VLIT) is a critical training phase for Large Vision-Language Models (LVLMs). With the improving capabilities of open-source LVLMs, researchers have increasingly turned to generate VLIT data by using open-source LVLMs and achieved significant progress. However, such data generation approaches are bottlenecked by the following challenges: 1) Since multi-modal models tend to be influenced by prior language knowledge, directly using LVLMs to generate VLIT data would inevitably lead to low content relevance between generated data and images. 2) To improve the ability of the models to generate VLIT data, previous methods have incorporated an additional training phase to boost the generative capacity. This process hurts the generalization of the models to unseen inputs (i.e., "exposure bias" problem). In this paper, we propose a new Content Correlated VLIT data generation via Contrastive Learning (C3L). Specifically, we design a new content relevance module which enhances the content relevance between VLIT data and images by computing Image Instruction Correspondence Scores S(I2C). Moreover, a contrastive learning module is introduced to further boost the VLIT data generation capability of the LVLMs. A large number of automatic measures on four benchmarks show the effectiveness of our method.

* Accepted by IJCAI-24

Via

Access Paper or Ask Questions

Dual-Modal Prompting for Sketch-Based Image Retrieval

Apr 29, 2024

Liying Gao, Bingliang Jiao, Peng Wang, Shizhou Zhang, Hanwang Zhang, Yanning Zhang

Figure 1 for Dual-Modal Prompting for Sketch-Based Image Retrieval

Figure 2 for Dual-Modal Prompting for Sketch-Based Image Retrieval

Figure 3 for Dual-Modal Prompting for Sketch-Based Image Retrieval

Figure 4 for Dual-Modal Prompting for Sketch-Based Image Retrieval

Abstract:Sketch-based image retrieval (SBIR) associates hand-drawn sketches with their corresponding realistic images. In this study, we aim to tackle two major challenges of this task simultaneously: i) zero-shot, dealing with unseen categories, and ii) fine-grained, referring to intra-category instance-level retrieval. Our key innovation lies in the realization that solely addressing this cross-category and fine-grained recognition task from the generalization perspective may be inadequate since the knowledge accumulated from limited seen categories might not be fully valuable or transferable to unseen target categories. Inspired by this, in this work, we propose a dual-modal prompting CLIP (DP-CLIP) network, in which an adaptive prompting strategy is designed. Specifically, to facilitate the adaptation of our DP-CLIP toward unpredictable target categories, we employ a set of images within the target category and the textual category label to respectively construct a set of category-adaptive prompt tokens and channel scales. By integrating the generated guidance, DP-CLIP could gain valuable category-centric insights, efficiently adapting to novel categories and capturing unique discriminative clues for effective retrieval within each target category. With these designs, our DP-CLIP outperforms the state-of-the-art fine-grained zero-shot SBIR method by 7.3% in Acc.@1 on the Sketchy dataset. Meanwhile, in the other two category-level zero-shot SBIR benchmarks, our method also achieves promising performance.

Via

Access Paper or Ask Questions