Zhiwen Cao

E^2VPT: An Effective and Efficient Approach for Visual Prompt Tuning

Jul 25, 2023
Cheng Han, Qifan Wang, Yiming Cui, Zhiwen Cao, Wenguan Wang, Siyuan Qi, Dongfang Liu

As transformer-based models continue to grow in size, fine-tuning these large-scale pretrained vision models for new tasks has become increasingly parameter-intensive. Parameter-efficient learning has been developed to reduce the number of tunable parameters during fine-tuning. Although these methods show promising results, a significant performance gap remains compared to full fine-tuning. To address this challenge, we propose an Effective and Efficient Visual Prompt Tuning (E^2VPT) approach for large-scale transformer-based model adaptation. Specifically, we introduce a set of learnable key-value prompts and visual prompts into the self-attention and input layers, respectively, to improve the effectiveness of model fine-tuning. Moreover, we design a prompt pruning procedure that systematically prunes low-importance prompts while preserving model performance, which substantially improves the model's efficiency. Empirical results demonstrate that our approach outperforms several state-of-the-art baselines on two benchmarks with considerably lower parameter usage (e.g., 0.32% of model parameters on VTAB-1k). Our code is available at https://github.com/ChengHan111/E2VPT.

* 12 pages, 4 figures 
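
For intuition, here is a minimal PyTorch sketch of the key-value prompt idea the abstract describes: learnable key and value prompts are concatenated into a simplified, single-head self-attention layer, so that only the prompts need updating while the backbone stays frozen. The class name and the `n_kv_prompts` parameter are illustrative assumptions, not the paper's implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

class KVPromptAttention(nn.Module):
    def __init__(self, dim, n_kv_prompts=4):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)   # stands in for a frozen backbone projection
        # Learnable key/value prompts, concatenated to K and V at every forward
        # pass; during tuning only these (and the input prompts) would be updated.
        self.k_prompt = nn.Parameter(torch.zeros(1, n_kv_prompts, dim))
        self.v_prompt = nn.Parameter(torch.zeros(1, n_kv_prompts, dim))
        self.scale = dim ** -0.5

    def forward(self, x):                    # x: (batch, tokens, dim)
        B = x.shape[0]
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        k = torch.cat([self.k_prompt.expand(B, -1, -1), k], dim=1)
        v = torch.cat([self.v_prompt.expand(B, -1, -1), v], dim=1)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        return attn.softmax(dim=-1) @ v      # output keeps the input token count

x = torch.randn(2, 16, 64)
print(KVPromptAttention(64)(x).shape)        # torch.Size([2, 16, 64])
```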

Towards Unbiased Label Distribution Learning for Facial Pose Estimation Using Anisotropic Spherical Gaussian

Aug 19, 2022
Zhiwen Cao, Dongfang Liu, Qifan Wang, Yingjie Chen

Facial pose estimation refers to the task of predicting face orientation from a single RGB image. It is an important research topic with a wide range of applications in computer vision. Label distribution learning (LDL) based methods have recently been proposed for facial pose estimation and achieve promising results. However, existing LDL methods have two major issues. First, the expectations of the label distributions are biased, leading to biased pose estimates. Second, fixed distribution parameters are applied to all learning samples, severely limiting model capacity. In this paper, we propose an Anisotropic Spherical Gaussian (ASG)-based LDL approach for facial pose estimation. In particular, our approach adopts a spherical Gaussian distribution on the unit sphere, which consistently yields an unbiased expectation. Meanwhile, we introduce a new loss function that allows the network to flexibly learn the distribution parameters for each learning sample. Extensive experimental results show that our method sets new state-of-the-art records on the AFLW2000 and BIWI datasets.

* Proceedings of the European Conference on Computer Vision (ECCV 2022) 
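
To illustrate the distribution the abstract refers to, the sketch below evaluates an anisotropic spherical Gaussian over sampled unit vectors and normalizes the result into a discrete label distribution. It uses the standard ASG form ASG(v) = max(v·z, 0) · exp(−λ(v·x)² − μ(v·y)²); the frame and bandwidth values are illustrative, not the paper's learned parameters.

```python
import torch
import torch.nn.functional as F

def asg(v, x, y, z, lam, mu):
    """ASG(v) = max(v.z, 0) * exp(-lam * (v.x)^2 - mu * (v.y)^2),
    where (x, y, z) is an orthonormal frame and z is the lobe axis."""
    return torch.clamp(v @ z, min=0) * torch.exp(-lam * (v @ x) ** 2 - mu * (v @ y) ** 2)

# Evaluate on sampled unit vectors and normalize into a discrete label distribution.
v = F.normalize(torch.randn(1000, 3), dim=-1)   # points on the unit sphere
x, y, z = torch.eye(3)                          # illustrative orthonormal frame
p = asg(v, x, y, z, lam=20.0, mu=5.0)
p = p / p.sum()                                 # sums to 1, like a label distribution
print(p.shape)                                  # torch.Size([1000])
```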

Physical Attack on Monocular Depth Estimation with Optimal Adversarial Patches

Jul 11, 2022
Zhiyuan Cheng, James Liang, Hongjun Choi, Guanhong Tao, Zhiwen Cao, Dongfang Liu, Xiangyu Zhang

Deep learning has substantially boosted the performance of Monocular Depth Estimation (MDE), a critical component in fully vision-based autonomous driving (AD) systems (e.g., Tesla and Toyota). In this work, we develop an attack against learning-based MDE. In particular, we use an optimization-based method to systematically generate stealthy physical-object-oriented adversarial patches that attack depth estimation. We balance the stealth and effectiveness of our attack with object-oriented adversarial design, sensitive region localization, and natural style camouflage. Using real-world driving scenarios, we evaluate our attack on concurrent MDE models and a representative downstream task for AD (i.e., 3D object detection). Experimental results show that our method can generate stealthy, effective, and robust adversarial patches for different target objects and models, achieving a mean depth estimation error of more than 6 meters and a 93% attack success rate (ASR) in object detection, with a patch covering 1/9 of the vehicle's rear area. Field tests on three different driving routes with a real vehicle show that we cause a mean depth estimation error of over 6 meters and reduce the object detection rate from 90.70% to 5.16% in continuous video frames.

* ECCV 2022 
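
As a rough illustration of the optimization-based patch generation described above, the following sketch gradient-optimizes a patch pasted onto a fixed image region so as to inflate a depth model's predictions there. The stand-in model, the placement, and the loss are simplified assumptions; the paper's actual attack additionally handles stealth, sensitive region localization, and style camouflage.

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained monocular depth network (kept frozen).
depth_model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                            nn.Conv2d(8, 1, 3, padding=1))
for p in depth_model.parameters():
    p.requires_grad_(False)

img = torch.rand(1, 3, 64, 64)                        # scene image
patch = torch.rand(1, 3, 16, 16, requires_grad=True)  # adversarial patch
opt = torch.optim.Adam([patch], lr=0.01)
rows, cols = slice(24, 40), slice(24, 40)             # target-object region

for _ in range(100):
    adv = img.clone()
    adv[..., rows, cols] = patch.clamp(0, 1)          # paste the patch onto the object
    depth = depth_model(adv)
    loss = -depth[..., rows, cols].mean()             # inflate predicted depth there
    opt.zero_grad()
    loss.backward()
    opt.step()
```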

DG-Labeler and DGL-MOTS Dataset: Boost the Autonomous Driving Perception

Oct 15, 2021
Yiming Cui, Zhiwen Cao, Yixin Xie, Xingyu Jiang, Feng Tao, Yingjie Chen, Lin Li, Dongfang Liu

Multi-object tracking and segmentation (MOTS) is a critical task for autonomous driving applications. Existing MOTS studies face two critical challenges: 1) published datasets inadequately capture the real-world complexity needed to train networks for diverse driving settings; 2) annotation tools and working pipelines for improving the quality of MOTS learning examples are under-studied in the literature. In this work, we introduce the DG-Labeler annotation tool and the DGL-MOTS dataset to facilitate training-data annotation for the MOTS task and thereby improve network training accuracy and efficiency. DG-Labeler uses a novel Depth-Granularity Module to model instance spatial relations and produce fine-grained instance masks. Annotated by DG-Labeler, our DGL-MOTS dataset exceeds prior efforts (i.e., KITTI MOTS and BDD100K) in data diversity, annotation quality, and temporal representation. Extensive cross-dataset evaluations indicate significant performance improvements for several state-of-the-art methods trained on our DGL-MOTS dataset. We believe our DGL-MOTS dataset and DG-Labeler hold valuable potential to boost visual perception for future transportation.
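
For concreteness, a MOTS-style annotation couples a per-frame instance mask with a persistent track identity and a class label; the sketch below shows one plausible record layout. The field names and shapes are assumptions for illustration, not the actual DGL-MOTS schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MOTSAnnotation:
    frame_id: int
    track_id: int        # identity preserved across frames
    class_id: int        # e.g., 1 = car, 2 = pedestrian
    mask: np.ndarray     # binary instance mask of shape (H, W)

ann = MOTSAnnotation(frame_id=0, track_id=1001, class_id=1,
                     mask=np.zeros((375, 1242), dtype=bool))
print(ann.track_id, ann.mask.shape)  # 1001 (375, 1242)
```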

TF-Blender: Temporal Feature Blender for Video Object Detection

Aug 12, 2021
Yiming Cui, Liqi Yan, Zhiwen Cao, Dongfang Liu

Video object detection is a challenging task because isolated video frames may suffer appearance deterioration, which introduces great confusion for detection. One popular solution is to exploit temporal information and enhance per-frame representations by aggregating features from neighboring frames. Despite achieving improvements in detection, existing methods focus on the selection of higher-level video frames for aggregation rather than modeling lower-level temporal relations to strengthen the feature representation. To address this limitation, we propose a novel solution named TF-Blender, which includes three modules: 1) Temporal relation models the relations between the current frame and its neighboring frames to preserve spatial information; 2) Feature adjustment enriches the representation of every neighboring feature map; 3) Feature blender combines the outputs of the first two modules and produces stronger features for the downstream detection tasks. Owing to its simplicity, TF-Blender can be effortlessly plugged into any detection network to improve detection performance. Extensive evaluations on the ImageNet VID and YouTube-VIS benchmarks indicate the performance gains of using TF-Blender with recent state-of-the-art methods.
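
The following minimal PyTorch sketch mirrors the three-module structure described above: a temporal-relation step weights neighboring frames by similarity to the current frame, a feature-adjustment convolution refines each neighbor, and a blender fuses the aggregate with the current frame. The module shapes and the similarity/blending rules are simplified assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TFBlenderSketch(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.adjust = nn.Conv2d(dim, dim, 3, padding=1)  # feature adjustment
        self.blend = nn.Conv2d(2 * dim, dim, 1)          # feature blender

    def forward(self, cur, neighbors):  # cur: (B,C,H,W); neighbors: (B,T,C,H,W)
        B, T, C, H, W = neighbors.shape
        # Temporal relation: cosine similarity between current and each neighbor.
        w = F.cosine_similarity(cur.flatten(1).unsqueeze(1),
                                neighbors.flatten(2), dim=-1)       # (B, T)
        w = w.softmax(dim=1).view(B, T, 1, 1, 1)
        adjusted = self.adjust(neighbors.view(B * T, C, H, W)).view(B, T, C, H, W)
        agg = (w * adjusted).sum(dim=1)                  # weighted aggregation
        return self.blend(torch.cat([cur, agg], dim=1))  # fuse with current frame

m = TFBlenderSketch(16)
out = m(torch.randn(2, 16, 8, 8), torch.randn(2, 4, 16, 8, 8))
print(out.shape)  # torch.Size([2, 16, 8, 8])
```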

A Vector-based Representation to Enhance Head Pose Estimation

Oct 14, 2020
Zhiwen Cao, Zongcheng Chu, Dongfang Liu, Yingjie Chen

This paper proposes to use the three vectors of a rotation matrix as the representation for head pose estimation and develops a new neural network based on the characteristics of this representation. We address two potential issues in current head pose estimation work: 1. Public datasets for head pose estimation annotate data samples with either Euler angles or quaternions. However, both of these annotations suffer from discontinuity and thus can cause performance issues in neural network training. 2. Most research reports the Mean Absolute Error (MAE) of Euler angles as the measure of performance. We show that MAE may not reflect the actual behavior, especially for profile views. To solve these two problems, we propose a new annotation method that uses three vectors to describe head poses and a new measurement, Mean Absolute Error of Vectors (MAEV), to assess performance. We also train a new neural network to predict the three vectors under orthogonality constraints. Our proposed method achieves state-of-the-art results on both the AFLW2000 and BIWI datasets. Experiments show that our vector-based annotation method can effectively reduce prediction errors for large pose angles.

* 10 pages, 8 figures 
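
To make the representation and metric concrete, the sketch below builds a rotation matrix from Euler angles, treats its three column vectors as the pose representation, and computes a MAEV-style score as the mean angle between corresponding predicted and ground-truth vectors. The ZYX Euler convention used here is an illustrative assumption.

```python
import numpy as np

def euler_to_matrix(yaw, pitch, roll):
    """Rotation matrix from Euler angles (radians), ZYX convention."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    return Rz @ Ry @ Rx

def maev(R_pred, R_true):
    # Mean angle (degrees) between corresponding unit column vectors.
    cos = np.clip((R_pred * R_true).sum(axis=0), -1.0, 1.0)
    return np.degrees(np.arccos(cos)).mean()

R1 = euler_to_matrix(0.50, 0.10, 0.00)   # ground-truth pose
R2 = euler_to_matrix(0.52, 0.12, 0.01)   # predicted pose
print(f"MAEV: {maev(R2, R1):.2f} deg")
```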