Cheng Han

E^2VPT: An Effective and Efficient Approach for Visual Prompt Tuning

Jul 25, 2023
Cheng Han, Qifan Wang, Yiming Cui, Zhiwen Cao, Wenguan Wang, Siyuan Qi, Dongfang Liu

As the size of transformer-based models continues to grow, fine-tuning these large-scale pretrained vision models for new tasks has become increasingly parameter-intensive. Parameter-efficient learning has been developed to reduce the number of tunable parameters during fine-tuning. Although these methods show promising results, there is still a significant performance gap compared to full fine-tuning. To address this challenge, we propose an Effective and Efficient Visual Prompt Tuning (E^2VPT) approach for large-scale transformer-based model adaptation. Specifically, we introduce a set of learnable key-value prompts and visual prompts into self-attention and input layers, respectively, to improve the effectiveness of model fine-tuning. Moreover, we design a prompt pruning procedure to systematically prune low-importance prompts while preserving model performance, which largely enhances the model's efficiency. Empirical results demonstrate that our approach outperforms several state-of-the-art baselines on two benchmarks while using considerably fewer parameters (e.g., 0.32% of model parameters on VTAB-1k). Our code is available at https://github.com/ChengHan111/E2VPT.
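
The two ingredients named in the abstract can be pictured with a short sketch. The following is a minimal, illustrative reading of the idea (not the released E^2VPT code): visual prompts are prepended to the input token sequence, and small key/value prompts are concatenated inside self-attention while the pretrained weights stay frozen. Hyperparameters such as `kv_prompt_len` and `num_prompts` are assumptions, and the prompt-pruning step is omitted.

```python
import torch
import torch.nn as nn

class PromptedSelfAttention(nn.Module):
    """Frozen self-attention augmented with a few learnable key/value prompt tokens."""
    def __init__(self, dim, num_heads=12, kv_prompt_len=5):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)    # pretrained weights, kept frozen
        self.proj = nn.Linear(dim, dim)       # pretrained weights, kept frozen
        # the only trainable parameters in this block: key/value prompts
        self.k_prompt = nn.Parameter(torch.zeros(num_heads, kv_prompt_len, self.head_dim))
        self.v_prompt = nn.Parameter(torch.zeros(num_heads, kv_prompt_len, self.head_dim))

    def forward(self, x):                      # x: (B, N, dim)
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]       # each (B, heads, N, head_dim)
        # append the learnable prompts to the keys and values only
        k = torch.cat([self.k_prompt.expand(B, -1, -1, -1), k], dim=2)
        v = torch.cat([self.v_prompt.expand(B, -1, -1, -1), v], dim=2)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

class VisualPromptEmbedding(nn.Module):
    """Prepends learnable visual prompt tokens to the patch-token sequence."""
    def __init__(self, dim, num_prompts=10):
        super().__init__()
        self.prompts = nn.Parameter(torch.zeros(1, num_prompts, dim))

    def forward(self, tokens):                 # tokens: (B, N, dim)
        return torch.cat([self.prompts.expand(tokens.size(0), -1, -1), tokens], dim=1)
```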

* 12 pages, 4 figures 

Visual Recognition with Deep Nearest Centroids

Sep 15, 2022
Wenguan Wang, Cheng Han, Tianfei Zhou, Dongfang Liu

We devise deep nearest centroids (DNC), a conceptually elegant yet surprisingly effective network for large-scale visual recognition, by revisiting Nearest Centroids, one of the most classic and simple classifiers. Current deep models learn the classifier in a fully parametric manner, ignoring the latent data structure and lacking simplicity and explainability. DNC instead conducts nonparametric, case-based reasoning; it utilizes sub-centroids of training samples to describe class distributions and clearly explains the classification as the proximity of test data and the class sub-centroids in the feature space. Due to the distance-based nature, the network output dimensionality is flexible, and all the learnable parameters are only for data embedding. That means all the knowledge learnt for ImageNet classification can be completely transferred to pixel recognition learning, under the "pre-training and fine-tuning" paradigm. Apart from its nested simplicity and intuitive decision-making mechanism, DNC can even possess ad-hoc explainability when the sub-centroids are selected as actual training images that humans can view and inspect. Compared with parametric counterparts, DNC performs better on image classification (CIFAR-10, ImageNet) and greatly boosts pixel recognition (ADE20K, Cityscapes), with improved transparency and fewer learnable parameters, using various network architectures (ResNet, Swin) and segmentation models (FCN, DeepLabV3, Swin). We feel this work brings fundamental insights into related fields.
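
As a rough illustration of the distance-based decision rule described above (a sketch based on the abstract, not the authors' implementation), classification reduces to finding the nearest class sub-centroid in the embedding space; the tensor shapes and the choice of squared Euclidean distance are assumptions.

```python
import torch

def dnc_classify(features, sub_centroids):
    """
    features:      (B, D) embeddings of test samples from the (frozen or trained) backbone
    sub_centroids: (C, K, D) K sub-centroids per class, e.g. obtained by clustering
                   the training embeddings of each class
    returns:       (B,) predicted class indices
    """
    B, (C, K, D) = features.size(0), sub_centroids.shape
    # distance from every sample to every sub-centroid
    dists = torch.cdist(features, sub_centroids.reshape(C * K, D))  # (B, C*K)
    dists = dists.view(B, C, K)
    # a class's score is the distance to its nearest sub-centroid
    class_dists, _ = dists.min(dim=-1)                              # (B, C)
    return class_dists.argmin(dim=-1)
```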

* 23 pages, 8 figures 

YOLOPv2: Better, Faster, Stronger for Panoptic Driving Perception

Aug 24, 2022
Cheng Han, Qichao Zhao, Shuyi Zhang, Yinzi Chen, Zhenlin Zhang, Jinwei Yuan

Over the last decade, multi-task learning approaches have achieved promising results in panoptic driving perception, delivering both high-precision and high-efficiency performance. This paradigm has become popular when designing networks for real-time autonomous driving systems, where computation resources are limited. This paper proposes an effective and efficient multi-task learning network that simultaneously performs traffic object detection, drivable road area segmentation, and lane detection. Our model achieves new state-of-the-art (SOTA) performance in terms of accuracy and speed on the challenging BDD100K dataset; in particular, the inference time is reduced by half compared to the previous SOTA model. Code will be released in the near future.
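
For intuition, the shared-encoder, three-head layout the abstract describes might look like the minimal sketch below; the module names, head designs, and output shapes are illustrative placeholders, not the actual YOLOPv2 architecture.

```python
import torch.nn as nn

class MultiTaskDrivingNet(nn.Module):
    """One shared feature extractor feeding three task-specific heads."""
    def __init__(self, backbone: nn.Module, feat_dim: int, num_det_outputs: int):
        super().__init__()
        self.backbone = backbone                                    # shared encoder
        self.det_head = nn.Conv2d(feat_dim, num_det_outputs, 1)     # traffic object detection
        self.area_head = nn.Conv2d(feat_dim, 2, 1)                  # drivable-area segmentation
        self.lane_head = nn.Conv2d(feat_dim, 2, 1)                  # lane detection

    def forward(self, images):
        feats = self.backbone(images)                               # (B, feat_dim, H', W')
        return {
            "detection": self.det_head(feats),
            "drivable_area": self.area_head(feats),
            "lane": self.lane_head(feats),
        }
```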
