Dachuan Shi

CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers

May 27, 2023
Dachuan Shi, Chaofan Tao, Anyi Rao, Zhendong Yang, Chun Yuan, Jiaqi Wang

Vision-language models have achieved tremendous progress, far beyond what we ever expected. However, their computational costs and latency are also growing rapidly, making model acceleration critical for researchers with limited resources and consumers with low-end devices. Although extensively studied for unimodal models, acceleration for multimodal models, especially vision-language Transformers, remains relatively under-explored. Accordingly, this paper proposes Cross-Guided Ensemble of Tokens (CrossGET), a universal vision-language Transformer acceleration framework that adaptively reduces the number of tokens during inference via cross-modal guidance on the fly, leading to significant model acceleration while maintaining high performance. Specifically, CrossGET has two key designs: 1) Cross-Guided Matching and Ensemble. CrossGET incorporates cross-modal guided token matching and ensemble to merge tokens effectively, introducing only cross-modal tokens with negligible extra parameters. 2) Complete-Graph Soft Matching. In contrast to the previous bipartite soft matching approach, CrossGET introduces an efficient and effective complete-graph soft matching policy to achieve more reliable token-matching results. Extensive experiments on various vision-language tasks, datasets, and model architectures demonstrate the effectiveness and versatility of the proposed CrossGET framework. The code will be available at https://github.com/sdc17/CrossGET.
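
As a rough illustration of the token-reduction idea, the sketch below (PyTorch, not taken from the paper) merges token pairs by pairwise similarity under guidance from a cross-modal token. The shapes, the importance heuristic, and the simplified merge loop are assumptions for illustration, not the paper's exact complete-graph soft matching procedure.

```python
# Minimal, assumed sketch of cross-guided token merging (not the official CrossGET code).
import torch
import torch.nn.functional as F

def merge_tokens(tokens: torch.Tensor, cross_token: torch.Tensor, r: int) -> torch.Tensor:
    """Merge r token pairs guided by a cross-modal token.

    tokens:      (N, D) intra-modal tokens
    cross_token: (D,)   cross-modal guidance token
    r:           number of pairs to merge (assumed r < N)
    """
    t = F.normalize(tokens, dim=-1)
    sim = t @ t.t()                               # complete-graph pairwise similarity
    sim.fill_diagonal_(-float("inf"))

    # Cross-modal guidance: tokens that align less with the other modality are
    # preferred as merge targets (an assumed heuristic, not the paper's exact rule).
    importance = t @ F.normalize(cross_token, dim=-1)   # (N,)

    merged = tokens.clone()
    alive = torch.ones(len(tokens), dtype=torch.bool)
    for _ in range(r):
        # Similarities are not recomputed after each merge, a simplification.
        score = sim - importance[None, :]         # low-importance targets merge first
        score[~alive] = -float("inf")
        score[:, ~alive] = -float("inf")
        i, j = divmod(int(score.argmax()), len(tokens))
        merged[i] = (merged[i] + merged[j]) / 2   # ensemble the matched pair
        alive[j] = False
    return merged[alive]
```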

* Preprint 

UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers

Jan 31, 2023
Dachuan Shi, Chaofan Tao, Ying Jin, Zhendong Yang, Chun Yuan, Jiaqi Wang

Real-world data contains a vast amount of multimodal information, among which vision and language are the two most representative modalities. Moreover, increasingly heavy models, e.g., Transformers, have drawn researchers' attention to model compression. However, how to compress multimodal models, especially vision-language Transformers, is still under-explored. This paper proposes Unified and Progressive Pruning (UPop), a universal vision-language Transformer compression framework that incorporates 1) unified search for multimodal subnets in a continuous optimization space derived from the original model, which enables automatic assignment of pruning ratios among compressible modalities and structures; and 2) progressive search and retraining of the subnet, which maintains convergence between search and retraining to attain higher compression ratios. Experiments on multiple generative and discriminative vision-language tasks, including Visual Reasoning, Image Captioning, Visual Question Answering, Image-Text Retrieval, Text-Image Retrieval, and Image Classification, demonstrate the effectiveness and versatility of the proposed UPop framework.
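
To make the search-and-prune idea more concrete, the following PyTorch-style sketch is an assumed illustration, not the official UPop implementation: continuous gates over prunable structures in either modality, a linearly tightened pruning-ratio schedule, and a sparsity penalty that can be added to the task loss during the progressive search.

```python
# Illustrative sketch under assumed design choices (not the official UPop code).
import torch
import torch.nn as nn

class GatedStructure(nn.Module):
    """Wraps a prunable sub-layer (e.g., an attention head or FFN channel group)
    with a continuous gate that can be driven toward zero during the search."""
    def __init__(self, module: nn.Module):
        super().__init__()
        self.module = module
        self.gate = nn.Parameter(torch.ones(1))  # searched jointly with the weights

    def forward(self, x):
        return self.gate * self.module(x)

def progressive_target(step: int, total_steps: int, final_ratio: float) -> float:
    """Pruning ratio grows gradually from 0 to final_ratio (assumed linear schedule)."""
    return final_ratio * min(step / total_steps, 1.0)

def sparsity_loss(gates: list, target_ratio: float) -> torch.Tensor:
    """Penalize deviation of the average gate magnitude from the kept ratio,
    letting the optimizer assign pruning across modalities and structures."""
    kept = torch.stack([g.abs().mean() for g in gates]).mean()
    return (kept - (1.0 - target_ratio)) ** 2
```

A training step would then combine something like task_loss + lam * sparsity_loss(gates, progressive_target(step, total_steps, final_ratio)), with lam a hypothetical weighting coefficient.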

* 16 pages, 5 figures, 13 tables 

Masked Generative Distillation

May 03, 2022
Zhendong Yang, Zhe Li, Mingqi Shao, Dachuan Shi, Zehuan Yuan, Chun Yuan

Knowledge distillation has been applied successfully to various tasks. Current distillation algorithms usually improve students' performance by having them imitate the teacher's output. This paper shows that teachers can also improve students' representation power by guiding the students' feature recovery. From this point of view, we propose Masked Generative Distillation (MGD), which is simple: we mask random pixels of the student's feature and force it to generate the teacher's full feature through a simple block. MGD is a truly general feature-based distillation method that can be utilized on various tasks, including image classification, object detection, semantic segmentation, and instance segmentation. We experiment with different models on extensive datasets, and the results show that all the students achieve considerable improvements. Notably, we boost ResNet-18 from 69.90% to 71.69% ImageNet top-1 accuracy, RetinaNet with a ResNet-50 backbone from 37.4 to 41.0 bounding-box mAP, SOLO based on ResNet-50 from 33.1 to 36.2 mask mAP, and DeepLabV3 based on ResNet-18 from 73.20 to 76.02 mIoU. Our code is available at https://github.com/yzd-v/MGD.
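
The following PyTorch sketch illustrates the mask-and-recover idea described above; the 3x3 convolutional generation block, the masking ratio, and the MSE loss weight are assumptions for illustration rather than the authors' exact implementation (see the linked repository for that).

```python
# Minimal sketch of the masked generative distillation idea (assumed details).
import torch
import torch.nn as nn

class MGDLoss(nn.Module):
    def __init__(self, channels: int, mask_ratio: float = 0.5, weight: float = 1.0):
        super().__init__()
        # "Simple block" that regenerates the teacher feature from the masked student feature.
        self.generation = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        self.mask_ratio = mask_ratio
        self.weight = weight

    def forward(self, student_feat: torch.Tensor, teacher_feat: torch.Tensor) -> torch.Tensor:
        # Randomly mask spatial positions of the student feature map (N, C, H, W).
        n, _, h, w = student_feat.shape
        mask = (torch.rand(n, 1, h, w, device=student_feat.device) > self.mask_ratio).float()
        recovered = self.generation(student_feat * mask)
        # Force the student to recover the teacher's full feature.
        return self.weight * nn.functional.mse_loss(recovered, teacher_feat)
```

During training, this distillation term would simply be added to the student's task loss.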

* 5 figures, 8 tables 

Deep Learning based Virtual Point Tracking for Real-Time Target-less Dynamic Displacement Measurement in Railway Applications

Jan 20, 2021
Dachuan Shi, Eldar Sabanovic, Luca Rizzetto, Viktor Skrickij, Roberto Oliverio, Nadia Kaviani, Yunguang Ye, Gintautas Bureika, Stefano Ricci, Markus Hecht

In computer-vision based displacement measurement, an optical target is usually required to provide the reference. When an optical target cannot be attached to the object being measured, edge detection, feature matching, and template matching are the most common approaches in target-less photogrammetry. However, their performance relies heavily on parameter settings, which becomes problematic in dynamic scenes where complicated background texture exists and varies over time. To tackle this issue, we propose virtual point tracking for real-time target-less dynamic displacement measurement, incorporating deep learning techniques and domain knowledge. Our approach consists of three steps: 1) automatic calibration for detection of the region of interest; 2) virtual point detection for each video frame using a deep convolutional neural network; and 3) a domain-knowledge based rule engine for point tracking across adjacent frames. The proposed approach can be executed on an edge computer in real time (i.e., over 30 frames per second). We demonstrate our approach on a railway application, where the lateral displacement of the wheel on the rail is measured during operation. We also implement an algorithm using template matching and line detection as the baseline for comparison. Numerical experiments were performed to evaluate the performance and latency of our approach in the harsh railway environment with noisy and varying backgrounds.
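
The outline below sketches the three-step pipeline in Python; the calibration routine, the keypoint detector, and the plausibility rule are placeholders chosen for illustration, not the authors' implementation.

```python
# Schematic outline of the three-step pipeline (placeholder logic, not the authors' code).
import numpy as np

def calibrate_roi(first_frame: np.ndarray):
    """Step 1: automatic calibration; this stub simply returns the full frame as the ROI."""
    h, w = first_frame.shape[:2]
    return 0, 0, w, h

def detect_virtual_point(frame: np.ndarray, roi):
    """Step 2: per-frame virtual point detection. A trained CNN keypoint detector
    would run here; this stub returns the ROI center."""
    x, y, w, h = roi
    return x + w / 2.0, y + h / 2.0

def rule_engine(prev_pt, new_pt, max_jump: float = 20.0):
    """Step 3: domain-knowledge rule - reject physically implausible jumps
    between adjacent frames and fall back to the previous point."""
    if prev_pt is None:
        return new_pt
    if np.hypot(new_pt[0] - prev_pt[0], new_pt[1] - prev_pt[1]) > max_jump:
        return prev_pt
    return new_pt

def track(frames):
    """Run the pipeline over a frame stream; displacement is derived per frame
    from the tracked point."""
    roi, prev = None, None
    for frame in frames:
        roi = roi or calibrate_roi(frame)
        prev = rule_engine(prev, detect_virtual_point(frame, roi))
        yield prev
```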
