Alert button
Picture for Xinggang Wang

Xinggang Wang

Alert button

TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance

Sep 21, 2023
Kan Wu, Houwen Peng, Zhenghong Zhou, Bin Xiao, Mengchen Liu, Lu Yuan, Hong Xuan, Michael Valenzuela, Xi, Chen, Xinggang Wang, Hongyang Chao, Han Hu

In this paper, we propose a novel cross-modal distillation method, called TinyCLIP, for large-scale language-image pre-trained models. The method introduces two core techniques: affinity mimicking and weight inheritance. Affinity mimicking explores the interaction between modalities during distillation, enabling student models to mimic teachers' behavior of learning cross-modal feature alignment in a visual-linguistic affinity space. Weight inheritance transmits the pre-trained weights from the teacher models to their student counterparts to improve distillation efficiency. Moreover, we extend the method into a multi-stage progressive distillation to mitigate the loss of informative weights during extreme compression. Comprehensive experiments demonstrate the efficacy of TinyCLIP, showing that it can reduce the size of the pre-trained CLIP ViT-B/32 by 50%, while maintaining comparable zero-shot performance. While aiming for comparable performance, distillation with weight inheritance can speed up the training by 1.4 - 7.8 $\times$ compared to training from scratch. Moreover, our TinyCLIP ViT-8M/16, trained on YFCC-15M, achieves an impressive zero-shot top-1 accuracy of 41.1% on ImageNet, surpassing the original CLIP ViT-B/16 by 3.5% while utilizing only 8.9% parameters. Finally, we demonstrate the good transferability of TinyCLIP in various downstream tasks. Code and models will be open-sourced at https://aka.ms/tinyclip.

* Accepted By ICCV 2023 
Viaarxiv icon

TiAVox: Time-aware Attenuation Voxels for Sparse-view 4D DSA Reconstruction

Sep 05, 2023
Zhenghong Zhou, Huangxuan Zhao, Jiemin Fang, Dongqiao Xiang, Lei Chen, Lingxia Wu, Feihong Wu, Wenyu Liu, Chuansheng Zheng, Xinggang Wang

Four-dimensional Digital Subtraction Angiography (4D DSA) plays a critical role in the diagnosis of many medical diseases, such as Arteriovenous Malformations (AVM) and Arteriovenous Fistulas (AVF). Despite its significant application value, the reconstruction of 4D DSA demands numerous views to effectively model the intricate vessels and radiocontrast flow, thereby implying a significant radiation dose. To address this high radiation issue, we propose a Time-aware Attenuation Voxel (TiAVox) approach for sparse-view 4D DSA reconstruction, which paves the way for high-quality 4D imaging. Additionally, 2D and 3D DSA imaging results can be generated from the reconstructed 4D DSA images. TiAVox introduces 4D attenuation voxel grids, which reflect attenuation properties from both spatial and temporal dimensions. It is optimized by minimizing discrepancies between the rendered images and sparse 2D DSA images. Without any neural network involved, TiAVox enjoys specific physical interpretability. The parameters of each learnable voxel represent the attenuation coefficients. We validated the TiAVox approach on both clinical and simulated datasets, achieving a 31.23 Peak Signal-to-Noise Ratio (PSNR) for novel view synthesis using only 30 views on the clinically sourced dataset, whereas traditional Feldkamp-Davis-Kress methods required 133 views. Similarly, with merely 10 views from the synthetic dataset, TiAVox yielded a PSNR of 34.32 for novel view synthesis and 41.40 for 3D reconstruction. We also executed ablation studies to corroborate the essential components of TiAVox. The code will be publically available.

* 10 pages, 8 figures 
Viaarxiv icon

TouchStone: Evaluating Vision-Language Models by Language Models

Sep 04, 2023
Shuai Bai, Shusheng Yang, Jinze Bai, Peng Wang, Xingxuan Zhang, Junyang Lin, Xinggang Wang, Chang Zhou, Jingren Zhou

Large vision-language models (LVLMs) have recently witnessed rapid advancements, exhibiting a remarkable capacity for perceiving, understanding, and processing visual information by connecting visual receptor with large language models (LLMs). However, current assessments mainly focus on recognizing and reasoning abilities, lacking direct evaluation of conversational skills and neglecting visual storytelling abilities. In this paper, we propose an evaluation method that uses strong LLMs as judges to comprehensively evaluate the various abilities of LVLMs. Firstly, we construct a comprehensive visual dialogue dataset TouchStone, consisting of open-world images and questions, covering five major categories of abilities and 27 subtasks. This dataset not only covers fundamental recognition and comprehension but also extends to literary creation. Secondly, by integrating detailed image annotations we effectively transform the multimodal input content into a form understandable by LLMs. This enables us to employ advanced LLMs for directly evaluating the quality of the multimodal dialogue without requiring human intervention. Through validation, we demonstrate that powerful LVLMs, such as GPT-4, can effectively score dialogue quality by leveraging their textual capabilities alone, aligning with human preferences. We hope our work can serve as a touchstone for LVLMs' evaluation and pave the way for building stronger LVLMs. The evaluation code is available at https://github.com/OFA-Sys/TouchStone.

* https://github.com/OFA-Sys/TouchStone 
Viaarxiv icon

Condition-Adaptive Graph Convolution Learning for Skeleton-Based Gait Recognition

Aug 13, 2023
Xiaohu Huang, Xinggang Wang, Zhidianqiu Jin, Bo Yang, Botao He, Bin Feng, Wenyu Liu

Figure 1 for Condition-Adaptive Graph Convolution Learning for Skeleton-Based Gait Recognition
Figure 2 for Condition-Adaptive Graph Convolution Learning for Skeleton-Based Gait Recognition
Figure 3 for Condition-Adaptive Graph Convolution Learning for Skeleton-Based Gait Recognition
Figure 4 for Condition-Adaptive Graph Convolution Learning for Skeleton-Based Gait Recognition

Graph convolutional networks have been widely applied in skeleton-based gait recognition. A key challenge in this task is to distinguish the individual walking styles of different subjects across various views. Existing state-of-the-art methods employ uniform convolutions to extract features from diverse sequences and ignore the effects of viewpoint changes. To overcome these limitations, we propose a condition-adaptive graph (CAG) convolution network that can dynamically adapt to the specific attributes of each skeleton sequence and the corresponding view angle. In contrast to using fixed weights for all joints and sequences, we introduce a joint-specific filter learning (JSFL) module in the CAG method, which produces sequence-adaptive filters at the joint level. The adaptive filters capture fine-grained patterns that are unique to each joint, enabling the extraction of diverse spatial-temporal information about body parts. Additionally, we design a view-adaptive topology learning (VATL) module that generates adaptive graph topologies. These graph topologies are used to correlate the joints adaptively according to the specific view conditions. Thus, CAG can simultaneously adjust to various walking styles and viewpoints. Experiments on the two most widely used datasets (i.e., CASIA-B and OU-MVLP) show that CAG surpasses all previous skeleton-based methods. Moreover, the recognition performance can be enhanced by simply combining CAG with appearance-based methods, demonstrating the ability of CAG to provide useful complementary information.The source code will be available at https://github.com/OliverHxh/CAG.

* Accepted by TIP journal 
Viaarxiv icon

MapTRv2: An End-to-End Framework for Online Vectorized HD Map Construction

Aug 10, 2023
Bencheng Liao, Shaoyu Chen, Yunchi Zhang, Bo Jiang, Qian Zhang, Wenyu Liu, Chang Huang, Xinggang Wang

Figure 1 for MapTRv2: An End-to-End Framework for Online Vectorized HD Map Construction
Figure 2 for MapTRv2: An End-to-End Framework for Online Vectorized HD Map Construction
Figure 3 for MapTRv2: An End-to-End Framework for Online Vectorized HD Map Construction
Figure 4 for MapTRv2: An End-to-End Framework for Online Vectorized HD Map Construction

High-definition (HD) map provides abundant and precise static environmental information of the driving scene, serving as a fundamental and indispensable component for planning in autonomous driving system. In this paper, we present \textbf{Map} \textbf{TR}ansformer, an end-to-end framework for online vectorized HD map construction. We propose a unified permutation-equivalent modeling approach, \ie, modeling map element as a point set with a group of equivalent permutations, which accurately describes the shape of map element and stabilizes the learning process. We design a hierarchical query embedding scheme to flexibly encode structured map information and perform hierarchical bipartite matching for map element learning. To speed up convergence, we further introduce auxiliary one-to-many matching and dense supervision. The proposed method well copes with various map elements with arbitrary shapes. It runs at real-time inference speed and achieves state-of-the-art performance on both nuScenes and Argoverse2 datasets. Abundant qualitative results show stable and robust map construction quality in complex and various driving scenes. Code and more demos are available at \url{https://github.com/hustvl/MapTR} for facilitating further studies and applications.

* Code available at https://github.com/hustvl/MapTR . arXiv admin note: substantial text overlap with arXiv:2208.14437 
Viaarxiv icon

Symphonize 3D Semantic Scene Completion with Contextual Instance Queries

Jun 27, 2023
Haoyi Jiang, Tianheng Cheng, Naiyu Gao, Haoyang Zhang, Wenyu Liu, Xinggang Wang

Figure 1 for Symphonize 3D Semantic Scene Completion with Contextual Instance Queries
Figure 2 for Symphonize 3D Semantic Scene Completion with Contextual Instance Queries
Figure 3 for Symphonize 3D Semantic Scene Completion with Contextual Instance Queries
Figure 4 for Symphonize 3D Semantic Scene Completion with Contextual Instance Queries

3D Semantic Scene Completion (SSC) has emerged as a nascent and pivotal task for autonomous driving, as it involves predicting per-voxel occupancy within a 3D scene from partial LiDAR or image inputs. Existing methods primarily focus on the voxel-wise feature aggregation, while neglecting the instance-centric semantics and broader context. In this paper, we present a novel paradigm termed Symphonies (Scene-from-Insts) for SSC, which completes the scene volume from a sparse set of instance queries derived from the input with context awareness. By incorporating the queries as the instance feature representations within the scene, Symphonies dynamically encodes the instance-centric semantics to interact with the image and volume features while avoiding the dense voxel-wise modeling. Simultaneously, it orchestrates a more comprehensive understanding of the scenario by capturing context throughout the entire scene, contributing to alleviating the geometric ambiguity derived from occlusion and perspective errors. Symphonies achieves a state-of-the-art result of 13.02 mIoU on the challenging SemanticKITTI dataset, outperforming existing methods and showcasing the promising advancements of the paradigm. The code is available at \url{https://github.com/hustvl/Symphonies}.

* Technical report. Code and models at: https://github.com/hustvl/Symphonies 
Viaarxiv icon

ProRes: Exploring Degradation-aware Visual Prompt for Universal Image Restoration

Jun 23, 2023
Jiaqi Ma, Tianheng Cheng, Guoli Wang, Qian Zhang, Xinggang Wang, Lefei Zhang

Figure 1 for ProRes: Exploring Degradation-aware Visual Prompt for Universal Image Restoration
Figure 2 for ProRes: Exploring Degradation-aware Visual Prompt for Universal Image Restoration
Figure 3 for ProRes: Exploring Degradation-aware Visual Prompt for Universal Image Restoration
Figure 4 for ProRes: Exploring Degradation-aware Visual Prompt for Universal Image Restoration

Image restoration aims to reconstruct degraded images, e.g., denoising or deblurring. Existing works focus on designing task-specific methods and there are inadequate attempts at universal methods. However, simply unifying multiple tasks into one universal architecture suffers from uncontrollable and undesired predictions. To address those issues, we explore prompt learning in universal architectures for image restoration tasks. In this paper, we present Degradation-aware Visual Prompts, which encode various types of image degradation, e.g., noise and blur, into unified visual prompts. These degradation-aware prompts provide control over image processing and allow weighted combinations for customized image restoration. We then leverage degradation-aware visual prompts to establish a controllable and universal model for image restoration, called ProRes, which is applicable to an extensive range of image restoration tasks. ProRes leverages the vanilla Vision Transformer (ViT) without any task-specific designs. Furthermore, the pre-trained ProRes can easily adapt to new tasks through efficient prompt tuning with only a few images. Without bells and whistles, ProRes achieves competitive performance compared to task-specific methods and experiments can demonstrate its ability for controllable restoration and adaptation for new tasks. The code and models will be released in \url{https://github.com/leonmakise/ProRes}.

Viaarxiv icon

SparseTrack: Multi-Object Tracking by Performing Scene Decomposition based on Pseudo-Depth

Jun 08, 2023
Zelin Liu, Xinggang Wang, Cheng Wang, Wenyu Liu, Xiang Bai

Figure 1 for SparseTrack: Multi-Object Tracking by Performing Scene Decomposition based on Pseudo-Depth
Figure 2 for SparseTrack: Multi-Object Tracking by Performing Scene Decomposition based on Pseudo-Depth
Figure 3 for SparseTrack: Multi-Object Tracking by Performing Scene Decomposition based on Pseudo-Depth
Figure 4 for SparseTrack: Multi-Object Tracking by Performing Scene Decomposition based on Pseudo-Depth

Exploring robust and efficient association methods has always been an important issue in multiple-object tracking (MOT). Although existing tracking methods have achieved impressive performance, congestion and frequent occlusions still pose challenging problems in multi-object tracking. We reveal that performing sparse decomposition on dense scenes is a crucial step to enhance the performance of associating occluded targets. To this end, we propose a pseudo-depth estimation method for obtaining the relative depth of targets from 2D images. Secondly, we design a depth cascading matching (DCM) algorithm, which can use the obtained depth information to convert a dense target set into multiple sparse target subsets and perform data association on these sparse target subsets in order from near to far. By integrating the pseudo-depth method and the DCM strategy into the data association process, we propose a new tracker, called SparseTrack. SparseTrack provides a new perspective for solving the challenging crowded scene MOT problem. Only using IoU matching, SparseTrack achieves comparable performance with the state-of-the-art (SOTA) methods on the MOT17 and MOT20 benchmarks. Code and models are publicly available at \url{https://github.com/hustvl/SparseTrack}.

* 12 pages, 8 figures 
Viaarxiv icon

Multi-level Multiple Instance Learning with Transformer for Whole Slide Image Classification

Jun 08, 2023
Ruijie Zhang, Qiaozhe Zhang, Yingzhuang Liu, Hao Xin, Yan Liu, Xinggang Wang

Figure 1 for Multi-level Multiple Instance Learning with Transformer for Whole Slide Image Classification
Figure 2 for Multi-level Multiple Instance Learning with Transformer for Whole Slide Image Classification
Figure 3 for Multi-level Multiple Instance Learning with Transformer for Whole Slide Image Classification
Figure 4 for Multi-level Multiple Instance Learning with Transformer for Whole Slide Image Classification

Whole slide image (WSI) refers to a type of high-resolution scanned tissue image, which is extensively employed in computer-assisted diagnosis (CAD). The extremely high resolution and limited availability of region-level annotations make it challenging to employ deep learning methods for WSI-based digital diagnosis. Multiple instance learning (MIL) is a powerful tool to address the weak annotation problem, while Transformer has shown great success in the field of visual tasks. The combination of both should provide new insights for deep learning based image diagnosis. However, due to the limitations of single-level MIL and the attention mechanism's constraints on sequence length, directly applying Transformer to WSI-based MIL tasks is not practical. To tackle this issue, we propose a Multi-level MIL with Transformer (MMIL-Transformer) approach. By introducing a hierarchical structure to MIL, this approach enables efficient handling of MIL tasks that involve a large number of instances. To validate its effectiveness, we conducted a set of experiments on WSIs classification task, where MMIL-Transformer demonstrate superior performance compared to existing state-of-the-art methods. Our proposed approach achieves test AUC 94.74% and test accuracy 93.41% on CAMELYON16 dataset, test AUC 99.04% and test accuracy 94.37% on TCGA-NSCLC dataset, respectively. All code and pre-trained models are available at: https://github.com/hustvl/MMIL-Transformer

Viaarxiv icon