Kai Han

VanillaKD: Revisit the Power of Vanilla Knowledge Distillation from Small Scale to Large Scale

May 25, 2023
Zhiwei Hao, Jianyuan Guo, Kai Han, Han Hu, Chang Xu, Yunhe Wang

The tremendous success of large models trained on extensive datasets demonstrates that scale is a key ingredient in achieving superior results. It is therefore imperative to reconsider whether it is sound to design knowledge distillation (KD) approaches for limited-capacity architectures based solely on small-scale datasets. In this paper, we identify the "small data pitfall" present in previous KD methods, which results in the underestimation of the power of the vanilla KD framework on large-scale datasets such as ImageNet-1K. Specifically, we show that employing stronger data augmentation techniques and using larger datasets can directly decrease the gap between vanilla KD and other meticulously designed KD variants. This highlights the necessity of designing and evaluating KD approaches in the context of practical scenarios, casting off the limitations of small-scale datasets. Our investigation of vanilla KD and its variants under more complex schemes, including stronger training strategies and different model capacities, demonstrates that vanilla KD is elegantly simple but astonishingly effective in large-scale scenarios. Without bells and whistles, we obtain state-of-the-art ResNet-50, ViT-S, and ConvNeXtV2-T models for ImageNet, which achieve 83.1%, 84.3%, and 85.0% top-1 accuracy, respectively. PyTorch code and checkpoints can be found at https://github.com/Hao840/vanillaKD.
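
A minimal sketch of the vanilla KD objective the abstract revisits: hard-label cross-entropy plus a temperature-softened KL term between teacher and student logits. The temperature and mixing weight below are illustrative defaults, not the paper's settings.

```python
import torch.nn.functional as F

def vanilla_kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # Hard-label supervision.
    ce = F.cross_entropy(student_logits, labels)
    # Temperature-softened teacher/student distributions; the T^2 factor
    # keeps gradient magnitudes comparable across temperatures.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * kd + (1.0 - alpha) * ce
```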

Learning Semi-supervised Gaussian Mixture Models for Generalized Category Discovery

May 10, 2023
Bingchen Zhao, Xin Wen, Kai Han

In this paper, we address the problem of generalized category discovery (GCD): given a set of images, part of which are labelled and the rest are not, the task is to automatically cluster the images in the unlabelled data by leveraging the information from the labelled data, where the unlabelled data contain images from both the labelled classes and new ones. GCD is similar to semi-supervised learning (SSL) but is more realistic and challenging, as SSL assumes all the unlabelled images come from the same classes as the labelled ones. We also do not assume the class number in the unlabelled data is known a priori, making the GCD problem even harder. To tackle GCD without knowing the class number, we propose an EM-like framework that alternates between representation learning and class number estimation. We propose a semi-supervised variant of the Gaussian Mixture Model (GMM) with a stochastic splitting and merging mechanism that dynamically determines the prototypes by examining cluster compactness and separability. With these prototypes, we leverage prototypical contrastive learning for representation learning on the partially labelled data, subject to the constraints imposed by the labelled data. Our framework alternates between these two steps until convergence. The cluster assignment for an unlabelled instance can then be retrieved by identifying its nearest prototype. We comprehensively evaluate our framework on both generic image classification datasets and challenging fine-grained object recognition datasets, achieving state-of-the-art performance.
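
A hedged sketch of one alternation round described above: assign embeddings to their nearest prototype, then re-estimate prototypes, with labelled instances pinned to their ground-truth class (unlabelled instances carry -1 here). The stochastic prototype splitting/merging and the prototypical contrastive learning are omitted.

```python
import numpy as np

def em_step(feats, prototypes, labels):
    # E-step: nearest-prototype assignment under cosine similarity.
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    assign = np.argmax(f @ p.T, axis=1)
    labelled = labels >= 0
    assign[labelled] = labels[labelled]  # semi-supervised constraint
    # M-step: each prototype becomes the mean of its assigned features.
    new_protos = np.stack([
        f[assign == k].mean(axis=0) if np.any(assign == k) else p[k]
        for k in range(len(p))
    ])
    return assign, new_protos
```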

SimSC: A Simple Framework for Semantic Correspondence with Temperature Learning

May 03, 2023
Xinghui Li, Kai Han, Xingchen Wan, Victor Adrian Prisacariu

We propose SimSC, a remarkably simple framework that addresses semantic matching based only on the feature backbone. We discover that when fine-tuning an ImageNet pre-trained backbone on the semantic matching task, L2 normalization of the feature map, a standard procedure in feature matching, produces an overly smooth matching distribution and significantly hinders fine-tuning. By setting an appropriate temperature for the softmax, this over-smoothness can be alleviated and the quality of the features substantially improved. We employ a learning module to predict the optimal temperature for fine-tuning feature backbones; this module is trained together with the backbone, and the temperature is updated online. We evaluate our method on three public datasets and demonstrate accuracy on par with state-of-the-art methods under the same backbone, without using a learned matching head. Our method is versatile and works with various types of backbones. We show that the accuracy of our framework can be easily improved by coupling it with more powerful backbones.
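
A minimal sketch of the matching step the abstract describes, assuming the learned module reduces to a single trainable log-temperature (`log_temp`, e.g. a `torch.nn.Parameter`); the paper instead predicts the temperature with a module trained alongside the backbone.

```python
import torch.nn.functional as F

def match_distribution(feat_src, feat_tgt, log_temp):
    # feat_src, feat_tgt: (C, H, W) backbone feature maps of the image pair.
    C = feat_src.shape[0]
    src = F.normalize(feat_src.reshape(C, -1), dim=0)  # standard L2 normalization
    tgt = F.normalize(feat_tgt.reshape(C, -1), dim=0)
    corr = src.t() @ tgt                               # (HW, HW) cosine similarities
    temp = log_temp.exp()                              # keeps the temperature positive
    return F.softmax(corr / temp, dim=1)               # sharper than the default T=1
```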

SATIN: A Multi-Task Metadataset for Classifying Satellite Imagery using Vision-Language Models

Apr 23, 2023
Jonathan Roberts, Kai Han, Samuel Albanie

Interpreting remote sensing imagery enables numerous downstream applications, ranging from land-use planning to deforestation monitoring. Robustly classifying this data is challenging due to the Earth's geographic diversity. While many distinct satellite and aerial image classification datasets exist, no existing benchmark suitably covers this diversity. In this work, we introduce SATellite ImageNet (SATIN), a metadataset curated from 27 existing remotely sensed datasets, and comprehensively evaluate the zero-shot transfer classification capabilities of a broad range of vision-language (VL) models on SATIN. We find SATIN to be a challenging benchmark: the strongest method we evaluate achieves a classification accuracy of only 52.0%. We provide a public leaderboard at https://satinbenchmark.github.io to guide and track the progress of VL models in this important domain.
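
For context, zero-shot transfer classification with a VL model follows the CLIP recipe sketched below; the class names, prompt template, and file name are placeholders, not SATIN's actual label sets.

```python
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cpu")
classes = ["airport", "forest", "river"]  # illustrative labels only
text = clip.tokenize([f"a satellite photo of a {c}" for c in classes])

image = preprocess(Image.open("tile.png")).unsqueeze(0)  # hypothetical image tile
with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(text)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    pred = (img_emb @ txt_emb.T).argmax(dim=-1).item()
print(classes[pred])
```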

CiPR: An Efficient Framework with Cross-instance Positive Relations for Generalized Category Discovery

Apr 14, 2023
Shaozhe Hao, Kai Han, Kwan-Yee K. Wong

We tackle generalized category discovery (GCD): the open-world problem of automatically clustering a partially labelled dataset in which the unlabelled data contain instances from both novel categories and the labelled classes. In this paper, we address the GCD problem without a known category number in the unlabelled data. We propose a framework, named CiPR, that bootstraps the representation by exploiting Cross-instance Positive Relations in the partially labelled data for contrastive learning, which have been neglected by existing methods. First, to obtain reliable cross-instance relations that facilitate representation learning, we introduce a semi-supervised hierarchical clustering algorithm, named selective neighbor clustering (SNC), which produces a clustering hierarchy directly from the connected components of a graph constructed from selective neighbors. We also extend SNC to support label assignment for the unlabelled instances given the class number. Moreover, we present a method to estimate the unknown class number using SNC with a joint reference score that considers clustering indexes of both labelled and unlabelled data. Finally, we thoroughly evaluate our framework on public generic image recognition datasets and challenging fine-grained datasets, establishing a new state of the art on all of them.

* Under review 
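
A hedged sketch of the graph-components idea SNC builds on: link each embedding to a few neighbours and read clusters off the connected components. The paper's actual neighbour-selection rule and the resulting clustering hierarchy are more involved than this plain k-NN illustration.

```python
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import kneighbors_graph

def component_clusters(feats, k=3):
    # Connect each embedding to its k nearest neighbours, then cluster by
    # connected components of the (undirected) graph.
    graph = kneighbors_graph(feats, n_neighbors=k, mode="connectivity")
    n_clusters, labels = connected_components(graph, directed=False)
    return n_clusters, labels
```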

DreamAvatar: Text-and-Shape Guided 3D Human Avatar Generation via Diffusion Models

Apr 06, 2023
Yukang Cao, Yan-Pei Cao, Kai Han, Ying Shan, Kwan-Yee K. Wong

We present DreamAvatar, a text-and-shape guided framework for generating high-quality 3D human avatars with controllable poses. While recent methods have produced encouraging results on text-guided generation of common 3D objects, generating high-quality human avatars remains an open challenge due to the complexity of the human body's shape, pose, and appearance. DreamAvatar tackles this challenge with a trainable NeRF that predicts density and color features for 3D points, and a pre-trained text-to-image diffusion model that provides 2D self-supervision. Specifically, we leverage SMPL models to provide rough pose and shape guidance for the generation. We introduce a dual-space design comprising a canonical space and an observation space, related by a learnable deformation field through the NeRF, which allows well-optimized texture and geometry to be transferred from the canonical space to the target posed avatar. Additionally, we exploit a normal-consistency regularization to enable more vivid generation with detailed geometry and texture. Through extensive evaluations, we demonstrate that DreamAvatar significantly outperforms existing methods, establishing a new state of the art for text-and-shape guided 3D human generation.

* 19 pages, 19 figures. Project page: https://yukangcao.github.io/DreamAvatar/ 
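
The 2D self-supervision from a frozen text-to-image diffusion model is typically realized as score-distillation-style guidance (in the spirit of DreamFusion); a rough sketch follows, where `unet` and `alphas_cumprod` stand in for a real diffusion pipeline and this is not claimed to be DreamAvatar's exact objective.

```python
import torch

def sds_grad(render, t, text_emb, unet, alphas_cumprod):
    # render: (B, 3, H, W) differentiable render of the avatar; t: timestep.
    noise = torch.randn_like(render)
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    noisy = a_t.sqrt() * render + (1.0 - a_t).sqrt() * noise
    with torch.no_grad():
        eps_pred = unet(noisy, t, text_emb)  # frozen text-conditioned denoiser
    w = 1.0 - a_t                            # one common timestep weighting
    return w * (eps_pred - noise)            # gradient pushed back through the render
```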

What's in a Name? Beyond Class Indices for Image Recognition

Apr 05, 2023
Kai Han, Yandong Li, Sagar Vaze, Jie Li, Xuhui Jia

Existing machine learning models demonstrate excellent performance in image object recognition after training on a large-scale dataset under full supervision. However, these models only learn to map an image to a predefined class index, without revealing the actual semantic meaning of the object in the image. In contrast, vision-language models like CLIP can assign semantic class names to unseen objects in a 'zero-shot' manner, although they still rely on a predefined set of candidate names at test time. In this paper, we reconsider the recognition problem and task a vision-language model with assigning class names to images given only a large and essentially unconstrained vocabulary of categories as prior information. We use non-parametric methods to establish relationships between images, which allow the model to automatically narrow down the set of possible candidate names. Specifically, we propose iteratively clustering the data and voting on class names within the clusters, showing that this yields a roughly 50% improvement over the baseline on ImageNet. Furthermore, we tackle this problem in both unsupervised and partially supervised settings, and with both coarse-grained and fine-grained search spaces as the unconstrained dictionary.
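
A hedged sketch of a single cluster-then-vote round (the paper iterates; here KMeans and a plain nearest-name lookup are simplifying stand-ins for its non-parametric machinery):

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

def vote_names(img_embs, vocab_embs, vocab, n_clusters):
    # Cluster image embeddings, pick each image's best vocabulary entry,
    # then let every cluster adopt its majority name.
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(img_embs)
    nearest = np.argmax(img_embs @ vocab_embs.T, axis=1)
    cluster_names = {}
    for c in range(n_clusters):
        votes = Counter(nearest[clusters == c].tolist())
        cluster_names[c] = vocab[votes.most_common(1)[0][0]]
    return clusters, cluster_names
```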

Zero-Shot Semantic Segmentation with Decoupled One-Pass Network

Apr 03, 2023
Cong Han, Yujie Zhong, Dengjie Li, Kai Han, Lin Ma

Recently, the zero-shot semantic segmentation problem has attracted increasing attention, and the best performing methods are based on two-stream networks: one stream for proposal mask generation and the other for segment classification using a pre-trained visual-language model. However, existing two-stream methods require passing a large number (up to a hundred) of image crops into the visual-language model, which is highly inefficient. To address this problem, we propose a network that needs only a single pass through the visual-language model for each input image. Specifically, we first propose a novel network adaptation approach, termed patch severance, to restrict harmful interference between the patch embeddings in the pre-trained visual encoder. We then propose classification anchor learning to encourage the network to spatially focus on more discriminative features for classification. Extensive experiments demonstrate that the proposed method achieves outstanding performance, surpassing state-of-the-art methods while being 4 to 7 times faster at inference. We release our code at https://github.com/CongHan0808/DeOP.git.

* 13 pages, 9 figures 
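
Why a single pass can suffice, in a hedged sketch: encode the image once into patch embeddings, then score each proposal mask by pooling the patches it covers against the text embeddings, instead of cropping and re-encoding every proposal. Patch severance and classification anchor learning are not modelled here.

```python
import torch.nn.functional as F

def classify_masks(patch_embs, masks, text_embs):
    # patch_embs: (N, D) patch embeddings from ONE visual-encoder pass.
    # masks: (M, N) soft proposal masks over the N patches.
    # text_embs: (K, D) class-name embeddings from the text encoder.
    weights = masks / masks.sum(dim=1, keepdim=True).clamp_min(1e-6)
    mask_embs = F.normalize(weights @ patch_embs, dim=-1)  # (M, D) pooled per mask
    text_embs = F.normalize(text_embs, dim=-1)
    return (mask_embs @ text_embs.t()).softmax(dim=-1)     # (M, K) class scores
```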

SeSDF: Self-evolved Signed Distance Field for Implicit 3D Clothed Human Reconstruction

Apr 01, 2023
Yukang Cao, Kai Han, Kwan-Yee K. Wong

We address the problem of clothed human reconstruction from a single image or uncalibrated multi-view images. Existing methods struggle to reconstruct the detailed geometry of a clothed human and often require a calibrated setup for multi-view reconstruction. We propose a flexible framework which, by leveraging the parametric SMPL-X model, can take an arbitrary number of input images to reconstruct a clothed human model in an uncalibrated setting. At the core of our framework is our novel self-evolved signed distance field (SeSDF) module, which allows the framework to learn to deform the signed distance field (SDF) derived from the fitted SMPL-X model, such that detailed geometry reflecting the actual clothed human can be encoded for better reconstruction. In addition, we propose a simple method for self-calibration of multi-view images via the fitted SMPL-X parameters, which removes the need for tedious manual calibration and greatly increases the flexibility of our method. Further, we introduce an effective occlusion-aware feature fusion strategy to exploit the most useful features for reconstructing the human model. We thoroughly evaluate our framework on public benchmarks, demonstrating significant superiority over the state of the art both qualitatively and quantitatively.

* 25 pages, 21 figures 
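
A hedged sketch of the residual-SDF idea at the core of the framework: start from the signed distance to the fitted SMPL-X body and let a small MLP, conditioned on image-aligned features, predict a residual that deforms it toward the clothed geometry. Layer sizes and the feature extractor are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DeltaSDF(nn.Module):
    def __init__(self, feat_dim=256, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + 1 + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, points, smplx_sdf, feats):
        # points: (N, 3); smplx_sdf: (N, 1) SDF of the fitted body;
        # feats: (N, feat_dim) image-aligned features at the points.
        delta = self.mlp(torch.cat([points, smplx_sdf, feats], dim=-1))
        return smplx_sdf + delta  # refined SDF encoding clothed detail
```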