
Shaozhe Hao

ViCo: Detail-Preserving Visual Condition for Personalized Text-to-Image Generation

Jun 01, 2023
Shaozhe Hao, Kai Han, Shihao Zhao, Kwan-Yee K. Wong

Personalized text-to-image generation with diffusion models has recently been proposed and has attracted significant attention. Given a handful of images containing a novel concept (e.g., a unique toy), we aim to tune the generative model to capture fine visual details of the novel concept and to generate photorealistic images following a text condition. We present a plug-in method, named ViCo, for fast and lightweight personalized generation. Specifically, we propose an image attention module that conditions the diffusion process on patch-wise visual semantics. We introduce an attention-based object mask that comes at almost no cost from the attention module. In addition, we design a simple regularization based on the intrinsic properties of text-image attention maps to alleviate the common overfitting degradation. Unlike many existing models, our method does not fine-tune any parameters of the original diffusion model, which allows more flexible and transferable model deployment. With only lightweight parameter training (~6% of the diffusion U-Net), our method achieves performance comparable to or even better than state-of-the-art models, both qualitatively and quantitatively.
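As a rough sketch of how a diffusion denoiser can be conditioned on patch-wise visual semantics, the PyTorch snippet below cross-attends U-Net tokens (queries) into reference-image patch embeddings (keys/values) and reads a crude foreground mask off the attention map. This is not the ViCo implementation; the `ImageCrossAttention` name, the dimensions, and the mask-thresholding heuristic are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ImageCrossAttention(nn.Module):
    """Cross-attention from denoiser tokens (queries) to reference-image
    patch embeddings (keys/values), with a mask read off the attention map.
    Dimensions and the mask heuristic are illustrative assumptions."""
    def __init__(self, dim=320, ref_dim=768, heads=8):
        super().__init__()
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(ref_dim, dim, bias=False)
        self.to_v = nn.Linear(ref_dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x, ref_patches, mask_threshold=0.5):
        # x: (B, N, dim) U-Net tokens; ref_patches: (B, M, ref_dim) image patches
        B, N, _ = x.shape
        M = ref_patches.shape[1]
        q = self.to_q(x).view(B, N, self.heads, -1).transpose(1, 2)
        k = self.to_k(ref_patches).view(B, M, self.heads, -1).transpose(1, 2)
        v = self.to_v(ref_patches).view(B, M, self.heads, -1).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)  # (B, h, N, M)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        # Crude object mask: tokens whose attention to the reference is sharply
        # peaked are treated as foreground. Purely an assumed heuristic.
        peak = attn.mean(dim=1).amax(dim=-1)                       # (B, N)
        peak = (peak - peak.amin(-1, keepdim=True)) / (
            peak.amax(-1, keepdim=True) - peak.amin(-1, keepdim=True) + 1e-6)
        mask = (peak > mask_threshold).float()
        return self.to_out(out), mask
```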

* Under review 

Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models

May 31, 2023
Shihao Zhao, Dongdong Chen, Yen-Chun Chen, Jianmin Bao, Shaozhe Hao, Lu Yuan, Kwan-Yee K. Wong

Text-to-image diffusion models have made tremendous progress over the past two years, enabling the generation of highly realistic images from open-domain text descriptions. Despite this success, text descriptions often struggle to convey detailed controls adequately, even when composed of long and complex prompts. Moreover, recent studies have shown that these models face challenges in understanding such complex texts and generating the corresponding images. There is therefore a growing need for control modes beyond text description. In this paper, we introduce Uni-ControlNet, a novel approach that allows the simultaneous use of different local controls (e.g., edge maps, depth maps, segmentation masks) and global controls (e.g., CLIP image embeddings) in a flexible and composable manner within a single model. Unlike existing methods, Uni-ControlNet only requires fine-tuning two additional adapters on top of frozen pre-trained text-to-image diffusion models, eliminating the huge cost of training from scratch. Moreover, thanks to dedicated adapter designs, Uni-ControlNet needs only a constant number (i.e., 2) of adapters, regardless of the number of local or global controls used. This not only reduces fine-tuning costs and model size, making it more suitable for real-world deployment, but also facilitates the composability of different conditions. Through both quantitative and qualitative comparisons, Uni-ControlNet demonstrates its superiority over existing methods in terms of controllability, generation quality, and composability. Code is available at https://github.com/ShihaoZhaoZSH/Uni-ControlNet.
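The constant-adapter idea can be sketched roughly as follows: a single local adapter takes all local condition maps at once via channel concatenation, while a single global adapter projects a CLIP image embedding into a few extra conditioning tokens appended to the text context, so the adapter count never grows with the number of controls. This is a simplified sketch, not the released Uni-ControlNet code; the `LocalAdapter`/`GlobalAdapter` names, channel sizes, and token count are assumptions.

```python
import torch
import torch.nn as nn

class LocalAdapter(nn.Module):
    """One adapter for all local conditions: maps are channel-concatenated so
    the adapter count stays constant however many controls are used."""
    def __init__(self, num_conditions=3, cond_channels=3, out_channels=320):
        super().__init__()
        in_ch = num_conditions * cond_channels
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(128, out_channels, 3, stride=2, padding=1),
        )

    def forward(self, conditions):
        # conditions: list of (B, cond_channels, H, W) maps, e.g. edges, depth, seg
        return self.net(torch.cat(conditions, dim=1))

class GlobalAdapter(nn.Module):
    """Projects a global CLIP image embedding into a few pseudo text tokens
    that are appended to the frozen text encoder's output."""
    def __init__(self, clip_dim=768, ctx_dim=768, num_tokens=4):
        super().__init__()
        self.num_tokens = num_tokens
        self.proj = nn.Linear(clip_dim, num_tokens * ctx_dim)

    def forward(self, clip_embed, text_context):
        # clip_embed: (B, clip_dim); text_context: (B, L, ctx_dim)
        tokens = self.proj(clip_embed).view(clip_embed.shape[0], self.num_tokens, -1)
        return torch.cat([text_context, tokens], dim=1)
```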

* Code is available at https://github.com/ShihaoZhaoZSH/Uni-ControlNet 

CiPR: An Efficient Framework with Cross-instance Positive Relations for Generalized Category Discovery

Apr 14, 2023
Shaozhe Hao, Kai Han, Kwan-Yee K. Wong

We tackle the problem of generalized category discovery (GCD). GCD considers the open-world setting of automatically clustering a partially labelled dataset in which the unlabelled data contain instances from both novel categories and the labelled classes. In this paper, we address GCD without a known category number for the unlabelled data. We propose a framework, named CiPR, that bootstraps the representation by exploiting Cross-instance Positive Relations for contrastive learning in the partially labelled data, which are neglected in existing methods. First, to obtain reliable cross-instance relations that facilitate representation learning, we introduce a semi-supervised hierarchical clustering algorithm, named selective neighbor clustering (SNC), which produces a clustering hierarchy directly from the connected components of a graph constructed with selective neighbors. We also extend SNC to assign labels to the unlabelled instances when the class number is given. Moreover, we present a method to estimate the unknown class number using SNC with a joint reference score that considers clustering indices of both labelled and unlabelled data. Finally, we thoroughly evaluate our framework on public generic image recognition datasets and challenging fine-grained datasets, establishing a new state-of-the-art on all of them.
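To make the clustering step concrete, here is a heavily simplified, single-level sketch in the spirit of selective neighbor clustering: each instance is linked to its nearest neighbour only when the cosine similarity clears a threshold, labelled instances sharing a class are force-linked, and the connected components become clusters. The actual SNC builds a full hierarchy and uses a different neighbour-selection rule; the `sim_threshold` parameter and the function name are assumptions.

```python
import numpy as np

def selective_neighbor_clustering(feats, labels, sim_threshold=0.9):
    """Simplified single-level sketch: selective nearest-neighbour edges plus
    labelled positive relations, then connected components as clusters.
    labels: list with a class id for labelled instances and None otherwise."""
    n = len(feats)
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = feats @ feats.T
    np.fill_diagonal(sim, -np.inf)

    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    def union(i, j):
        parent[find(i)] = find(j)

    # Selective nearest-neighbour edge for every instance.
    for i in range(n):
        j = int(sim[i].argmax())
        if sim[i, j] >= sim_threshold:
            union(i, j)
    # Positive relations from the partial labels.
    labelled = [i for i, y in enumerate(labels) if y is not None]
    for a in labelled:
        for b in labelled:
            if labels[a] == labels[b]:
                union(a, b)

    roots = [find(i) for i in range(n)]
    _, assignments = np.unique(roots, return_inverse=True)
    return assignments
```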

* Under review 

Learning Attention as Disentangler for Compositional Zero-shot Learning

Mar 27, 2023
Shaozhe Hao, Kai Han, Kwan-Yee K. Wong

Compositional zero-shot learning (CZSL) aims to learn visual concepts (i.e., attributes and objects) from seen compositions and to transfer that concept knowledge to unseen compositions. The key to CZSL is learning to disentangle the attribute-object composition. To this end, we propose to exploit cross-attentions as compositional disentanglers to learn disentangled concept embeddings. For example, to recognize an unseen composition "yellow flower", we can learn the attribute concept "yellow" and the object concept "flower" from different yellow objects and different flowers, respectively. To further constrain the disentanglers to learn the concepts of interest, we employ a regularization at the attention level. Specifically, we adapt the earth mover's distance (EMD) as a feature similarity metric in the cross-attention module. Moreover, benefiting from concept disentanglement, we improve the inference process and tune the prediction score by combining multiple concept probabilities. Comprehensive experiments on three CZSL benchmark datasets demonstrate that our method significantly outperforms previous works in both closed- and open-world settings, establishing a new state-of-the-art.
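A minimal sketch of the attention-as-disentangler idea: learnable attribute and object query tokens cross-attend into image patch features to yield two disentangled embeddings, and composition scores are obtained by multiplying the attribute and object probabilities. The EMD-based attention regularization is omitted; the module names, dimensions, and shared attention layer are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class AttentionDisentangler(nn.Module):
    """Attribute and object query tokens cross-attend into image patch
    features to pull out two disentangled concept embeddings."""
    def __init__(self, feat_dim=512, embed_dim=512):
        super().__init__()
        self.attr_query = nn.Parameter(torch.randn(1, 1, feat_dim))
        self.obj_query = nn.Parameter(torch.randn(1, 1, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(feat_dim, embed_dim)

    def forward(self, patch_feats):
        # patch_feats: (B, N, feat_dim) image patch tokens
        B = patch_feats.shape[0]
        attr, _ = self.attn(self.attr_query.expand(B, -1, -1), patch_feats, patch_feats)
        obj, _ = self.attn(self.obj_query.expand(B, -1, -1), patch_feats, patch_feats)
        return self.proj(attr.squeeze(1)), self.proj(obj.squeeze(1))

def composition_scores(attr_emb, obj_emb, attr_protos, obj_protos, pairs):
    """Score candidate compositions by combining the two concept probabilities.
    pairs: list of (attr_idx, obj_idx) tuples for candidate compositions."""
    a_prob = (attr_emb @ attr_protos.T).softmax(-1)   # (B, num_attrs)
    o_prob = (obj_emb @ obj_protos.T).softmax(-1)     # (B, num_objs)
    return torch.stack([a_prob[:, a] * o_prob[:, o] for a, o in pairs], dim=1)
```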

* CVPR 2023, available at https://haoosz.github.io/ade-czsl/ 

A Unified Framework for Masked and Mask-Free Face Recognition via Feature Rectification

Feb 15, 2022
Shaozhe Hao, Chaofeng Chen, Zhenfang Chen, Kwan-Yee K. Wong

Face recognition under ideal conditions is now considered a well-solved problem thanks to advances in deep learning. Recognizing faces under occlusion, however, remains a challenge. Existing techniques often fail to recognize faces when both the mouth and nose are covered by a mask, a situation that has become very common during the COVID-19 pandemic. Common approaches to this problem include 1) discarding information from the masked regions during recognition and 2) restoring the masked regions before recognition. Very few works have considered the consistency between features extracted from masked faces and from their mask-free counterparts, so models trained for recognizing masked faces often show degraded performance on mask-free faces. In this paper, we propose a unified framework, named Face Feature Rectification Network (FFR-Net), for recognizing masked and mask-free faces alike. We introduce rectification blocks that rectify features extracted by a state-of-the-art recognition model in both the spatial and channel dimensions, minimizing the distance between a masked face and its mask-free counterpart in the rectified feature space. Experiments show that our unified framework learns a rectified feature space that recognizes both masked and mask-free faces effectively, achieving state-of-the-art results. Project code: https://github.com/haoosz/FFR-Net
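The feature-rectification idea can be sketched as a small module that re-weights a backbone feature map along the channel and then the spatial dimension, trained so that a masked face's rectified feature stays close to its mask-free counterpart. This is a generic SE-style sketch rather than the actual FFR-Net block; channel counts, kernel sizes, and the MSE consistency loss are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RectificationBlock(nn.Module):
    """Rectifies a face feature map along channel and spatial dimensions so
    masked and mask-free features land close together in feature space."""
    def __init__(self, channels=512, reduction=16):
        super().__init__()
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3), nn.Sigmoid(),
        )

    def forward(self, feat):
        # feat: (B, C, H, W) features from a (frozen) recognition backbone
        feat = feat * self.channel_gate(feat)   # channel re-weighting
        feat = feat * self.spatial_gate(feat)   # spatial re-weighting
        return feat

def rectification_loss(rectified_masked, rectified_clean):
    """Pull the rectified masked-face feature toward its mask-free counterpart."""
    return F.mse_loss(rectified_masked, rectified_clean)
```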

* 5 pages, 4 figures, conference 