
Shihao Zhao

ViCo: Detail-Preserving Visual Condition for Personalized Text-to-Image Generation

Jun 01, 2023
Shaozhe Hao, Kai Han, Shihao Zhao, Kwan-Yee K. Wong

Personalized text-to-image generation using diffusion models has recently been proposed and has attracted considerable attention. Given a handful of images containing a novel concept (e.g., a unique toy), we aim to tune the generative model to capture fine visual details of the novel concept and generate photorealistic images following a text condition. We present a plug-in method, named ViCo, for fast and lightweight personalized generation. Specifically, we propose an image attention module that conditions the diffusion process on patch-wise visual semantics. We introduce an attention-based object mask that comes at almost no extra cost from the attention module. In addition, we design a simple regularization based on the intrinsic properties of text-image attention maps to alleviate the common overfitting degradation. Unlike many existing models, our method does not fine-tune any parameters of the original diffusion model. This allows for more flexible and transferable model deployment. With only light parameter training (~6% of the diffusion U-Net), our method achieves performance comparable to or better than state-of-the-art models, both qualitatively and quantitatively.

* Under review 
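To make the mechanism in the abstract concrete, below is a minimal PyTorch sketch of an image cross-attention block in this spirit: spatial tokens from a frozen diffusion U-Net attend to patch-wise embeddings of the reference image, and a soft object mask is read off the attention weights at essentially no extra cost. The module name, shapes, and mask heuristic are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class ImageCrossAttention(nn.Module):
    """Hypothetical image-conditioning block: U-Net tokens attend to reference patches."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.to_q = nn.Linear(dim, dim, bias=False)  # queries from U-Net features
        self.to_k = nn.Linear(dim, dim, bias=False)  # keys from reference-image patches
        self.to_v = nn.Linear(dim, dim, bias=False)  # values from reference-image patches
        self.to_out = nn.Linear(dim, dim)

    def forward(self, unet_feats, patch_feats):
        # unet_feats:  (B, N, C) spatial tokens from the frozen diffusion U-Net
        # patch_feats: (B, M, C) patch embeddings of the reference image
        B, N, C = unet_feats.shape
        h = self.num_heads
        q = self.to_q(unet_feats).reshape(B, N, h, C // h).transpose(1, 2)
        k = self.to_k(patch_feats).reshape(B, -1, h, C // h).transpose(1, 2)
        v = self.to_v(patch_feats).reshape(B, -1, h, C // h).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) * self.scale          # (B, h, N, M)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        # Soft object mask: head-averaged attention, strongest response over
        # reference patches at each query location.
        mask = attn.mean(dim=1).max(dim=-1).values             # (B, N)
        return self.to_out(out), mask
```

Only lightweight blocks like this would be trained; the base diffusion model stays frozen, which is what makes the approach plug-in.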

Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models

May 31, 2023
Shihao Zhao, Dongdong Chen, Yen-Chun Chen, Jianmin Bao, Shaozhe Hao, Lu Yuan, Kwan-Yee K. Wong

Text-to-image diffusion models have made tremendous progress over the past two years, enabling the generation of highly realistic images from open-domain text descriptions. However, despite their success, text descriptions often struggle to convey detailed controls adequately, even when composed of long and complex texts. Moreover, recent studies have shown that these models face challenges in understanding such complex texts and generating the corresponding images. There is therefore a growing need to enable control modes beyond text description. In this paper, we introduce Uni-ControlNet, a novel approach that allows the simultaneous use of different local controls (e.g., edge maps, depth maps, segmentation masks) and global controls (e.g., CLIP image embeddings) in a flexible and composable manner within a single model. Unlike existing methods, Uni-ControlNet requires fine-tuning only two additional adapters on top of frozen pre-trained text-to-image diffusion models, eliminating the huge cost of training from scratch. Moreover, thanks to dedicated adapter designs, Uni-ControlNet needs only a constant number (i.e., 2) of adapters, regardless of the number of local or global controls used. This not only reduces the fine-tuning cost and model size, making it more suitable for real-world deployment, but also facilitates the composability of different conditions. Through both quantitative and qualitative comparisons, Uni-ControlNet demonstrates its superiority over existing methods in terms of controllability, generation quality, and composability. Code is available at https://github.com/ShihaoZhaoZSH/Uni-ControlNet.

* Code is available at https://github.com/ShihaoZhaoZSH/Uni-ControlNet 
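As a rough illustration of the two-adapter design described above, here is a hedged PyTorch sketch: one local adapter that stacks any number of spatial condition maps channel-wise before encoding them, and one global adapter that projects a CLIP image embedding into a few extra conditioning tokens. Module names, channel counts, and token counts are assumptions for illustration, not the released code.

```python
import torch
import torch.nn as nn

class LocalAdapter(nn.Module):
    """Hypothetical local adapter: fuses all spatial controls in one module."""
    def __init__(self, num_conditions: int, hidden: int = 64, out_channels: int = 320):
        super().__init__()
        # All local maps are concatenated channel-wise, so the number of adapters
        # stays constant regardless of how many local controls are used.
        self.net = nn.Sequential(
            nn.Conv2d(num_conditions * 3, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, out_channels, 3, stride=2, padding=1),
        )

    def forward(self, condition_maps):               # list of (B, 3, H, W) maps
        return self.net(torch.cat(condition_maps, dim=1))

class GlobalAdapter(nn.Module):
    """Hypothetical global adapter: CLIP image embedding -> extra context tokens."""
    def __init__(self, clip_dim: int = 768, context_dim: int = 768, num_tokens: int = 4):
        super().__init__()
        self.num_tokens = num_tokens
        self.proj = nn.Linear(clip_dim, context_dim * num_tokens)

    def forward(self, clip_embedding):               # (B, clip_dim)
        return self.proj(clip_embedding).view(clip_embedding.shape[0], self.num_tokens, -1)

# Usage sketch: the local features would be injected into a frozen text-to-image
# U-Net as additive residuals, and the global tokens appended to the text
# cross-attention context; only the two adapters are trained.
local_feat = LocalAdapter(num_conditions=3)([torch.randn(1, 3, 64, 64) for _ in range(3)])
global_tokens = GlobalAdapter()(torch.randn(1, 768))
```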

Revisiting Adversarial Robustness Distillation: Robust Soft Labels Make Student Better

Aug 18, 2021
Bojia Zi, Shihao Zhao, Xingjun Ma, Yu-Gang Jiang

Adversarial training is an effective approach for training robust deep neural networks against adversarial attacks. While able to deliver reliable robustness, adversarial training (AT) methods generally favor high-capacity models, i.e., the larger the model, the better the robustness. This tends to limit their effectiveness on small models, which are preferable in scenarios where storage or computing resources are very limited (e.g., mobile devices). In this paper, we leverage the concept of knowledge distillation to improve the robustness of small models by distilling from adversarially trained large models. We first revisit several state-of-the-art AT methods from a distillation perspective and identify one common technique that leads to improved robustness: the use of robust soft labels, i.e., predictions of a robust model. Following this observation, we propose a novel adversarial robustness distillation method called Robust Soft Label Adversarial Distillation (RSLAD) to train robust small student models. RSLAD fully exploits the robust soft labels produced by a robust (adversarially trained) large teacher model to guide the student's learning on both natural and adversarial examples in all loss terms. We empirically demonstrate the effectiveness of RSLAD over existing adversarial training and distillation methods in improving the robustness of small models against state-of-the-art attacks, including AutoAttack. We also provide a set of analyses of RSLAD and of the importance of robust soft labels for adversarial robustness distillation.
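The core training signal the abstract describes can be sketched as follows: a fixed, adversarially trained teacher produces soft labels on clean inputs, adversarial examples are crafted to push the student away from those soft labels, and both the natural and adversarial loss terms match the student against the same robust soft labels rather than hard labels. The attack settings and loss weighting below are illustrative choices, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def robust_soft_label_distillation_step(student, teacher, x,
                                        eps=8/255, alpha=2/255, steps=10, lam=0.5):
    """One hypothetical training step in the spirit of robust soft-label distillation."""
    with torch.no_grad():
        soft_labels = F.softmax(teacher(x), dim=1)   # robust soft labels from the teacher

    # Inner maximization (PGD-style): adversarial examples that maximize the
    # divergence between the student's prediction and the teacher's soft labels.
    x_adv = torch.clamp(x + torch.empty_like(x).uniform_(-eps, eps), 0, 1)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        kl = F.kl_div(F.log_softmax(student(x_adv), dim=1), soft_labels,
                      reduction="batchmean")
        grad, = torch.autograd.grad(kl, x_adv)
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.clamp(torch.min(torch.max(x_adv, x - eps), x + eps), 0, 1)

    # Outer minimization: match the teacher's soft labels on both natural and
    # adversarial inputs; no hard labels appear in either term.
    loss_nat = F.kl_div(F.log_softmax(student(x), dim=1), soft_labels,
                        reduction="batchmean")
    loss_adv = F.kl_div(F.log_softmax(student(x_adv), dim=1), soft_labels,
                        reduction="batchmean")
    return lam * loss_nat + (1 - lam) * loss_adv
```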

What Do Deep Nets Learn? Class-wise Patterns Revealed in the Input Space

Feb 06, 2021
Shihao Zhao, Xingjun Ma, Yisen Wang, James Bailey, Bo Li, Yu-Gang Jiang

Deep neural networks (DNNs) are increasingly deployed in different applications to achieve state-of-the-art performance. However, they are often applied as black boxes with limited understanding of what knowledge the model has learned from the data. In this paper, we focus on image classification and propose a method to visualize and understand the class-wise knowledge (patterns) learned by DNNs under three different settings: natural, backdoored, and adversarial. Unlike existing visualization methods, our method searches for a single predictive pattern in the pixel space to represent the knowledge learned by the model for each class. Based on the proposed method, we show that DNNs trained on natural (clean) data learn abstract shapes along with some texture, while backdoored models learn a suspicious pattern for the backdoored class. Interestingly, the fact that DNNs can learn a single predictive pattern for each class suggests that DNNs can learn a backdoor even from clean data, with the pattern itself serving as the backdoor trigger. In the adversarial setting, we show that adversarially trained models tend to learn more simplified shape patterns. Our method can serve as a useful tool to better understand the knowledge learned by DNNs on different datasets under different settings.
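A stripped-down version of the search the abstract describes could look like the sketch below: one pixel-space pattern per class is optimized so that, blended into arbitrary clean images, it drives a trained model toward that class. The blend ratio, optimizer, and schedule are illustrative assumptions rather than the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def find_class_pattern(model, loader, target_class, image_shape=(3, 32, 32),
                       blend=0.5, lr=0.05, epochs=5, device="cpu"):
    """Search for a single hypothetical class-wise predictive pattern in pixel space."""
    model.eval().to(device)
    for p in model.parameters():
        p.requires_grad_(False)                      # only the pattern is optimized
    pattern = torch.zeros(1, *image_shape, device=device, requires_grad=True)
    opt = torch.optim.Adam([pattern], lr=lr)
    for _ in range(epochs):
        for x, _ in loader:                          # ground-truth labels are not used
            x = x.to(device)
            mixed = torch.clamp((1 - blend) * x + blend * pattern, 0, 1)
            target = torch.full((x.size(0),), target_class, device=device)
            loss = F.cross_entropy(model(mixed), target)  # make the pattern predictive
            opt.zero_grad()
            loss.backward()
            opt.step()
    return pattern.detach()
```

Visualizing the returned pattern for each class is then a direct way to inspect what the model treats as class-defining evidence.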

Clean-Label Backdoor Attacks on Video Recognition Models

Mar 06, 2020
Shihao Zhao, Xingjun Ma, Xiang Zheng, James Bailey, Jingjing Chen, Yu-Gang Jiang

Deep neural networks (DNNs) are vulnerable to backdoor attacks, which hide backdoor triggers in DNNs by poisoning training data. A backdoored model behaves normally on clean test images, yet consistently predicts a particular target class for any test example that contains the trigger pattern. As such, backdoor attacks are hard to detect and have raised severe security concerns in real-world applications. Thus far, backdoor research has mostly been conducted in the image domain with image classification models. In this paper, we show that existing image backdoor attacks are far less effective on videos, and outline four strict conditions under which existing attacks are likely to fail: 1) scenarios with more input dimensions (e.g., videos), 2) scenarios with high resolution, 3) scenarios with a large number of classes and few examples per class (a "sparse dataset"), and 4) attacks with access to correct labels (e.g., clean-label attacks). We propose the use of a universal adversarial trigger as the backdoor trigger to attack video recognition models, a setting where backdoor attacks are likely to be challenged by the above four strict conditions. We show on benchmark video datasets that our proposed backdoor attack can manipulate state-of-the-art video models with high success rates by poisoning only a small proportion of the training data (without changing the labels). We also show that our proposed backdoor attack is resistant to state-of-the-art backdoor defense/detection methods, and can even be applied to improve image backdoor attacks. Our proposed video backdoor attack not only serves as a strong baseline for improving the robustness of video models, but also provides a new perspective for understanding more powerful backdoor attacks.
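For readers who want the gist in code, here is a hedged sketch of the two ingredients the abstract combines: crafting a single universal adversarial trigger against a video model, and stamping that trigger into frames so that a small fraction of training videos can be poisoned without changing their labels. Patch size, placement, and optimization settings are illustrative, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

def stamp(videos, trigger):
    """Paste the trigger patch into the bottom-right corner of every frame."""
    # videos: (B, C, T, H, W); trigger: (C, h, w)
    out = videos.clone()
    c, h, w = trigger.shape
    out[..., -h:, -w:] = trigger.view(1, c, 1, h, w)
    return out

def craft_universal_trigger(model, loader, target_class, size=(3, 20, 20),
                            lr=0.01, steps=200, device="cpu"):
    """Optimize one trigger that pushes many different videos toward the target class."""
    model.eval().to(device)
    for p in model.parameters():
        p.requires_grad_(False)                      # only the trigger is optimized
    trigger = torch.rand(size, device=device, requires_grad=True)
    opt = torch.optim.Adam([trigger], lr=lr)
    it = iter(loader)
    for _ in range(steps):
        try:
            videos, _ = next(it)
        except StopIteration:
            it = iter(loader)
            videos, _ = next(it)
        videos = videos.to(device)
        target = torch.full((videos.size(0),), target_class, device=device)
        loss = F.cross_entropy(model(stamp(videos, trigger)), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            trigger.clamp_(0, 1)                     # keep the patch a valid image region
    return trigger.detach()

# Clean-label poisoning sketch: stamp the crafted trigger into a small fraction
# of target-class training videos and leave their labels untouched.
```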
