Weiming Dong

ProSpect: Expanded Conditioning for the Personalization of Attribute-aware Image Generation

May 25, 2023
Yuxin Zhang, Weiming Dong, Fan Tang, Nisha Huang, Haibin Huang, Chongyang Ma, Tong-Yee Lee, Oliver Deussen, Changsheng Xu

Personalizing generative models offers a way to guide image generation with user-provided references. Current personalization methods can invert an object or concept into the textual conditioning space and compose new natural sentences for text-to-image diffusion models. However, representing and editing specific visual attributes such as material, style, and layout remains a challenge, leading to a lack of disentanglement and editability. To address this, we propose a novel approach that leverages the step-by-step generation process of diffusion models, which generate images from low- to high-frequency information, providing a new perspective on representing, generating, and editing images. We develop the Prompt Spectrum Space P*, an expanded textual conditioning space, and a new image representation method called ProSpect. ProSpect represents an image as a collection of inverted textual token embeddings encoded from per-stage prompts, where each prompt corresponds to a specific generation stage (i.e., a group of consecutive steps) of the diffusion model. Experimental results demonstrate that P* and ProSpect offer stronger disentanglement and controllability than existing methods. We apply ProSpect to various personalized, attribute-aware image generation applications, such as image/text-guided material/style/layout transfer and editing, achieving previously unattainable results from a single image input without fine-tuning the diffusion models.
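
To make the per-stage conditioning idea concrete, below is a minimal PyTorch sketch of selecting one inverted token embedding per generation stage during sampling; the stage split, embedding size, and the `denoise_step` stub are illustrative assumptions rather than the authors' implementation.

```python
# A minimal sketch of per-stage prompt embeddings, in plain PyTorch.
import torch

NUM_STAGES = 10          # number of prompt-spectrum stages (assumed)
NUM_STEPS = 50           # total denoising steps (assumed)
EMBED_DIM = 768          # token embedding size of the text encoder (assumed)

# One learnable token embedding per generation stage, all inverted from the
# same reference image (the "collection of embeddings" idea).
stage_tokens = torch.nn.Parameter(torch.randn(NUM_STAGES, EMBED_DIM) * 0.01)

def stage_of(step: int) -> int:
    """Map a denoising step (0 = noisiest) to its stage index."""
    return min(step * NUM_STAGES // NUM_STEPS, NUM_STAGES - 1)

def denoise_step(latent, token, step):
    """Placeholder for one conditional denoising step of a diffusion model."""
    return latent - 0.01 * token.mean()  # dummy update, stands in for the U-Net

latent = torch.randn(4, 64, 64)
for step in range(NUM_STEPS):
    # Early (low-frequency) steps use the first embeddings, late steps the last,
    # so layout-, content-, and material-like attributes can be edited separately
    # by swapping only the embeddings of the relevant stages.
    token = stage_tokens[stage_of(step)]
    latent = denoise_step(latent, token, step)
```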

Style-A-Video: Agile Diffusion for Arbitrary Text-based Video Style Transfer

May 09, 2023
Nisha Huang, Yuxin Zhang, Weiming Dong

Large-scale text-to-video diffusion models have demonstrated an exceptional ability to synthesize diverse videos. However, due to the lack of extensive text-to-video datasets and the computational resources required for training, directly applying these models to video stylization remains difficult. Moreover, because the noise addition process applied to the input content is random and destructive, meeting the content preservation criteria of the style transfer task is challenging. This paper proposes a zero-shot video stylization method named Style-A-Video, which utilizes a generative pre-trained transformer together with an image latent diffusion model to achieve concise, text-controlled video stylization. We improve the guidance condition in the denoising process, establishing a balance between artistic expression and structure preservation. Furthermore, to decrease inter-frame flicker and avoid additional artifacts, we employ sampling optimization and a temporal consistency module. Extensive experiments show that we attain superior content preservation and stylistic performance while incurring lower computational cost than previous solutions. Code will be available at https://github.com/haha-lisa/Style-A-Video.
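
As one illustration of the temporal consistency idea mentioned above, the sketch below shows a simple frame-to-frame penalty that discourages inter-frame flicker; the loss form and weighting are assumptions for illustration, not the paper's exact module.

```python
# A minimal sketch of a frame-to-frame consistency penalty of the kind a
# temporal consistency module might use to reduce flicker across frames.
import torch

def temporal_consistency_loss(stylized: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    """stylized: (T, C, H, W) stack of stylized frames."""
    # Penalize large changes between consecutive stylized frames.
    diff = stylized[1:] - stylized[:-1]
    return lam * diff.abs().mean()

frames = torch.rand(8, 3, 256, 256)       # toy stylized clip
print(temporal_consistency_loss(frames))  # scalar that can steer sampling
```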

Region-Aware Diffusion for Zero-shot Text-driven Image Editing

Feb 23, 2023
Nisha Huang, Fan Tang, Weiming Dong, Tong-Yee Lee, Changsheng Xu

Image manipulation under the guidance of textual descriptions has recently received broad attention. In this study, we focus on the regional editing of images guided by given text prompts. Unlike current mask-based image editing methods, we propose a novel region-aware diffusion model (RDM) for entity-level image editing, which can automatically locate the region of interest and replace it following the given text prompts. To strike a balance between image fidelity and inference speed, we design an intensive diffusion pipeline by combining latent-space diffusion and enhanced directional guidance. In addition, to preserve image content in non-edited regions, we introduce region-aware entity editing to modify the region of interest while preserving the out-of-interest region. We validate the proposed RDM against baseline methods through extensive qualitative and quantitative experiments. The results show that RDM outperforms previous approaches in terms of visual quality, overall harmonization, content preservation in non-edited regions, and text-image semantic consistency. The code is available at https://github.com/haha-lisa/RDM-Region-Aware-Diffusion-Model.
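
The preservation of non-edited content can be illustrated with a small latent-blending sketch: at each denoising step, the text-conditioned prediction is kept only inside the region of interest, while the rest is taken from the re-noised original latent. The mask source and the `denoise_step`/`add_noise` stubs below are hypothetical placeholders, not RDM's implementation.

```python
# A minimal sketch of region-aware blending during denoising.
import torch

def denoise_step(latent, step):
    return latent * 0.98            # stands in for the conditional U-Net update

def add_noise(latent, step):
    return latent + 0.02 * torch.randn_like(latent)  # re-noise original content

original = torch.randn(4, 64, 64)   # latent of the input image
mask = torch.zeros(1, 64, 64)       # 1 inside the region of interest
mask[:, 16:48, 16:48] = 1.0

latent = torch.randn_like(original)
for step in range(50):
    edited = denoise_step(latent, step)
    kept = add_noise(original, step)            # matches the current noise level
    latent = mask * edited + (1 - mask) * kept  # preserve non-edited regions
```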

Inversion-Based Creativity Transfer with Diffusion Models

Nov 23, 2022
Yuxin Zhang, Nisha Huang, Fan Tang, Haibin Huang, Chongyang Ma, Weiming Dong, Changsheng Xu

In this paper, we introduce the task of "Creativity Transfer". The artistic creativity within a painting is its means of expression, which includes not only the painting material, colors, and brushstrokes, but also high-level attributes such as semantic elements and object shape. Previous arbitrary example-guided artistic image generation methods (e.g., style transfer) often fail to control shape changes or convey semantic elements. Pre-trained text-to-image diffusion probabilistic models achieve remarkable quality, but they often require extensive textual descriptions to accurately portray the attributes of a particular painting. We believe that the uniqueness of an artwork lies precisely in the fact that it cannot be adequately explained in ordinary language. Our key idea is to learn artistic creativity directly from a single painting and then guide the synthesis without providing complex textual descriptions. Specifically, we regard creativity as a learnable textual description of a painting. We propose an attention-based inversion method, which can efficiently and accurately learn the holistic and detailed information of an image, thus capturing the complete artistic creativity of a painting. We demonstrate the quality and efficiency of our method on numerous paintings by various artists in various styles. Code and models are available at https://github.com/zyxElsa/creativity-transfer.
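
A minimal sketch of the inversion loop, assuming a frozen denoiser and a single learnable embedding (the attention-based encoder from the paper is omitted): the embedding is optimized so that the model reconstructs the reference painting when conditioned on it.

```python
# A minimal sketch of learning a "creativity" embedding from one painting.
# `predict_noise` is a hypothetical stand-in for the frozen denoiser.
import torch

def predict_noise(noisy_latent, t, cond):
    # Placeholder for the frozen U-Net; depends on the learned embedding `cond`.
    return noisy_latent * 0.0 + cond.mean()

painting_latent = torch.randn(4, 64, 64)          # latent of the single painting
creativity = torch.nn.Parameter(torch.zeros(768)) # learnable "creativity" token
opt = torch.optim.Adam([creativity], lr=1e-3)

for it in range(100):
    t = torch.randint(0, 1000, (1,))
    noise = torch.randn_like(painting_latent)
    noisy = painting_latent + 0.1 * noise         # toy forward-noising
    loss = (predict_noise(noisy, t, creativity) - noise).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
# `creativity` can now condition generation without any textual description.
```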

DiffStyler: Controllable Dual Diffusion for Text-Driven Image Stylization

Nov 19, 2022
Nisha Huang, Yuxin Zhang, Fan Tang, Chongyang Ma, Haibin Huang, Yong Zhang, Weiming Dong, Changsheng Xu

Despite the impressive results of arbitrary image-guided style transfer methods, text-driven image stylization has recently been proposed to transfer a natural image into a stylized one according to a textual description of the target style provided by the user. Unlike previous image-to-image transfer approaches, the text-guided stylization process provides users with a more precise and intuitive way to express the desired style. However, the huge discrepancy between cross-modal inputs and outputs makes it challenging to conduct text-driven image stylization in a typical feed-forward CNN pipeline. In this paper, we present DiffStyler, which builds on diffusion models: cross-modal style information can be easily integrated as guidance at each step of the diffusion process. In particular, we use a dual diffusion processing architecture to control the balance between the content and style of the diffused results. Furthermore, we propose learnable noise derived from the content image on which the reverse denoising process is based, enabling the stylization results to better preserve the structural information of the content image. We validate the proposed DiffStyler against baseline methods through extensive qualitative and quantitative experiments.
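
The dual-path balancing can be sketched as blending a content-oriented and a text/style-oriented denoising branch at every step, starting from noise derived from the content image; the branch stubs and the blending weight below are assumptions for illustration, not DiffStyler's architecture.

```python
# A minimal sketch of blending two denoising branches per step.
import torch

def content_branch(latent, step):
    return latent * 0.99                             # stands in for the content path

def style_branch(latent, step, style_text):
    return latent * 0.97 + 0.001 * len(style_text)   # stands in for the text-guided path

style_text = "an oil painting in the style of impressionism"
content_latent = torch.randn(4, 64, 64)

# Start from noise derived from the content image rather than pure Gaussian
# noise, which helps preserve the structure of the content.
latent = content_latent + torch.randn_like(content_latent)

w = 0.6  # balance between style (w) and content (1 - w)
for step in range(50):
    latent = w * style_branch(latent, step, style_text) + \
             (1 - w) * content_branch(latent, step)
```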

Understanding and Mitigating Overfitting in Prompt Tuning for Vision-Language Models

Nov 14, 2022
Chengcheng Ma, Yang Liu, Jiankang Deng, Lingxi Xie, Weiming Dong, Changsheng Xu

Pre-trained Vision-Language Models (VLMs) such as CLIP have shown impressive generalization capability on downstream vision tasks with appropriate text prompts. Instead of designing prompts manually, Context Optimization (CoOp) has recently been proposed to learn continuous prompts from task-specific training data. Despite the performance improvements on downstream tasks, several studies have reported that CoOp suffers from overfitting in two respects: (i) the test accuracy on base classes first improves and then degrades during training; (ii) the test accuracy on novel classes keeps decreasing. However, no existing study effectively explains or mitigates this overfitting problem. In this paper, we first explore the cause of overfitting by analyzing the gradient flow. Comparative experiments reveal that CoOp favors generalizable features in the early training stage and spurious features in the later stage, leading to the non-overfitting and overfitting phenomena, respectively. Given these observations, we propose Subspace Prompt Tuning (SubPT), which projects the gradients during back-propagation onto the low-rank subspace spanned by the eigenvectors of the early-stage gradient flow throughout the entire training process, and successfully eliminates the overfitting problem. In addition, we equip CoOp with a Novel Feature Learner (NFL) to enhance the generalization of the learned prompts to novel categories beyond the training set, without requiring any image training data. Extensive experiments on 11 classification datasets demonstrate that SubPT+NFL consistently boosts the performance of CoOp and outperforms the state-of-the-art approach CoCoOp. Experiments on more challenging downstream vision tasks, including open-vocabulary object detection and zero-shot semantic segmentation, also verify the effectiveness of the proposed method. Code can be found at https://tinyurl.com/mpe64f89.
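
The gradient projection at the core of SubPT can be sketched in a few lines of PyTorch: collect gradients from the early training stage, take the top eigenvectors of that gradient flow via SVD, and project every later gradient onto the resulting subspace before the optimizer update. Dimensions, rank, and the toy loss below are assumptions; the real method operates on CoOp's prompt vectors against CLIP.

```python
# A minimal sketch of projecting gradients onto a low-rank early-gradient subspace.
import torch

dim, rank = 512, 8
prompt = torch.nn.Parameter(torch.randn(dim) * 0.02)  # learnable prompt vector

def toy_loss(p):
    return (p * torch.randn_like(p)).sum()            # stands in for the CLIP loss

# 1) Collect gradients during the early training stage.
early_grads = []
for _ in range(32):
    loss = toy_loss(prompt)
    g, = torch.autograd.grad(loss, prompt)
    early_grads.append(g)
G = torch.stack(early_grads)                           # (num_steps, dim)

# 2) Top-`rank` eigenvectors of the early gradient flow (via SVD of G).
_, _, Vh = torch.linalg.svd(G, full_matrices=False)
basis = Vh[:rank]                                      # (rank, dim)

# 3) Later training: project every gradient onto that subspace before the update.
opt = torch.optim.SGD([prompt], lr=1e-2)
for _ in range(100):
    loss = toy_loss(prompt)
    opt.zero_grad()
    loss.backward()
    with torch.no_grad():
        prompt.grad = basis.t() @ (basis @ prompt.grad)  # keep early, generalizable directions
    opt.step()
```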

Draw Your Art Dream: Diverse Digital Art Synthesis with Multimodal Guided Diffusion

Sep 28, 2022
Nisha Huang, Fan Tang, Weiming Dong, Changsheng Xu

Digital art synthesis is receiving increasing attention in the multimedia community because it engages the public with art effectively. Current digital art synthesis methods usually use single-modality inputs as guidance, which limits the expressiveness of the model and the diversity of the generated results. To solve this problem, we propose the multimodal guided artwork diffusion (MGAD) model, a diffusion-based digital artwork generation approach that utilizes multimodal prompts as guidance to control the classifier-free diffusion model. Additionally, the contrastive language-image pretraining (CLIP) model is used to unify the text and image modalities. Extensive qualitative and quantitative results on the generated digital art paintings confirm the effectiveness of combining the diffusion model with multimodal guidance. Code is available at https://github.com/haha-lisa/MGAD-multimodal-guided-artwork-diffusion.

* Accepted by ACM MM 2022 
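
One way to picture multimodal guidance is as the gradient of a CLIP-style similarity, computed against both a text prompt and a guidance image, that nudges the sample at every step; the encoder stubs and weights in the sketch below are illustrative assumptions rather than the MGAD implementation.

```python
# A minimal sketch of steering sampling with text and image guidance signals.
import torch

def encode_image(x):   # stands in for CLIP's image encoder
    return x.mean(dim=(-1, -2)).flatten()

text_feat = torch.randn(3)     # stands in for CLIP text features of the prompt
image_feat = torch.randn(3)    # features of a guidance image

def guidance_loss(x, w_text=1.0, w_image=0.5):
    f = encode_image(x)
    cos = torch.nn.functional.cosine_similarity
    return -(w_text * cos(f, text_feat, dim=0) + w_image * cos(f, image_feat, dim=0))

x = torch.randn(1, 3, 64, 64, requires_grad=True)
for step in range(50):
    loss = guidance_loss(x)
    grad, = torch.autograd.grad(loss, x)
    with torch.no_grad():
        x -= 0.1 * grad                   # nudge the sample toward both guidance signals
        x += 0.01 * torch.randn_like(x)   # stands in for the diffusion update
```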

Domain Enhanced Arbitrary Image Style Transfer via Contrastive Learning

May 20, 2022
Yuxin Zhang, Fan Tang, Weiming Dong, Haibin Huang, Chongyang Ma, Tong-Yee Lee, Changsheng Xu

In this work, we tackle the challenging problem of arbitrary image style transfer using a novel style feature representation learning method. A suitable style representation, as a key component of image stylization tasks, is essential for achieving satisfactory results. Existing deep neural network based approaches achieve reasonable results with guidance from second-order statistics, such as the Gram matrix of content features. However, they do not leverage sufficient style information, which results in artifacts such as local distortions and style inconsistency. To address these issues, we propose to learn the style representation directly from image features instead of their second-order statistics, by analyzing the similarities and differences between multiple styles and considering the style distribution. Specifically, we present Contrastive Arbitrary Style Transfer (CAST), a new style representation learning and style transfer method based on contrastive learning. Our framework consists of three key components: a multi-layer style projector for style code encoding, a domain enhancement module for effective learning of the style distribution, and a generative network for image style transfer. We conduct comprehensive qualitative and quantitative evaluations demonstrating that our approach achieves significantly better results than state-of-the-art methods. Code and models are available at https://github.com/zyxElsa/CAST_pytorch.

* Accepted by SIGGRAPH 2022 
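
The contrastive style objective can be illustrated with an InfoNCE-style loss over style codes: two crops of the same artwork should map to nearby codes, while codes of other artworks are pushed apart. The projector and data below are toy stand-ins, not the paper's multi-layer style projector.

```python
# A minimal sketch of contrastive style-code learning.
import torch
import torch.nn.functional as F

projector = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(3 * 32 * 32, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 64),
)

def info_nce(z1, z2, tau=0.1):
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                  # (B, B) similarity matrix
    labels = torch.arange(z1.size(0))           # positives on the diagonal
    return F.cross_entropy(logits, labels)

crops_a = torch.rand(16, 3, 32, 32)  # one crop per artwork
crops_b = torch.rand(16, 3, 32, 32)  # a second crop of the same artworks
loss = info_nce(projector(crops_a), projector(crops_b))
loss.backward()
```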