Shuai Yang


VideoBooth: Diffusion-based Video Generation with Image Prompts

Dec 01, 2023
Yuming Jiang, Tianxing Wu, Shuai Yang, Chenyang Si, Dahua Lin, Yu Qiao, Chen Change Loy, Ziwei Liu

Text-driven video generation has witnessed rapid progress. However, text prompts alone are not enough to depict the desired subject appearance in accurate alignment with users' intents, especially for customized content creation. In this paper, we study the task of video generation with image prompts, which provide more accurate and direct content control beyond text prompts. Specifically, we propose a feed-forward framework, VideoBooth, with two dedicated designs: 1) We propose to embed image prompts in a coarse-to-fine manner. Coarse visual embeddings from the image encoder provide high-level encodings of image prompts, while fine visual embeddings from the proposed attention injection module provide multi-scale and detailed encodings of image prompts. These two complementary embeddings can faithfully capture the desired appearance. 2) In the attention injection module at the fine level, multi-scale image prompts are fed into different cross-frame attention layers as additional keys and values. This extra spatial information refines the details in the first frame, and the refinement is then propagated to the remaining frames to maintain temporal consistency. Extensive experiments demonstrate that VideoBooth achieves state-of-the-art performance in generating customized high-quality videos with subjects specified in image prompts. Notably, VideoBooth is a generalizable framework: a single model works for a wide range of image prompts in a single feed-forward pass.
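The core attention-injection mechanism can be sketched as follows: image-prompt tokens are projected into extra keys and values and concatenated with the frame tokens' own keys and values inside an attention layer, so the frames can directly attend to (and copy) the prompt's appearance. This is a minimal PyTorch sketch under assumed tensor shapes, not the released VideoBooth implementation; the class and parameter names are hypothetical.

```python
import torch
import torch.nn as nn

class CrossFrameAttnWithImagePrompt(nn.Module):
    """Illustrative module: frame tokens attend to themselves plus extra
    keys/values projected from image-prompt tokens at the same scale."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.heads = heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_kv = nn.Linear(dim, 2 * dim, bias=False)
        self.prompt_to_kv = nn.Linear(dim, 2 * dim, bias=False)  # image-prompt branch
        self.to_out = nn.Linear(dim, dim)

    def forward(self, frame_tokens, prompt_tokens):
        # frame_tokens: (B, N, D) spatial tokens of the frames being denoised
        # prompt_tokens: (B, M, D) features of the image prompt at this scale
        B, N, D = frame_tokens.shape
        h, d = self.heads, D // self.heads

        q = self.to_q(frame_tokens)
        k, v = self.to_kv(frame_tokens).chunk(2, dim=-1)
        pk, pv = self.prompt_to_kv(prompt_tokens).chunk(2, dim=-1)

        # append the prompt keys/values so the prompt appearance can be attended to
        k = torch.cat([k, pk], dim=1)
        v = torch.cat([v, pv], dim=1)

        def split_heads(t):
            return t.view(B, -1, h, d).transpose(1, 2)  # (B, h, L, d)

        q, k, v = map(split_heads, (q, k, v))
        attn = (q @ k.transpose(-2, -1)) * d ** -0.5
        out = attn.softmax(dim=-1) @ v                   # (B, h, N, d)
        out = out.transpose(1, 2).reshape(B, N, D)
        return self.to_out(out)
```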

* Project page: https://vchitect.github.io/VideoBooth-project/ 

Defect Spectrum: A Granular Look of Large-Scale Defect Datasets with Rich Semantics

Nov 06, 2023
Shuai Yang, Zhifei Chen, Pengguang Chen, Xi Fang, Shu Liu, Yingcong Chen

Defect inspection is paramount within the closed-loop manufacturing system. However, existing datasets for defect inspection often lack the precision and semantic granularity required for practical applications. In this paper, we introduce the Defect Spectrum, a comprehensive benchmark that offers precise, semantically rich, and large-scale annotations for a wide range of industrial defects. Building on four key industrial benchmarks, our dataset refines existing annotations and introduces rich semantic details, distinguishing multiple defect types within a single image. Furthermore, we introduce Defect-Gen, a two-stage diffusion-based generator designed to create high-quality and diverse defective images, even when working with limited datasets. The synthetic images generated by Defect-Gen significantly enhance the efficacy of defect inspection models. Overall, the Defect Spectrum dataset demonstrates its potential in defect inspection research, offering a solid platform for testing and refining advanced models.

Denoising Diffusion Step-aware Models

Oct 05, 2023
Shuai Yang, Yukang Chen, Luozhou Wang, Shu Liu, Yingcong Chen

Denoising Diffusion Probabilistic Models (DDPMs) have garnered popularity for data generation across various domains. However, a significant bottleneck is the necessity for whole-network computation during every step of the generative process, leading to high computational overheads. This paper presents a novel framework, Denoising Diffusion Step-aware Models (DDSM), to address this challenge. Unlike conventional approaches, DDSM employs a spectrum of neural networks whose sizes are adapted according to the importance of each generative step, as determined through evolutionary search. This step-wise network variation effectively circumvents redundant computational efforts, particularly in less critical steps, thereby enhancing the efficiency of the diffusion model. Furthermore, the step-aware design can be seamlessly integrated with other efficiency-geared diffusion models such as DDIMs and latent diffusion, thus broadening the scope of computational savings. Empirical evaluations demonstrate that DDSM achieves computational savings of 49% for CIFAR-10, 61% for CelebA-HQ, 59% for LSUN-bedroom, 71% for AFHQ, and 76% for ImageNet, all without compromising the generation quality. Our code and models will be publicly available.
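The step-aware idea can be sketched as a sampling loop that swaps the denoiser per step. The sketch below assumes a deterministic DDIM-style update and takes the set of "important" steps as given (in DDSM this per-step assignment is found by evolutionary search); the function names and network interfaces are hypothetical.

```python
import torch

def build_step_schedule(num_steps, small_net, large_net, important_steps):
    """Map each diffusion step to a denoiser of appropriate capacity:
    large model on critical steps, small model elsewhere."""
    return [large_net if t in important_steps else small_net for t in range(num_steps)]

@torch.no_grad()
def step_aware_ddim_sample(schedule, alphas_cumprod, shape):
    """Deterministic DDIM-style sampling where the denoising network is
    switched per step; each net takes (x_t, t) and returns predicted noise."""
    x = torch.randn(shape)
    for t in reversed(range(len(schedule))):
        net = schedule[t]
        eps = net(x, torch.full((shape[0],), t, dtype=torch.long))
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()   # predicted clean sample
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps
    return x
```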

DeformToon3D: Deformable 3D Toonification from Neural Radiance Fields

Sep 08, 2023
Junzhe Zhang, Yushi Lan, Shuai Yang, Fangzhou Hong, Quan Wang, Chai Kiat Yeo, Ziwei Liu, Chen Change Loy

In this paper, we address the challenging problem of 3D toonification, which involves transferring the style of an artistic domain onto a target 3D face with stylized geometry and texture. Although fine-tuning a pre-trained 3D GAN on the artistic domain can produce reasonable performance, this strategy has limitations in the 3D domain. In particular, fine-tuning can deteriorate the original GAN latent space, which affects subsequent semantic editing, and it requires independent optimization and storage for each new style, limiting flexibility and efficient deployment. To overcome these challenges, we propose DeformToon3D, an effective toonification framework tailored for hierarchical 3D GANs. Our approach decomposes 3D toonification into subproblems of geometry and texture stylization to better preserve the original latent space. Specifically, we devise a novel StyleField that predicts conditional 3D deformation to align a real-space NeRF to the style space for geometry stylization. Because the StyleField formulation already handles geometry stylization well, texture stylization can be achieved conveniently via adaptive style mixing that injects information from the artistic domain into the decoder of the pre-trained 3D GAN. Owing to this design, our method enables flexible style degree control and shape-texture-specific style swapping. Furthermore, we achieve efficient training without any real-world 2D-3D training pairs, relying instead on proxy samples synthesized from off-the-shelf 2D toonification models.
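A StyleField of the kind described above can be sketched as a small MLP that maps a real-space query point plus a style code to a 3D offset. The module below is only an illustrative assumption (dimensions, architecture, and names are hypothetical), not the released DeformToon3D code.

```python
import torch
import torch.nn as nn

class StyleField(nn.Module):
    """Illustrative style-conditioned deformation field: given 3D points in
    real space and a style code, predict offsets mapping them to style space."""

    def __init__(self, style_dim=64, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + style_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, xyz, style_code):
        # xyz: (B, N, 3) query points; style_code: (B, style_dim)
        s = style_code.unsqueeze(1).expand(-1, xyz.size(1), -1)
        delta = self.mlp(torch.cat([xyz, s], dim=-1))
        return xyz + delta  # deformed points are then queried in the real-space NeRF
```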

* ICCV 2023. Code: https://github.com/junzhezhang/DeformToon3D Project page: https://www.mmlab-ntu.com/project/deformtoon3d/ 

Scenimefy: Learning to Craft Anime Scene via Semi-Supervised Image-to-Image Translation

Aug 24, 2023
Yuxin Jiang, Liming Jiang, Shuai Yang, Chen Change Loy

Automatic high-quality rendering of anime scenes from complex real-world images is of significant practical value. The challenges of this task lie in the complexity of the scenes, the unique features of anime style, and the lack of high-quality datasets to bridge the domain gap. Despite promising attempts, previous efforts still fall short of producing satisfactory results with consistent semantic preservation, evident stylization, and fine details. In this study, we propose Scenimefy, a novel semi-supervised image-to-image translation framework that addresses these challenges. Our approach guides the learning with structure-consistent pseudo paired data, simplifying the purely unsupervised setting. The pseudo data are derived uniquely from a semantically constrained StyleGAN that leverages rich model priors such as CLIP. We further apply segmentation-guided data selection to obtain high-quality pseudo supervision. A patch-wise contrastive style loss is introduced to improve stylization and fine details. In addition, we contribute a high-resolution anime scene dataset to facilitate future research. Our extensive experiments demonstrate the superiority of our method over state-of-the-art baselines in terms of both perceptual quality and quantitative performance.
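The patch-wise contrastive style loss can be sketched as a standard InfoNCE objective over sampled patch features, where each query patch is pulled toward the patch at the same location in the stylized output and pushed away from patches at other locations. This is a generic sketch, not the exact loss used in Scenimefy; patch sampling and feature extraction are assumed to happen elsewhere.

```python
import torch
import torch.nn.functional as F

def patch_contrastive_loss(query_feats, positive_feats, tau=0.07):
    """InfoNCE over patch features.
    query_feats, positive_feats: (N, D) features of N patches sampled at the
    same spatial locations from the two images being compared."""
    q = F.normalize(query_feats, dim=1)
    p = F.normalize(positive_feats, dim=1)
    logits = q @ p.t() / tau                          # (N, N) similarities
    labels = torch.arange(q.size(0), device=q.device) # diagonal = positives
    return F.cross_entropy(logits, labels)
```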

* ICCV 2023. The first two authors contributed equally. Code: https://github.com/Yuxinn-J/Scenimefy Project page: https://yuxinn-j.github.io/projects/Scenimefy.html 

Not All Steps are Created Equal: Selective Diffusion Distillation for Image Manipulation

Jul 17, 2023
Luozhou Wang, Shuai Yang, Shu Liu, Ying-cong Chen

Conditional diffusion models have demonstrated impressive performance in image manipulation tasks. The general pipeline involves adding noise to the image and then denoising it. However, this method faces a trade-off: adding too much noise degrades the fidelity of the image, while adding too little limits its editability. This largely restricts their practical applicability. In this paper, we propose a novel framework, Selective Diffusion Distillation (SDD), that ensures both the fidelity and the editability of images. Instead of directly editing images with a diffusion model, we train a feedforward image manipulation network under the guidance of the diffusion model. In addition, we propose an effective indicator to select the semantically relevant timestep so as to obtain the correct semantic guidance from the diffusion model. This approach successfully avoids the dilemma caused by the diffusion process. Our extensive experiments demonstrate the advantages of our framework. Code is released at https://github.com/AndysonYs/Selective-Diffusion-Distillation.
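One way such a timestep indicator could look is sketched below: it scores each candidate timestep by how strongly the text condition changes the frozen diffusion model's noise prediction and picks the timestep with the largest gap. This is an assumption for illustration only, not necessarily the indicator proposed in the paper; `predict_noise` and `q_sample` are hypothetical wrappers around a frozen diffusion model.

```python
def select_semantic_timestep(predict_noise, q_sample, x0, text_emb, null_emb, candidate_ts):
    """Pick the candidate timestep where text conditioning moves the noise
    prediction the most, i.e. where semantic guidance is strongest.
    predict_noise(x_t, t, cond) and q_sample(x0, t) are hypothetical wrappers."""
    best_t, best_gap = None, -1.0
    for t in candidate_ts:
        x_t = q_sample(x0, t)                         # noise the image to level t
        eps_cond = predict_noise(x_t, t, text_emb)    # text-conditioned prediction
        eps_uncond = predict_noise(x_t, t, null_emb)  # unconditional prediction
        gap = (eps_cond - eps_uncond).abs().mean().item()
        if gap > best_gap:
            best_t, best_gap = t, gap
    return best_t
```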

Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation

Jun 13, 2023
Shuai Yang, Yifan Zhou, Ziwei Liu, Chen Change Loy

Large text-to-image diffusion models have exhibited impressive proficiency in generating high-quality images. However, when applying these models to the video domain, ensuring temporal consistency across video frames remains a formidable challenge. This paper proposes a novel zero-shot text-guided video-to-video translation framework to adapt image models to videos. The framework includes two parts: key frame translation and full video translation. The first part uses an adapted diffusion model to generate key frames, with hierarchical cross-frame constraints applied to enforce coherence in shapes, textures, and colors. The second part propagates the key frames to other frames with temporal-aware patch matching and frame blending. Our framework achieves global style and local texture temporal consistency at a low cost (without re-training or optimization). The adaptation is compatible with existing image diffusion techniques, allowing our framework to take advantage of them, such as customizing a specific subject with LoRA and introducing extra spatial guidance with ControlNet. Extensive experimental results demonstrate the effectiveness of our proposed framework over existing methods in rendering high-quality and temporally coherent videos.
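The two-stage pipeline can be summarized in schematic code: translate sparse key frames with the diffusion model (each conditioned on the previously translated key frame, a simplified stand-in for the hierarchical cross-frame constraints), then fill in the remaining frames by propagation. `translate_key_frame` and `propagate` are hypothetical callables, and the key-frame spacing is an arbitrary choice.

```python
def rerender_video(frames, translate_key_frame, propagate, key_stride=10):
    """Schematic two-stage video translation: diffusion-based key-frame
    translation followed by patch-matching/blending propagation."""
    key_ids = list(range(0, len(frames), key_stride))
    if key_ids[-1] != len(frames) - 1:
        key_ids.append(len(frames) - 1)

    translated = {}
    previous = None
    for i in key_ids:
        # condition each key frame on the previous translated one for coherence
        translated[i] = translate_key_frame(frames[i], reference=previous)
        previous = translated[i]

    output = list(frames)
    for left, right in zip(key_ids[:-1], key_ids[1:]):
        for j in range(left, right + 1):
            # blend guidance from the two surrounding translated key frames
            output[j] = propagate(frames[j],
                                  (frames[left], translated[left]),
                                  (frames[right], translated[right]))
    return output
```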

* Project page: https://anonymous-31415926.github.io/ 

GP-UNIT: Generative Prior for Versatile Unsupervised Image-to-Image Translation

Jun 07, 2023
Shuai Yang, Liming Jiang, Ziwei Liu, Chen Change Loy

Recent advances in deep learning have produced many successful unsupervised image-to-image translation models that learn correspondences between two visual domains without paired data. However, it remains a great challenge to build robust mappings between various domains, especially those with drastic visual discrepancies. In this paper, we introduce a novel versatile framework, Generative Prior-guided UNsupervised Image-to-image Translation (GP-UNIT), that improves the quality, applicability, and controllability of existing translation models. The key idea of GP-UNIT is to distill the generative prior from pre-trained class-conditional GANs to build coarse-level cross-domain correspondences, and to apply the learned prior to adversarial translations to uncover fine-level correspondences. With the learned multi-level content correspondences, GP-UNIT is able to perform valid translations between both close domains and distant domains. For close domains, GP-UNIT can be conditioned on a parameter that determines the intensity of the content correspondences during translation, allowing users to balance content and style consistency. For distant domains, semi-supervised learning is explored to guide GP-UNIT to discover accurate semantic correspondences that are hard to learn solely from appearance. We validate the superiority of GP-UNIT over state-of-the-art translation models in robust, high-quality, and diversified translation between various domains through extensive experiments.
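One possible reading of the prior-distillation step, sketched for illustration only (this is an assumption, not the released GP-UNIT code): a class-conditional GAN maps the same latent code to images of two different classes, and a content encoder is trained so that these correlated images yield matching coarse content features. `cond_gan` and `content_encoder` are hypothetical modules.

```python
import torch.nn.functional as F

def coarse_correspondence_distillation_step(content_encoder, cond_gan, z, class_a, class_b):
    """Distill a coarse cross-domain correspondence prior: images generated
    from the same latent but different classes should share coarse content."""
    img_a = cond_gan(z, class_a)        # e.g. a dog synthesized from latent z
    img_b = cond_gan(z, class_b)        # e.g. a cat synthesized from the same z
    feat_a = content_encoder(img_a)
    feat_b = content_encoder(img_b)
    return F.l1_loss(feat_a, feat_b)    # pull coarse content features together
```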

* Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). Code: https://github.com/williamyang1991/GP-UNIT Project page: https://www.mmlab-ntu.com/project/gpunit/. arXiv admin note: substantial text overlap with arXiv:2204.03641 

Graph Exploration Matters: Improving both individual-level and system-level diversity in WeChat Feed Recommender

May 29, 2023
Shuai Yang, Lixin Zhang, Feng Xia, Leyu Lin

Real industrial recommendation systems roughly comprise three stages: candidate generation (retrieval), ranking, and reranking. Individual-level diversity and system-level diversity are both important for industrial recommender systems: the former focuses on each single user's experience, while the latter focuses on the differences among users. Graph-based retrieval strategies are inevitably hijacked by heavy users and popular items, leading to the convergence of candidates across users and a lack of system-level diversity. Meanwhile, in the reranking phase, a Determinantal Point Process (DPP) is deployed to increase individual-level diversity. Because it relies heavily on the semantic information of items, DPP suffers from clickbait and inaccurate attributes. Moreover, most studies focus on only one of the two levels of diversity and ignore the mutual influence among different stages in real recommender systems. We argue that individual-level diversity and system-level diversity should be viewed as an integrated problem, and we provide an efficient and deployable solution for web-scale recommenders. Specifically, we propose to employ retrieval-graph information in diversity-based reranking, which weakens the hidden similarity among items exposed to users and consequently yields more graph exploration, improving system-level diversity. In addition, we argue that users' propensity for diversity changes over time in content feed recommendation. Therefore, with the explored graph, we also propose to capture each user's real-time personalized propensity for diversity. We implement and deploy the combined system in WeChat App's Top Stories, used by hundreds of millions of users. Offline simulations and online A/B tests show that our solution can effectively improve both user engagement and system revenue.
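At the reranking stage, DPP-based selection can be sketched with the standard greedy MAP inference below; the abstract's paper-specific twist is that the item similarity matrix would be derived (at least partly) from the retrieval graph rather than from item semantics alone. Variable and function names are illustrative.

```python
import numpy as np

def greedy_dpp_rerank(relevance, similarity, k):
    """Greedy MAP inference for a DPP with kernel L = diag(r) S diag(r).
    relevance: (n,) relevance scores; similarity: (n, n) item-item similarity,
    e.g. estimated from co-exposure in the retrieval graph."""
    relevance = np.asarray(relevance, dtype=float)
    L = similarity * np.outer(relevance, relevance)
    n = len(relevance)
    cis = np.zeros((k, n))      # Cholesky-style rows built incrementally
    d2 = np.diag(L).copy()      # remaining marginal gains
    selected = []
    for step in range(k):
        j = int(np.argmax(d2))
        if d2[j] < 1e-10:       # nothing useful left to add
            break
        e = (L[j] - cis[:step].T @ cis[:step, j]) / np.sqrt(d2[j])
        cis[step] = e
        d2 = d2 - np.square(e)
        d2[j] = -np.inf         # never re-select item j
        selected.append(j)
    return selected
```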
