Fan Bao

ControlVideo: Adding Conditional Control for One Shot Text-to-Video Editing

May 26, 2023
Min Zhao, Rongzhen Wang, Fan Bao, Chongxuan Li, Jun Zhu

In this paper, we present ControlVideo, a novel method for text-driven video editing. Leveraging the capabilities of text-to-image diffusion models and ControlNet, ControlVideo aims to enhance the fidelity and temporal consistency of videos that align with a given text while preserving the structure of the source video. This is achieved by incorporating additional conditions such as edge maps and by fine-tuning the key-frame and temporal attention on the source video-text pair with carefully designed strategies. An in-depth exploration of ControlVideo's design is conducted to inform future research on one-shot tuning of video diffusion models. Quantitatively, ControlVideo outperforms a range of competitive baselines in faithfulness and consistency while still aligning with the textual prompt. Additionally, it delivers videos with high visual realism and fidelity to the source content, demonstrates flexibility in utilizing controls that contain varying degrees of source video information, and shows potential for combining multiple controls. The project page is available at https://ml.cs.tsinghua.edu.cn/controlvideo/.
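
To make the edge-map conditioning concrete, here is a minimal single-frame sketch using the ControlNet pipeline from the diffusers library. It shows only the conditioning mechanism that ControlVideo builds on, not the paper's method itself (the key-frame and temporal attention fine-tuning is omitted), and the file name below is illustrative.

    import cv2
    import numpy as np
    import torch
    from PIL import Image
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

    # Extract a Canny edge map from one source frame (file name is illustrative).
    frame = cv2.imread("source_frame.png")
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200)
    edge_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

    # Load a Canny-conditioned ControlNet and attach it to Stable Diffusion.
    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
        torch_dtype=torch.float16).to("cuda")

    # The edge map preserves the source structure while the prompt drives the edit.
    edited = pipe("a watercolor painting of the same scene",
                  image=edge_image, num_inference_steps=30).images[0]
    edited.save("edited_frame.png")

ControlVideo applies this kind of structural control per frame and tunes attention layers on the single source video-text pair to keep edits temporally consistent.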

ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation

May 25, 2023
Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, Jun Zhu

Score distillation sampling (SDS) has shown great promise in text-to-3D generation by distilling pretrained large-scale text-to-image diffusion models, but it suffers from over-saturation, over-smoothing, and low-diversity problems. In this work, we propose to model the 3D parameter as a random variable instead of a constant as in SDS, and present variational score distillation (VSD), a principled particle-based variational framework to explain and address the aforementioned issues in text-to-3D generation. We show that SDS is a special case of VSD and leads to poor samples with both small and large CFG weights. In comparison, VSD works well with various CFG weights, as in ancestral sampling from diffusion models, and simultaneously improves the diversity and sample quality with a common CFG weight (i.e., $7.5$). We further present various improvements in the design space for text-to-3D, such as the distillation time schedule and density initialization, which are orthogonal to the distillation algorithm yet not well explored. Our overall approach, dubbed ProlificDreamer, can generate NeRF at high rendering resolution (i.e., $512\times512$) and high fidelity, with rich structure and complex effects (e.g., smoke and drops). Further, initialized from NeRF, meshes fine-tuned by VSD are meticulously detailed and photo-realistic. Project page: https://ml.cs.tsinghua.edu.cn/prolificdreamer/.

* Project page: https://ml.cs.tsinghua.edu.cn/prolificdreamer/ 
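
A sketch of the contrast between the two gradients, based on the description above; the notation is an assumption and the paper's exact weighting and parameterization are in the full text. With a differentiable renderer $g(\theta, c)$ for camera $c$, noisy rendering $x_t$ at timestep $t$, injected noise $\epsilon$, and a pretrained text-to-image noise predictor $\epsilon_{\mathrm{pre}}$:

    \nabla_\theta \mathcal{L}_{\mathrm{SDS}}(\theta) \approx \mathbb{E}_{t,\epsilon,c}\left[ \omega(t)\,\big(\epsilon_{\mathrm{pre}}(x_t; y, t) - \epsilon\big)\,\tfrac{\partial g(\theta, c)}{\partial \theta} \right]

    \nabla_\theta \mathcal{L}_{\mathrm{VSD}}(\theta) \approx \mathbb{E}_{t,\epsilon,c}\left[ \omega(t)\,\big(\epsilon_{\mathrm{pre}}(x_t; y, t) - \epsilon_\phi(x_t; y, t, c)\big)\,\tfrac{\partial g(\theta, c)}{\partial \theta} \right]

Here $\epsilon_\phi$ is an auxiliary network that estimates the score of the current particles' renderings, so SDS is recovered when $\epsilon_\phi$ is replaced by the injected noise $\epsilon$.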

A Closer Look at Parameter-Efficient Tuning in Diffusion Models

Apr 12, 2023
Chendong Xiang, Fan Bao, Chongxuan Li, Hang Su, Jun Zhu

Large-scale diffusion models like Stable Diffusion are powerful and find various real-world applications, but customizing such models by fine-tuning is both memory- and time-inefficient. Motivated by recent progress in natural language processing, we investigate parameter-efficient tuning in large diffusion models by inserting small learnable modules (termed adapters). In particular, we decompose the design space of adapters into orthogonal factors: the input position, the output position, and the function form, and perform Analysis of Variance (ANOVA), a classical statistical approach for analyzing the correlation between discrete variables (design options) and continuous variables (evaluation metrics). Our analysis suggests that the input position of adapters is the critical factor influencing downstream-task performance. We then carefully study the choice of the input position and find that placing the input position after the cross-attention block leads to the best performance, as validated by additional visualization analyses. Finally, we provide a recipe for parameter-efficient tuning in diffusion models that is comparable, if not superior, to the fully fine-tuned baseline (e.g., DreamBooth) with only 0.75% extra parameters, across various customized tasks.

* 8 pages; our code is now available at https://github.com/Xiang-cd/unet-finetune 
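
A minimal sketch of the bottleneck-adapter idea studied here, wired so that its input comes from the output of a cross-attention block as the analysis recommends; the module and wiring below are illustrative assumptions, not the released implementation.

    import torch
    import torch.nn as nn

    class Adapter(nn.Module):
        """Bottleneck adapter: down-project, nonlinearity, up-project, residual add."""
        def __init__(self, dim: int, bottleneck: int = 64):
            super().__init__()
            self.down = nn.Linear(dim, bottleneck)
            self.act = nn.GELU()
            self.up = nn.Linear(bottleneck, dim)
            nn.init.zeros_(self.up.weight)  # start as an identity mapping
            nn.init.zeros_(self.up.bias)

        def forward(self, x):
            return x + self.up(self.act(self.down(x)))

    # Only the adapter parameters are trained; the diffusion UNet stays frozen.
    hidden = torch.randn(2, 77, 320)   # stand-in for a cross-attention block's output
    adapter = Adapter(dim=320)
    out = adapter(hidden)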

One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale

Mar 12, 2023
Fan Bao, Shen Nie, Kaiwen Xue, Chongxuan Li, Shi Pu, Yaole Wang, Gang Yue, Yue Cao, Hang Su, Jun Zhu

This paper proposes a unified diffusion framework (dubbed UniDiffuser) to fit all distributions relevant to a set of multi-modal data in one model. Our key insight is that learning diffusion models for marginal, conditional, and joint distributions can be unified as predicting the noise in the perturbed data, where the perturbation levels (i.e., timesteps) can be different for different modalities. Inspired by this unified view, UniDiffuser learns all distributions simultaneously with a minimal modification to the original diffusion model: it perturbs data in all modalities instead of a single modality, inputs individual timesteps for different modalities, and predicts the noise of all modalities instead of a single modality. UniDiffuser is parameterized by a transformer for diffusion models to handle input types of different modalities. Implemented on large-scale paired image-text data, UniDiffuser is able to perform image generation, text generation, text-to-image generation, image-to-text generation, and image-text pair generation by setting proper timesteps, without additional overhead. In particular, UniDiffuser produces perceptually realistic samples in all tasks, and its quantitative results (e.g., FID and CLIP score) are not only superior to existing general-purpose models but also comparable to bespoke models (e.g., Stable Diffusion and DALL-E 2) on representative tasks (e.g., text-to-image generation).
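
A minimal sketch of the training objective described above, assuming a joint noise-prediction network eps_model(img_t, txt_t, t_img, t_txt) that returns one noise estimate per modality; the schedule and tensor shapes are illustrative.

    import torch

    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)   # standard linear-beta schedule

    def unidiffuser_loss(eps_model, img_latent, txt_embedding):
        b = img_latent.shape[0]
        # Each modality gets its own, independently sampled timestep.
        t_img = torch.randint(0, T, (b,))
        t_txt = torch.randint(0, T, (b,))

        noise_img = torch.randn_like(img_latent)
        noise_txt = torch.randn_like(txt_embedding)

        # Perturb each modality at its own noise level.
        a_img = alpha_bars[t_img].view(-1, 1, 1, 1)
        a_txt = alpha_bars[t_txt].view(-1, 1)
        img_t = a_img.sqrt() * img_latent + (1 - a_img).sqrt() * noise_img
        txt_t = a_txt.sqrt() * txt_embedding + (1 - a_txt).sqrt() * noise_txt

        # One network predicts the noise of both modalities at their own levels.
        pred_img, pred_txt = eps_model(img_t, txt_t, t_img, t_txt)
        return ((pred_img - noise_img) ** 2).mean() + ((pred_txt - noise_txt) ** 2).mean()

At sampling time the same network covers every task by fixing the timesteps: t_txt = 0 (clean text) gives text-to-image generation, t_img = 0 gives image-to-text, setting a modality's timestep to the maximum (pure noise, i.e., an uninformative input) gives marginal generation of the other modality, and denoising both jointly gives image-text pair generation.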

Diffusion Models and Semi-Supervised Learners Benefit Mutually with Few Labels

Feb 21, 2023
Zebin You, Yong Zhong, Fan Bao, Jiacheng Sun, Chongxuan Li, Jun Zhu

We propose a three-stage training strategy called dual pseudo training (DPT) for conditional image generation and classification in semi-supervised learning. First, a classifier is trained on partially labeled data and predicts pseudo labels for all data. Second, a conditional generative model is trained on all data with pseudo labels and generates pseudo images given labels. Finally, the classifier is trained on real data augmented by pseudo images with labels. We demonstrate that large-scale diffusion models and semi-supervised learners benefit mutually with a few labels via DPT. In particular, on the ImageNet 256x256 generation benchmark, DPT can generate realistic, diverse, and semantically correct images with very few labels. With two (i.e., < 0.2%) and five (i.e., < 0.4%) labels per class, DPT achieves FIDs of 3.44 and 3.37, respectively, outperforming strong diffusion models with full labels, such as IDDPM, CDM, ADM, and LDM. Besides, DPT substantially outperforms competitive semi-supervised baselines on ImageNet classification benchmarks with one, two, and five labels per class, achieving state-of-the-art top-1 accuracies of 59.0 (+2.8), 69.5 (+3.0), and 73.6 (+1.2), respectively.
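
The three stages can be written out as a short pseudocode sketch; every helper below (train_classifier, train_conditional_generator, and the methods on the returned objects) is a hypothetical placeholder rather than the paper's code.

    def dual_pseudo_training(labeled_images, labels, all_images,
                             train_classifier, train_conditional_generator,
                             n_per_class=1000):
        # Stage 1: classifier on the few labeled examples, then pseudo-label everything.
        classifier = train_classifier(labeled_images, labels)
        pseudo_labels = classifier.predict(all_images)

        # Stage 2: conditional generator on all data with pseudo labels,
        # then sample pseudo images for each class.
        generator = train_conditional_generator(all_images, pseudo_labels)
        pseudo_images, pseudo_targets = generator.sample_per_class(n_per_class)

        # Stage 3: retrain the classifier on real data augmented with generated pairs
        # (plain Python lists are assumed for brevity).
        return train_classifier(labeled_images + pseudo_images, labels + pseudo_targets)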

Revisiting Discriminative vs. Generative Classifiers: Theory and Implications

Feb 05, 2023
Chenyu Zheng, Guoqiang Wu, Fan Bao, Yue Cao, Chongxuan Li, Jun Zhu

A large-scale deep model pre-trained on massive labeled or unlabeled data transfers well to downstream tasks. Linear evaluation freezes the parameters of the pre-trained model and trains a linear classifier separately, which is efficient and attractive for transfer. However, little work has investigated the classifier used in linear evaluation beyond the default logistic regression. Inspired by the statistical efficiency of naive Bayes, this paper revisits the classical topic of discriminative vs. generative classifiers. Theoretically, the paper considers the surrogate loss instead of the zero-one loss in its analyses and generalizes the classical results from binary to multiclass cases. We show that, under mild assumptions, multiclass naive Bayes requires $O(\log n)$ samples to approach its asymptotic error, while the corresponding multiclass logistic regression requires $O(n)$ samples, where $n$ is the feature dimension. To establish this, we present a multiclass $\mathcal{H}$-consistency bound framework and an explicit bound for the logistic loss, which are of independent interest. Simulation results on a mixture of Gaussians validate our theoretical findings. Experiments on various pre-trained deep vision models show that naive Bayes consistently converges faster as the amount of data increases. Besides, naive Bayes shows promise in few-shot cases, and we observe the "two regimes" phenomenon in pre-trained supervised models. Our code is available at https://github.com/ML-GSAI/Revisiting-Dis-vs-Gen-Classifiers.

* 57 pages 
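
A toy linear-evaluation comparison in the spirit of the experiments, with synthetic features standing in for pre-trained deep features; only the experimental protocol is meant to carry over, not the reported numbers, and the exact crossover point depends on the feature distribution.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    # Synthetic "features" playing the role of frozen pre-trained representations.
    X, y = make_classification(n_samples=5000, n_features=256, n_informative=64,
                               n_classes=5, n_clusters_per_class=1, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                        random_state=0)

    for n in (50, 200, 1000):   # from few labels to a larger budget
        nb = GaussianNB().fit(X_train[:n], y_train[:n])
        lr = LogisticRegression(max_iter=2000).fit(X_train[:n], y_train[:n])
        print(f"n={n}  naive Bayes: {nb.score(X_test, y_test):.3f}  "
              f"logistic: {lr.score(X_test, y_test):.3f}")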

Why Are Conditional Generative Models Better Than Unconditional Ones?

Dec 01, 2022
Fan Bao, Chongxuan Li, Jiacheng Sun, Jun Zhu

Extensive empirical evidence demonstrates that conditional generative models are easier to train and perform better than unconditional ones by exploiting the labels of data; the same holds for score-based diffusion models. In this paper, we analyze the phenomenon formally and identify that the key to conditional learning is partitioning the data properly. Inspired by the analysis, we propose self-conditioned diffusion models (SCDM), which are trained conditioned on cluster indices produced by running k-means on features extracted by a model pre-trained in a self-supervised manner. SCDM significantly improves the unconditional model across various datasets and achieves a record-breaking FID of 3.94 on ImageNet 64x64 without labels. Besides, SCDM achieves a slightly better FID than the corresponding conditional model on CIFAR10.
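
A minimal sketch of how SCDM's conditioning signal can be produced; the feature array and the trainer call are placeholders, and the paper's choice of self-supervised encoder and number of clusters may differ.

    import numpy as np
    from sklearn.cluster import KMeans

    # Placeholder for features from a frozen self-supervised encoder
    # (one embedding vector per training image).
    features = np.random.randn(10000, 512).astype(np.float32)

    # Cluster the features; each image's cluster index becomes its pseudo class.
    kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(features)
    pseudo_labels = kmeans.labels_

    # The diffusion model is then trained class-conditionally on these indices,
    # exactly as it would be on human labels (hypothetical trainer):
    # train_conditional_diffusion(images, condition=pseudo_labels)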

DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models

Nov 02, 2022
Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, Jun Zhu

Diffusion probabilistic models (DPMs) have achieved impressive success in high-resolution image synthesis, especially in recent large-scale text-to-image generation applications. An essential technique for improving the sample quality of DPMs is guided sampling, which usually needs a large guidance scale to obtain the best sample quality. The commonly used fast sampler for guided sampling is DDIM, a first-order diffusion ODE solver that generally needs 100 to 250 steps for high-quality samples. Although recent works propose dedicated high-order solvers and achieve a further speedup for sampling without guidance, their effectiveness for guided sampling has not been well tested. In this work, we demonstrate that previous high-order fast samplers suffer from instability issues and even become slower than DDIM when the guidance scale grows large. To further speed up guided sampling, we propose DPM-Solver++, a high-order solver for the guided sampling of DPMs. DPM-Solver++ solves the diffusion ODE with the data prediction model and adopts thresholding methods to keep the solution consistent with the training data distribution. We further propose a multistep variant of DPM-Solver++ to address the instability issue by reducing the effective step size. Experiments show that DPM-Solver++ can generate high-quality samples within only 15 to 20 steps of guided sampling with both pixel-space and latent-space DPMs.
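
For context, a multistep DPM-Solver++ implementation ships as a scheduler in the Hugging Face diffusers library; a short usage sketch follows (library usage rather than the paper's reference code, and argument details may vary across diffusers versions).

    import torch
    from diffusers import DPMSolverMultistepScheduler, StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

    # Swap in the multistep DPM-Solver++ scheduler.
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(
        pipe.scheduler.config, algorithm_type="dpmsolver++")

    # Guided sampling in roughly 15-20 steps, per the abstract's claim.
    image = pipe("a photo of an astronaut riding a horse",
                 num_inference_steps=20, guidance_scale=7.5).images[0]
    image.save("sample.png")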

Equivariant Energy-Guided SDE for Inverse Molecular Design

Sep 30, 2022
Fan Bao, Min Zhao, Zhongkai Hao, Peiyao Li, Chongxuan Li, Jun Zhu

Inverse molecular design is critical in material science and drug discovery, where the generated molecules should satisfy certain desirable properties. In this paper, we propose equivariant energy-guided stochastic differential equations (EEGSDE), a flexible framework for controllable 3D molecule generation under the guidance of an energy function in diffusion models. Formally, we show that EEGSDE naturally exploits the geometric symmetry of 3D molecular conformations, as long as the energy function is invariant to orthogonal transformations. Empirically, under the guidance of designed energy functions, EEGSDE significantly improves over the baseline on QM9 in inverse molecular design targeting quantum properties and molecular structures. Furthermore, EEGSDE is able to generate molecules with multiple target properties by linearly combining the corresponding energy functions.
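
In the usual notation for energy guidance (a sketch consistent with the abstract; the precise scaling and the equivariance analysis are in the full text), the reverse-time generative SDE is steered by the gradient of the energy in addition to the learned score $s_\theta$:

    dz = \left[ f(z, t) - g(t)^2 \big( s_\theta(z, t) - \lambda\, \nabla_z E(z, c, t) \big) \right] dt + g(t)\, d\bar{w}

The requirement stated above is the invariance $E(Rz, c, t) = E(z, c, t)$ for every orthogonal transformation $R$, which lets the guided process respect the geometric symmetry of 3D molecular conformations; targeting several properties at once amounts to guiding with a weighted sum of the corresponding energies.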
