Alert button
Picture for Cheng Lu

Cheng Lu

Alert button

The Blessing of Randomness: SDE Beats ODE in General Diffusion-based Image Editing

Nov 02, 2023
Shen Nie, Hanzhong Allan Guo, Cheng Lu, Yuhao Zhou, Chenyu Zheng, Chongxuan Li

Figure 1 for The Blessing of Randomness: SDE Beats ODE in General Diffusion-based Image Editing
Figure 2 for The Blessing of Randomness: SDE Beats ODE in General Diffusion-based Image Editing
Figure 3 for The Blessing of Randomness: SDE Beats ODE in General Diffusion-based Image Editing
Figure 4 for The Blessing of Randomness: SDE Beats ODE in General Diffusion-based Image Editing

We present a unified probabilistic formulation for diffusion-based image editing, where a latent variable is edited in a task-specific manner and generally deviates from the corresponding marginal distribution induced by the original stochastic or ordinary differential equation (SDE or ODE). Instead, it defines a corresponding SDE or ODE for editing. In the formulation, we prove that the Kullback-Leibler divergence between the marginal distributions of the two SDEs gradually decreases while that for the ODEs remains as the time approaches zero, which shows the promise of SDE in image editing. Inspired by it, we provide the SDE counterparts for widely used ODE baselines in various tasks including inpainting and image-to-image translation, where SDE shows a consistent and substantial improvement. Moreover, we propose SDE-Drag -- a simple yet effective method built upon the SDE formulation for point-based content dragging. We build a challenging benchmark (termed DragBench) with open-set natural, art, and AI-generated images for evaluation. A user study on DragBench indicates that SDE-Drag significantly outperforms our ODE baseline, existing diffusion-based methods, and the renowned DragGAN. Our results demonstrate the superiority and versatility of SDE in image editing and push the boundary of diffusion-based editing methods.

Viaarxiv icon

Gaussian Mixture Solvers for Diffusion Models

Nov 02, 2023
Hanzhong Guo, Cheng Lu, Fan Bao, Tianyu Pang, Shuicheng Yan, Chao Du, Chongxuan Li

Recently, diffusion models have achieved great success in generative tasks. Sampling from diffusion models is equivalent to solving the reverse diffusion stochastic differential equations (SDEs) or the corresponding probability flow ordinary differential equations (ODEs). In comparison, SDE-based solvers can generate samples of higher quality and are suited for image translation tasks like stroke-based synthesis. During inference, however, existing SDE-based solvers are severely constrained by the efficiency-effectiveness dilemma. Our investigation suggests that this is because the Gaussian assumption in the reverse transition kernel is frequently violated (even in the case of simple mixture data) given a limited number of discretization steps. To overcome this limitation, we introduce a novel class of SDE-based solvers called \emph{Gaussian Mixture Solvers (GMS)} for diffusion models. Our solver estimates the first three-order moments and optimizes the parameters of a Gaussian mixture transition kernel using generalized methods of moments in each step during sampling. Empirically, our solver outperforms numerous SDE-based solvers in terms of sample quality in image generation and stroke-based synthesis in various diffusion models, which validates the motivation and effectiveness of GMS. Our code is available at https://github.com/Guohanzhong/GMS.

* NeurIPS 2023 
Viaarxiv icon

DPM-Solver-v3: Improved Diffusion ODE Solver with Empirical Model Statistics

Oct 28, 2023
Kaiwen Zheng, Cheng Lu, Jianfei Chen, Jun Zhu

Diffusion probabilistic models (DPMs) have exhibited excellent performance for high-fidelity image generation while suffering from inefficient sampling. Recent works accelerate the sampling procedure by proposing fast ODE solvers that leverage the specific ODE form of DPMs. However, they highly rely on specific parameterization during inference (such as noise/data prediction), which might not be the optimal choice. In this work, we propose a novel formulation towards the optimal parameterization during sampling that minimizes the first-order discretization error of the ODE solution. Based on such formulation, we propose DPM-Solver-v3, a new fast ODE solver for DPMs by introducing several coefficients efficiently computed on the pretrained model, which we call empirical model statistics. We further incorporate multistep methods and a predictor-corrector framework, and propose some techniques for improving sample quality at small numbers of function evaluations (NFE) or large guidance scales. Experiments show that DPM-Solver-v3 achieves consistently better or comparable performance in both unconditional and conditional sampling with both pixel-space and latent-space DPMs, especially in 5$\sim$10 NFEs. We achieve FIDs of 12.21 (5 NFE), 2.51 (10 NFE) on unconditional CIFAR10, and MSE of 0.55 (5 NFE, 7.5 guidance scale) on Stable Diffusion, bringing a speed-up of 15%$\sim$30% compared to previous state-of-the-art training-free methods. Code is available at https://github.com/thu-ml/DPM-Solver-v3.

* Accepted at NeurIPS 2023 
Viaarxiv icon

Score Regularized Policy Optimization through Diffusion Behavior

Oct 12, 2023
Huayu Chen, Cheng Lu, Zhengyi Wang, Hang Su, Jun Zhu

Figure 1 for Score Regularized Policy Optimization through Diffusion Behavior
Figure 2 for Score Regularized Policy Optimization through Diffusion Behavior
Figure 3 for Score Regularized Policy Optimization through Diffusion Behavior
Figure 4 for Score Regularized Policy Optimization through Diffusion Behavior

Recent developments in offline reinforcement learning have uncovered the immense potential of diffusion modeling, which excels at representing heterogeneous behavior policies. However, sampling from diffusion policies is considerably slow because it necessitates tens to hundreds of iterative inference steps for one action. To address this issue, we propose to extract an efficient deterministic inference policy from critic models and pretrained diffusion behavior models, leveraging the latter to directly regularize the policy gradient with the behavior distribution's score function during optimization. Our method enjoys powerful generative capabilities of diffusion modeling while completely circumventing the computationally intensive and time-consuming diffusion sampling scheme, both during training and evaluation. Extensive results on D4RL tasks show that our method boosts action sampling speed by more than 25 times compared with various leading diffusion-based methods in locomotion tasks, while still maintaining state-of-the-art performance.

* 18 pages 
Viaarxiv icon

Learning to Rank Onset-Occurring-Offset Representations for Micro-Expression Recognition

Oct 07, 2023
Jie Zhu, Yuan Zong, Jingang Shi, Cheng Lu, Hongli Chang, Wenming Zheng

Figure 1 for Learning to Rank Onset-Occurring-Offset Representations for Micro-Expression Recognition
Figure 2 for Learning to Rank Onset-Occurring-Offset Representations for Micro-Expression Recognition
Figure 3 for Learning to Rank Onset-Occurring-Offset Representations for Micro-Expression Recognition
Figure 4 for Learning to Rank Onset-Occurring-Offset Representations for Micro-Expression Recognition

This paper focuses on the research of micro-expression recognition (MER) and proposes a flexible and reliable deep learning method called learning to rank onset-occurring-offset representations (LTR3O). The LTR3O method introduces a dynamic and reduced-size sequence structure known as 3O, which consists of onset, occurring, and offset frames, for representing micro-expressions (MEs). This structure facilitates the subsequent learning of ME-discriminative features. A noteworthy advantage of the 3O structure is its flexibility, as the occurring frame is randomly extracted from the original ME sequence without the need for accurate frame spotting methods. Based on the 3O structures, LTR3O generates multiple 3O representation candidates for each ME sample and incorporates well-designed modules to measure and calibrate their emotional expressiveness. This calibration process ensures that the distribution of these candidates aligns with that of macro-expressions (MaMs) over time. Consequently, the visibility of MEs can be implicitly enhanced, facilitating the reliable learning of more discriminative features for MER. Extensive experiments were conducted to evaluate the performance of LTR3O using three widely-used ME databases: CASME II, SMIC, and SAMM. The experimental results demonstrate the effectiveness and superior performance of LTR3O, particularly in terms of its flexibility and reliability, when compared to recent state-of-the-art MER methods.

Viaarxiv icon

Layer-Adapted Implicit Distribution Alignment Networks for Cross-Corpus Speech Emotion Recognition

Oct 06, 2023
Yan Zhao, Yuan Zong, Jincen Wang, Hailun Lian, Cheng Lu, Li Zhao, Wenming Zheng

Figure 1 for Layer-Adapted Implicit Distribution Alignment Networks for Cross-Corpus Speech Emotion Recognition
Figure 2 for Layer-Adapted Implicit Distribution Alignment Networks for Cross-Corpus Speech Emotion Recognition
Figure 3 for Layer-Adapted Implicit Distribution Alignment Networks for Cross-Corpus Speech Emotion Recognition
Figure 4 for Layer-Adapted Implicit Distribution Alignment Networks for Cross-Corpus Speech Emotion Recognition

In this paper, we propose a new unsupervised domain adaptation (DA) method called layer-adapted implicit distribution alignment networks (LIDAN) to address the challenge of cross-corpus speech emotion recognition (SER). LIDAN extends our previous ICASSP work, deep implicit distribution alignment networks (DIDAN), whose key contribution lies in the introduction of a novel regularization term called implicit distribution alignment (IDA). This term allows DIDAN trained on source (training) speech samples to remain applicable to predicting emotion labels for target (testing) speech samples, regardless of corpus variance in cross-corpus SER. To further enhance this method, we extend IDA to layer-adapted IDA (LIDA), resulting in LIDAN. This layer-adpated extention consists of three modified IDA terms that consider emotion labels at different levels of granularity. These terms are strategically arranged within different fully connected layers in LIDAN, aligning with the increasing emotion-discriminative abilities with respect to the layer depth. This arrangement enables LIDAN to more effectively learn emotion-discriminative and corpus-invariant features for SER across various corpora compared to DIDAN. It is also worthy to mention that unlike most existing methods that rely on estimating statistical moments to describe pre-assumed explicit distributions, both IDA and LIDA take a different approach. They utilize an idea of target sample reconstruction to directly bridge the feature distribution gap without making assumptions about their distribution type. As a result, DIDAN and LIDAN can be viewed as implicit cross-corpus SER methods. To evaluate LIDAN, we conducted extensive cross-corpus SER experiments on EmoDB, eNTERFACE, and CASIA corpora. The experimental results demonstrate that LIDAN surpasses recent state-of-the-art explicit unsupervised DA methods in tackling cross-corpus SER tasks.

Viaarxiv icon

Time-Frequency Transformer: A Novel Time Frequency Joint Learning Method for Speech Emotion Recognition

Aug 28, 2023
Yong Wang, Cheng Lu, Yuan Zong, Hailun Lian, Yan Zhao, Sunan Li

Figure 1 for Time-Frequency Transformer: A Novel Time Frequency Joint Learning Method for Speech Emotion Recognition
Figure 2 for Time-Frequency Transformer: A Novel Time Frequency Joint Learning Method for Speech Emotion Recognition
Figure 3 for Time-Frequency Transformer: A Novel Time Frequency Joint Learning Method for Speech Emotion Recognition
Figure 4 for Time-Frequency Transformer: A Novel Time Frequency Joint Learning Method for Speech Emotion Recognition

In this paper, we propose a novel time-frequency joint learning method for speech emotion recognition, called Time-Frequency Transformer. Its advantage is that the Time-Frequency Transformer can excavate global emotion patterns in the time-frequency domain of speech signal while modeling the local emotional correlations in the time domain and frequency domain respectively. For the purpose, we first design a Time Transformer and Frequency Transformer to capture the local emotion patterns between frames and inside frequency bands respectively, so as to ensure the integrity of the emotion information modeling in both time and frequency domains. Then, a Time-Frequency Transformer is proposed to mine the time-frequency emotional correlations through the local time-domain and frequency-domain emotion features for learning more discriminative global speech emotion representation. The whole process is a time-frequency joint learning process implemented by a series of Transformer models. Experiments on IEMOCAP and CASIA databases indicate that our proposed method outdoes the state-of-the-art methods.

* Accepted by International Conference on Neural Information Processing (ICONIP2023) 
Viaarxiv icon