Emiel Hoogeboom

Beyond Single Tokens: Distilling Discrete Diffusion Models via Discrete MMD

Mar 20, 2026

Emiel Hoogeboom, David Ruhe, Jonathan Heek, Thomas Mensink, Tim Salimans

Abstract: It is currently difficult to distill discrete diffusion models. In contrast, the continuous diffusion literature has many distillation methods that can reduce sampling to a handful of steps. Our method, Discrete Moment Matching Distillation (D-MMD), leverages ideas that have been highly successful in the continuous domain. Whereas previous discrete distillation methods collapse, D-MMD maintains high quality and diversity (given sufficient sampling steps). This is demonstrated on both text and image datasets. Moreover, the newly distilled generators can outperform their teachers.

Unified Latents (UL): How to train your latents

Feb 19, 2026

Jonathan Heek, Emiel Hoogeboom, Thomas Mensink, Tim Salimans

Abstract: We present Unified Latents (UL), a framework for learning latent representations that are jointly regularized by a diffusion prior and decoded by a diffusion model. By linking the encoder's output noise to the prior's minimum noise level, we obtain a simple training objective that provides a tight upper bound on the latent bitrate. On ImageNet-512, our approach achieves a competitive FID of 1.4, with high reconstruction quality (PSNR) while requiring fewer training FLOPs than models trained on Stable Diffusion latents. On Kinetics-600, we set a new state-of-the-art FVD of 1.3.

Model Integrity when Unlearning with T2I Diffusion Models

Nov 04, 2024

Andrea Schioppa, Emiel Hoogeboom, Jonathan Heek

Abstract: The rapid advancement of text-to-image Diffusion Models has led to their widespread public accessibility. However, these models, trained on large internet datasets, can sometimes generate undesirable outputs. To mitigate this, approximate Machine Unlearning algorithms have been proposed to modify model weights to reduce the generation of specific types of images, characterized by samples from a "forget distribution", while preserving the model's ability to generate other images, characterized by samples from a "retain distribution". While these methods aim to minimize the influence of training data in the forget distribution without extensive additional computation, we point out that they can compromise the model's integrity by inadvertently affecting generation for images in the retain distribution. Recognizing the limitations of FID and CLIPScore in capturing these effects, we introduce a novel retention metric that directly assesses the perceptual difference between outputs generated by the original and the unlearned models. We then propose unlearning algorithms that demonstrate superior effectiveness in preserving model integrity compared to existing baselines. Given their straightforward implementation, these algorithms serve as valuable benchmarks for future advancements in approximate Machine Unlearning for Diffusion Models.

Simpler Diffusion (SiD2): 1.5 FID on ImageNet512 with pixel-space diffusion

Oct 25, 2024

Emiel Hoogeboom, Thomas Mensink, Jonathan Heek, Kay Lamerigts, Ruiqi Gao, Tim Salimans

Abstract: Latent diffusion models have become the popular choice for scaling up diffusion models for high resolution image synthesis. Compared to pixel-space models that are trained end-to-end, latent models are perceived to be more efficient and to produce higher image quality at high resolution. Here we challenge these notions, and show that pixel-space models can in fact be very competitive with latent approaches in both quality and efficiency, achieving 1.5 FID on ImageNet512 and new SOTA results on ImageNet128 and ImageNet256. We present a simple recipe for scaling end-to-end pixel-space diffusion models to high resolutions. 1: Use the sigmoid loss (Kingma & Gao, 2023) with our prescribed hyper-parameters. 2: Use our simplified memory-efficient architecture with fewer skip-connections. 3: Scale the model to favor processing the image at high resolution with fewer parameters, rather than using more parameters but at a lower resolution. When combining these three steps with recently proposed tricks like guidance intervals, we obtain a family of pixel-space diffusion models we call Simple Diffusion v2 (SiD2).
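
As a rough illustration of step 1 of the recipe, the sigmoid weighting of Kingma & Gao (2023) is commonly written as a function of the log signal-to-noise ratio, w(logsnr) = sigmoid(bias - logsnr). The sketch below assumes that form. The function name `sigmoid_weight` and the default `bias` are placeholders for illustration, not the hyper-parameters the paper prescribes, and the abstract does not say which prediction target the weight multiplies.

```python
import math


def sigmoid_weight(logsnr: float, bias: float = 0.0) -> float:
    """Loss weight sigmoid(bias - logsnr), evaluated without overflow.

    `bias` is a hypothetical placeholder, not a value from the paper.
    """
    x = bias - logsnr
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


# Noisy inputs (low log-SNR) get weights near 1,
# nearly clean inputs (high log-SNR) get weights near 0.
weights = [sigmoid_weight(logsnr) for logsnr in (-10.0, 0.0, 10.0)]
```

Under this assumed form, raising `bias` shifts the emphasis toward higher log-SNR (cleaner) inputs.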

Multistep Distillation of Diffusion Models via Moment Matching

Jun 06, 2024

Tim Salimans, Thomas Mensink, Jonathan Heek, Emiel Hoogeboom

Abstract: We present a new method for making diffusion models faster to sample. The method distills many-step diffusion models into few-step models by matching conditional expectations of the clean data given noisy data along the sampling trajectory. Our approach extends recently proposed one-step methods to the multi-step case, and provides a new perspective by interpreting these approaches in terms of moment matching. By using up to 8 sampling steps, we obtain distilled models that outperform not only their one-step versions but also their original many-step teacher models, obtaining new state-of-the-art results on the ImageNet dataset. We also show promising results on a large text-to-image model where we achieve fast generation of high resolution images directly in image space, without needing autoencoders or upsamplers.

Semantica: An Adaptable Image-Conditioned Diffusion Model

May 23, 2024

Manoj Kumar, Neil Houlsby, Emiel Hoogeboom

Abstract: We investigate the task of adapting image generative models to different datasets without fine-tuning. To this end, we introduce Semantica, an image-conditioned diffusion model capable of generating images based on the semantics of a conditioning image. Semantica is trained exclusively on web-scale image pairs, that is, it receives a random image from a webpage as conditional input and models another random image from the same webpage. Our experiments highlight the expressivity of pretrained image encoders and the necessity of semantic-based data filtering in achieving high-quality image generation. Once trained, it can adaptively generate new images from a dataset by simply using images from that dataset as input. We study the transfer properties of Semantica on ImageNet, LSUN Churches, LSUN Bedroom and SUN397.

Multistep Consistency Models

Mar 11, 2024

Jonathan Heek, Emiel Hoogeboom, Tim Salimans

Abstract: Diffusion models are relatively easy to train but require many steps to generate samples. Consistency models are far more difficult to train, but generate samples in a single step. In this paper we propose Multistep Consistency Models: a unification between Consistency Models (Song et al., 2023) and TRACT (Berthelot et al., 2023) that can interpolate between a consistency model and a diffusion model: a trade-off between sampling speed and sampling quality. Specifically, a 1-step consistency model is a conventional consistency model whereas we show that an $\infty$-step consistency model is a diffusion model. Multistep Consistency Models work really well in practice. By increasing the sample budget from a single step to 2-8 steps, we can more easily train models that generate higher quality samples, while retaining much of the sampling speed benefits. Notable results are 1.4 FID on ImageNet64 in 8 steps and 2.1 FID on ImageNet128 in 8 steps with consistency distillation. We also show that our method scales to a text-to-image diffusion model, generating samples that are very close to the quality of the original model.
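
The interpolation between the two extremes can be pictured with a small sketch. It assumes, for illustration only, that diffusion time runs over (0, 1] with t = 1 being pure noise, that the interval is cut into `num_steps` equal segments, and that a model is asked to be consistent only down to the start of the segment containing t. The helper name `segment_start` is hypothetical and does not come from the paper.

```python
import math


def segment_start(t: float, num_steps: int) -> float:
    """Lower boundary of the equal-width segment of (0, 1] that contains t."""
    if not 0.0 < t <= 1.0:
        raise ValueError("t must lie in (0, 1]")
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    index = math.ceil(t * num_steps) - 1  # 0-based segment index
    return index / num_steps


# One segment: every t maps to 0, as in a conventional consistency model.
one_step = segment_start(0.75, 1)
# Many segments: the target time sits just below t, so each jump is small.
many_steps = segment_start(0.75, 1000)
```

With a single segment the jump always ends at clean data. As `num_steps` grows, the gap between t and its segment start shrinks toward zero, which matches the abstract's statement that the infinite-step limit behaves like a diffusion model.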

Rolling Diffusion Models

Feb 12, 2024

David Ruhe, Jonathan Heek, Tim Salimans, Emiel Hoogeboom

Abstract: Diffusion models have recently been increasingly applied to temporal data such as video, fluid mechanics simulations, or climate data. These methods generally treat subsequent frames equally regarding the amount of noise in the diffusion process. This paper explores Rolling Diffusion: a new approach that uses a sliding window denoising process. It ensures that the diffusion process progressively corrupts through time by assigning more noise to frames that appear later in a sequence, reflecting greater uncertainty about the future as the generation process unfolds. Empirically, we show that when the temporal dynamics are complex, Rolling Diffusion is superior to standard diffusion. In particular, this result is demonstrated in a video prediction task using the Kinetics-600 video dataset and in a chaotic fluid dynamics forecasting experiment.
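
The idea of giving later frames more noise can be sketched as a per-frame noise schedule over one sliding window. The linear schedule below is an assumption chosen for illustration, and the function name `rolling_noise_levels` is hypothetical. The abstract does not specify the exact schedule.

```python
def rolling_noise_levels(window_size: int, t: float) -> list[float]:
    """Per-frame diffusion times inside one sliding window.

    Frame k (0 = earliest) gets time (k + t) / window_size, so frames
    further in the future are noisier. t in [0, 1] tracks progress
    through the current window position (1 = start, 0 = end).
    """
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return [(k + t) / window_size for k in range(window_size)]


# At t = 1 the last frame in the window is pure noise.
start = rolling_noise_levels(4, 1.0)
# At t = 0 the first frame is fully denoised and can leave the window.
end = rolling_noise_levels(4, 0.0)
```

Under this assumed schedule, dropping the clean first frame at t = 0 and appending a fresh pure-noise frame reproduces the noise levels at t = 1, so the window can slide forward indefinitely.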

DORSal: Diffusion for Object-centric Representations of Scenes $\textit{et al.}$

Jun 13, 2023

Allan Jabri, Sjoerd van Steenkiste, Emiel Hoogeboom, Mehdi S. M. Sajjadi, Thomas Kipf

Abstract: Recent progress in 3D scene understanding enables scalable learning of representations across large datasets of diverse scenes. As a consequence, generalization to unseen scenes and objects, rendering novel views from just a single or a handful of input images, and controllable scene generation that supports editing are now possible. However, training jointly on a large number of scenes typically compromises rendering quality when compared to single-scene optimized models such as NeRFs. In this paper, we leverage recent progress in diffusion models to equip 3D scene representation learning models with the ability to render high-fidelity novel views, while retaining benefits such as object-level scene editing to a large degree. In particular, we propose DORSal, which adapts a video diffusion architecture for 3D scene generation conditioned on object-centric slot-based representations of scenes. On both complex synthetic multi-object scenes and on the real-world large-scale Street View dataset, we show that DORSal enables scalable neural rendering of 3D scenes with object-level editing and improves upon existing approaches.

* Project page: https://www.sjoerdvansteenkiste.com/dorsal

High-Fidelity Image Compression with Score-based Generative Models

May 26, 2023

Emiel Hoogeboom, Eirikur Agustsson, Fabian Mentzer, Luca Versari, George Toderici, Lucas Theis

Abstract: Despite the tremendous success of diffusion generative models in text-to-image generation, replicating this success in the domain of image compression has proven difficult. In this paper, we demonstrate that diffusion can significantly improve perceptual quality at a given bit-rate, outperforming state-of-the-art approaches PO-ELIC and HiFiC as measured by FID score. This is achieved using a simple but theoretically motivated two-stage approach combining an autoencoder targeting MSE followed by a further score-based decoder. However, as we will show, implementation details matter and the optimal design decisions can differ greatly from typical text-to-image models.
