In recent years there has been increased interest in understanding the interplay between deep generative models (DGMs) and the manifold hypothesis. Research in this area focuses on understanding the reasons why commonly-used DGMs succeed or fail at learning distributions supported on unknown low-dimensional manifolds, as well as developing new models explicitly designed to account for manifold-supported data. This manifold lens provides both clarity as to why some DGMs (e.g. diffusion models and some generative adversarial networks) empirically surpass others (e.g. likelihood-based models such as variational autoencoders, normalizing flows, or energy-based models) at sample generation, and guidance for devising more performant DGMs. We carry out the first survey of DGMs viewed through this lens, making two novel contributions along the way. First, we formally establish that numerical instability of high-dimensional likelihoods is unavoidable when modelling low-dimensional data. We then show that DGMs on learned representations of autoencoders can be interpreted as approximately minimizing Wasserstein distance: this result, which applies to latent diffusion models, helps justify their outstanding empirical results. The manifold lens provides a rich perspective from which to understand DGMs, which we aim to make more accessible and widespread.
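The numerical instability of high-dimensional likelihoods on manifold-supported data can be seen in a minimal sketch (illustrative, not taken from the survey): fit a full-covariance Gaussian to 2-D data that actually lies on a 1-D line. The sample covariance is singular, so the average log-likelihood diverges as the fitted covariance collapses onto the line.

```python
import numpy as np

# Illustrative sketch: 2-D data supported on the 1-D line y = 2x.
rng = np.random.default_rng(0)
t = rng.standard_normal(1000)
X = np.stack([t, 2.0 * t], axis=1)
mean = X.mean(axis=0)

def gauss_loglik(X, mean, cov):
    """Average Gaussian log-likelihood of the rows of X."""
    d = X.shape[1]
    _, logdet = np.linalg.slogdet(cov)
    diff = X - mean
    quad = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(cov), diff)
    return float((-0.5 * (d * np.log(2 * np.pi) + logdet + quad)).mean())

# The sample covariance of X is rank 1; regularize with eps * I so it stays
# invertible, then shrink eps to mimic the model collapsing onto the line.
for eps in [1e-1, 1e-3, 1e-5]:
    cov = np.cov(X.T) + eps * np.eye(2)
    print(eps, gauss_loglik(X, mean, cov))
# the average log-likelihood grows without bound as eps -> 0
```

The blow-up comes entirely from the off-manifold direction: the log-determinant term diverges while the quadratic term stays bounded, which is the instability the formal result above concerns.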
We systematically study a wide variety of image-based generative models spanning semantically-diverse datasets to understand and improve the feature extractors and metrics used to evaluate them. Using best practices in psychophysics, we measure human perception of image realism for generated samples by conducting the largest experiment evaluating generative models to date, and find that no existing metric strongly correlates with human evaluations. Comparing 16 modern metrics for evaluating the overall performance, fidelity, diversity, and memorization of generative models, we find that the state-of-the-art perceptual realism of diffusion models as judged by humans is not reflected in commonly reported metrics such as FID. This discrepancy is not explained by diversity in generated samples, though one cause is over-reliance on Inception-V3. We address these flaws through a study of alternative self-supervised feature extractors, find that the semantic information encoded by individual networks strongly depends on their training procedure, and show that DINOv2-ViT-L/14 allows for much richer evaluation of generative models. Next, we investigate data memorization, and find that generative models do memorize training examples on simple, smaller datasets like CIFAR10, but not necessarily on more complex datasets like ImageNet. However, our experiments show that current metrics do not properly detect memorization; none in the literature is able to separate memorization from other phenomena such as underfitting or mode shrinkage. To facilitate further development of generative models and their evaluation, we release all generated image datasets, human evaluation data, and a modular library to compute 16 common metrics for 8 different encoders at https://github.com/layer6ai-labs/dgm-eval.
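To illustrate how metric computation decouples from the feature extractor, here is a minimal sketch (not the released library's code) of the Fréchet distance underlying FID; swapping Inception-V3 features for, e.g., DINOv2-ViT-L/14 features changes only the inputs, not this computation.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussians fit to two feature sets.
    The encoder producing the features is a free choice."""
    mu1, mu2 = feats_real.mean(0), feats_gen.mean(0)
    c1 = np.cov(feats_real, rowvar=False)
    c2 = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(c1 @ c2).real  # matrix square root of c1 @ c2
    return float(((mu1 - mu2) ** 2).sum() + np.trace(c1 + c2 - 2 * covmean))

# Toy stand-ins for encoder features (real features would come from a network).
rng = np.random.default_rng(0)
a = rng.standard_normal((500, 8))
b = rng.standard_normal((500, 8)) + 1.0  # shifted distribution
print(frechet_distance(a, a), frechet_distance(a, b))
# the first is ~0; the second is large, dominated by the mean shift
```

The abstract's point is that the Gaussian-in-feature-space assumption is only as meaningful as the encoder's features, which is why the choice of extractor matters so much.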
The computational benefits of iterative non-autoregressive transformers decrease as the number of decoding steps increases. As a remedy, we introduce Distill Multiple Steps (DiMS), a simple yet effective distillation technique to decrease the number of required steps to reach a certain translation quality. The distilled model enjoys the computational benefits of early iterations while preserving the enhancements from several iterative steps. DiMS relies on two models, namely a student and a teacher. The student is optimized to predict the output of the teacher after multiple decoding steps, while the teacher follows the student via a slow-moving average. The moving average keeps the teacher's knowledge up to date and enhances the quality of the labels provided by the teacher. During inference, only the student is used for translation, so no additional computation is incurred. We verify the effectiveness of DiMS on various models, obtaining improvements of up to 7 BLEU points on distilled and 12 BLEU points on raw WMT datasets for single-step translation. We release our code at https://github.com/layer6ai-labs/DiMS.
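The slow-moving-average teacher update can be sketched in a few lines (names are illustrative, not the authors' code; `decay` is a hypothetical hyperparameter): the teacher drifts toward the student, so the distillation targets it produces stay current as the student improves.

```python
# Exponential moving average (EMA) of student parameters, the mechanism by
# which a teacher "follows the student via a slow-moving average".
def ema_update(teacher_params, student_params, decay=0.999):
    return {k: decay * teacher_params[k] + (1 - decay) * student_params[k]
            for k in teacher_params}

teacher = {"w": 0.0}
student = {"w": 1.0}  # stand-in for an improved student parameter
for _ in range(1000):
    teacher = ema_update(teacher, student)
print(teacher["w"])  # slowly approaches the student's value
```

In the full method the student's loss would compare its one-step prediction against the teacher's multi-step output; only the EMA mechanics are shown here.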
We study sampling from a target distribution $\nu_* \propto e^{-f}$ using the unadjusted Langevin Monte Carlo (LMC) algorithm when the target $\nu_*$ satisfies the Poincar\'e inequality, and the potential $f$ is first-order smooth and dissipative. Under an opaque uniform warmness condition on the LMC iterates, we establish that $\widetilde{\mathcal{O}}(\epsilon^{-1})$ steps are sufficient for LMC to reach an $\epsilon$-neighborhood of the target in Chi-square divergence. We hope that this note serves as a step towards establishing a complete convergence analysis of LMC under Chi-square divergence.
We study sampling from a target distribution $\nu_* \propto e^{-f}$ using the unadjusted Langevin Monte Carlo (LMC) algorithm when the target $\nu_*$ satisfies the Poincar\'e inequality and the potential $f$ is weakly smooth, i.e., $\nabla f$ is $\beta$-H\"older continuous. We prove that $\widetilde{\mathcal{O}}(\epsilon^{-1/\beta})$ steps are sufficient for LMC to reach an $\epsilon$-neighborhood of the target in Chi-square divergence. We derive the dimension dependency of the convergence rate under various scenarios, paying particular attention to the effects of initialization and the Poincar\'e constant. For convex and first-order smooth potentials, if we assume the Kannan-Lov\'asz-Simonovits (KLS) conjecture, then LMC with warm-start achieves the best-known rate $\widetilde{\mathcal{O}}(d\epsilon^{-1})$ which was previously established for strongly convex potentials. In the pessimistic case where the KLS conjecture does not hold, using the results of Lee and Vempala, and initializing LMC with a Gaussian, we obtain the rate $\widetilde{\mathcal{O}}(d^{3}\epsilon^{-1})$ for all smooth potentials that are convex up to finite perturbations. Translating this rate to KL divergence improves upon the best-known rate for smooth potentials that have linear tail growth. For weakly smooth potentials whose tails behave like $\|x\|^\alpha$, the regime of improvement becomes the interval $\alpha \in (1,10/7]$. Finally, as we rely on the Poincar\'e inequality, our framework covers a wide range of non-convex potentials that are weakly smooth and have at least linear tail growth.
We study sampling from a target distribution ${\nu_* \propto e^{-f}}$ using the unadjusted Langevin Monte Carlo (LMC) algorithm. For any potential function $f$ whose tails behave like ${\|x\|^\alpha}$ for ${\alpha \in [1,2]}$ and whose gradient is $\beta$-H\"older continuous, we prove that ${\widetilde{\mathcal{O}} \Big(d^{\frac{1}{\beta}+\frac{1+\beta}{\beta}(\frac{2}{\alpha} - \boldsymbol{1}_{\{\alpha \neq 1\}})} \epsilon^{-\frac{1}{\beta}}\Big)}$ steps are sufficient to reach an $\epsilon$-neighborhood of a $d$-dimensional target distribution $\nu_*$ in KL-divergence. This convergence rate, in terms of $\epsilon$ dependency, is not directly influenced by the tail growth rate $\alpha$ of the potential function as long as its growth is at least linear, and it only relies on the order of smoothness $\beta$. One notable consequence of this result is that for potentials with Lipschitz gradient, i.e. $\beta=1$, our rate recovers the best known rate ${\widetilde{\mathcal{O}}(d\epsilon^{-1})}$ which was established for strongly convex potentials in terms of $\epsilon$ dependency, but we show that the same rate is achievable for a wider class of potentials that are degenerately convex at infinity. The growth rate $\alpha$ starts to affect the established rate in high dimensions where $d$ is large; furthermore, in the current setup, our rate recovers the best-known dimension dependency when the tail growth of the potential is quadratic, i.e. ${\alpha = 2}$. Our framework allows for finite perturbations, and any order of smoothness ${\beta\in(0,1]}$; consequently, our results are applicable to a wide class of non-convex potentials that are weakly smooth and exhibit at least linear tail growth.
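The unadjusted LMC iteration studied in these abstracts admits a compact sketch: $x_{k+1} = x_k - \eta \nabla f(x_k) + \sqrt{2\eta}\,\xi_k$ with $\xi_k \sim \mathcal{N}(0, I)$. The step size and the quadratic potential below are illustrative choices, not values from the analysis; with $f(x) = \|x\|^2/2$ the target $\nu_* \propto e^{-f}$ is a standard Gaussian, which lets the iterates be sanity-checked.

```python
import numpy as np

rng = np.random.default_rng(0)

def lmc(grad_f, x0, eta=0.05, n_steps=50_000):
    """Unadjusted Langevin Monte Carlo:
    x_{k+1} = x_k - eta * grad_f(x_k) + sqrt(2 * eta) * N(0, I)."""
    x = x0
    samples = np.empty((n_steps,) + x0.shape)
    for k in range(n_steps):
        x = x - eta * grad_f(x) + np.sqrt(2 * eta) * rng.standard_normal(x.shape)
        samples[k] = x
    return samples

# f(x) = ||x||^2 / 2, so grad_f(x) = x and the target is N(0, I).
samples = lmc(grad_f=lambda x: x, x0=np.zeros(2))
print(samples[10_000:].mean(0), samples[10_000:].var(0))  # both ~0 and ~1
```

Note that the empirical variance sits slightly above 1: the unadjusted discretization has a stationary distribution biased away from $\nu_*$ by an amount controlled by $\eta$, which is precisely why the step size (and hence the number of steps) must scale with the target accuracy $\epsilon$.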