Abstract: The performance of audio latent diffusion models is governed primarily by generator expressivity and the modelability of the underlying latent space. While recent research has focused on the former, as well as on improving the reconstruction fidelity of audio codecs, we demonstrate that latent modelability can be significantly improved through explicit factor disentanglement. We present PoDAR (Power-Disentangled Audio Representation), a framework that uses randomized power augmentation and a latent consistency objective to decouple signal power from invariant semantic content. This factorization makes the latent space easier to model, which both accelerates the convergence of downstream generative models and improves their final performance. When applied to a Stable Audio 1.0 VAE with an F5-TTS generator, PoDAR reaches baseline performance about $2\times$ faster, while increasing final speaker similarity by 0.055 and UTMOS by 0.22 on the LibriSpeech-PC dataset. Furthermore, isolating power into dedicated channels enables classifier-free guidance (CFG) to be applied exclusively to power-invariant content, effectively extending the stable guidance regime to higher scales.
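To make the idea of randomized power augmentation with a latent consistency objective concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the PoDAR implementation: the function names, the gain range, the toy encoder, and the channel layout (content channels first, one power channel last) are all hypothetical. The abstract does not specify the actual encoder, the loss, or the augmentation range.

```python
import numpy as np

def random_power_augment(wave, rng, max_gain_db=12.0):
    """Scale a waveform by a random gain in [-max_gain_db, max_gain_db] dB.

    The gain range is an assumption made for this sketch.
    """
    gain_db = rng.uniform(-max_gain_db, max_gain_db)
    return wave * 10.0 ** (gain_db / 20.0), gain_db

def toy_encoder(wave, n_content=4):
    """Hypothetical stand-in for a learned encoder.

    Returns n_content power-invariant channels followed by one
    channel that carries the signal power in dB.
    """
    rms = np.sqrt(np.mean(wave ** 2) + 1e-12)
    normalized = wave / rms                       # removes overall power
    content = normalized.reshape(n_content, -1).std(axis=1)
    power = np.array([20.0 * np.log10(rms)])      # dedicated power channel
    return np.concatenate([content, power])

def consistency_loss(z_a, z_b, n_content=4):
    """Mean squared difference between the content channels of two latents."""
    return float(np.mean((z_a[:n_content] - z_b[:n_content]) ** 2))

rng = np.random.default_rng(0)
wave = rng.standard_normal(1024)
wave_aug, gain_db = random_power_augment(wave, rng)

z, z_aug = toy_encoder(wave), toy_encoder(wave_aug)
loss = consistency_loss(z, z_aug)        # near zero: content ignores the gain
power_shift = z_aug[-1] - z[-1]          # tracks the applied gain in dB
```

In this toy setting the content channels are invariant to the gain by construction, so the consistency term is essentially zero and the power channel absorbs the full gain. In a trained model the encoder would instead be pushed toward this behavior by minimizing such a consistency term alongside its other training losses.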
Abstract: In inverse problems, one seeks to reconstruct an image from incomplete and/or degraded measurements. Such problems arise in magnetic resonance imaging (MRI), computed tomography, deblurring, superresolution, inpainting, and other applications. Often, many image hypotheses are consistent with both the measurements and prior information, so the goal is not to recover a single ``best'' hypothesis but rather to explore the space of probable hypotheses, i.e., to sample from the posterior distribution. In this work, we propose a regularized conditional Wasserstein GAN that can generate dozens of high-quality posterior samples per second. Using quantitative evaluation metrics such as the conditional Fr\'{e}chet inception distance, we demonstrate that our method produces state-of-the-art posterior samples in both multicoil MRI and inpainting applications.