Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Matthew Bendel

Goodbye Drift: Anchored Tree Sampling for Long-Horizon Video-to-Video Generation

May 19, 2026

Matthew Bendel, Stephen W. Bailey, Mithilesh Vaidya, Sumukh Badam, Xingzhe He

Abstract:Long-horizon video generation suffers from two intertwined issues. First, there is drift, where video quality degrades over time. Second, there are continuity issues which manifest as object permanence issues, or improperly rendering transient content (e.g., an object that appears in non-consecutive frames changing color/style). Recent work has focused on autoregressive distillation techniques that attack both problems simultaneously. We instead choose to focus on drift directly and introduce \textbf{Anchored Tree Sampling (ATS)}: a training-free inference-time scheduler that replaces left-to-right rollout with sparse-to-dense, anchor-bounded imputation organized as a tree. A root call produces sparse anchors over the full horizon, recursive refinement generates intermediate anchors, and final leaf spans are synthesized between neighboring anchors. This reduces the critical path from $K$ sequential rollout steps to $L+1$ tree-hierarchical steps and converts horizon-compounding drift into anchor-bounded drift. We focus on V2V generation in the \emph{static-camera} regime, where sparse anchors over the horizon are well approximated by the dense conditioning signal, and the base model can produce them without retraining. We evaluate ATS against two contemporary autoregressive baselines on Wan $2.1$ $+$ VACE, across five conditioning modalities (inpainting, outpainting, edge, pose, depth). We show that ATS outperforms both competitors in overall quality, as well as in drift prevention. We additionally demonstrate stable $\geq 40$-minute generation on LTX-$2.3$ across the same five modalities. We conclude by proposing a path forward to extend ATS to arbitrarily long T2V generation, as well as the dynamic-camera and multi-shot regimes.

* 30 pages, 23 figures

Via

Access Paper or Ask Questions

PoDAR: Power-Disentangled Audio Representation for Generative Modeling

May 11, 2026

Alejandro Luebs, Mithilesh Vaidya, Ishaan Kumar, Sumukh Badam, Stephen W. Bailey, Matthew Bendel, Jose Sotelo, Xingzhe He

Abstract:The performance of audio latent diffusion models is primarily governed by generator expressivity and the modelability of the underlying latent space. While recent research has focused primarily on the former, as well as improving the reconstruction fidelity of audio codecs, we demonstrate that latent modelability can be significantly improved through explicit factor disentanglement. We present PoDAR (Power-Disentangled Audio Representation), a framework that utilizes a randomized power augmentation and latent consistency objective to decouple signal power from invariant semantic content. This factorization makes the latent space easier to model, which both accelerates the convergence of downstream generative models and improves final overall performance. When applied to a Stable Audio 1.0 VAE with an F5-TTS generator, PoDAR achieves about a $2\times$ acceleration in convergence to match baseline performance, while increasing final speaker similarity by 0.055 and UTMOS by 0.22 on the LibriSpeech-PC dataset. Furthermore, isolating power into dedicated channels enables the application of CFG exclusively to power-invariant content, effectively extending the stable guidance regime to higher scales.

* 9 pages, 3 figures

Via

Access Paper or Ask Questions

A Regularized Conditional GAN for Posterior Sampling in Inverse Problems

Oct 24, 2022

Matthew Bendel, Rizwan Ahmad, Philip Schniter

Abstract:In inverse problems, one seeks to reconstruct an image from incomplete and/or degraded measurements. Such problems arise in magnetic resonance imaging (MRI), computed tomography, deblurring, superresolution, inpainting, and other applications. It is often the case that many image hypotheses are consistent with both the measurements and prior information, and so the goal is not to recover a single ``best'' hypothesis but rather to explore the space of probable hypotheses, i.e., to sample from the posterior distribution. In this work, we propose a regularized conditional Wasserstein GAN that can generate dozens of high-quality posterior samples per second. Using quantitative evaluation metrics like conditional Fr\'{e}chet inception distance, we demonstrate that our method produces state-of-the-art posterior samples in both multicoil MRI and inpainting applications.

Via

Access Paper or Ask Questions