Abstract:World models paired with model predictive control (MPC) can be trained offline on large-scale datasets of expert trajectories and enable generalization to a wide range of planning tasks at inference time. Compared to traditional MPC procedures, which rely on slow search algorithms or on iteratively solving optimization problems exactly, gradient-based planning offers a computationally efficient alternative. However, the performance of gradient-based planning has thus far lagged behind that of other approaches. In this paper, we propose improved methods for training world models that enable efficient gradient-based planning. We begin with the observation that although a world model is trained on a next-state prediction objective, it is used at test-time to instead estimate a sequence of actions. The goal of our work is to close this train-test gap. To that end, we propose train-time data synthesis techniques that enable significantly improved gradient-based planning with existing world models. At test time, our approach outperforms or matches the classical gradient-free cross-entropy method (CEM) across a variety of object manipulation and navigation tasks in 10% of the time budget.




Abstract:Visual reasoning -- the ability to interpret the visual world -- is crucial for embodied agents that operate within three-dimensional scenes. Progress in AI has led to vision and language models capable of answering questions from images. However, their performance declines when tasked with 3D spatial reasoning. To tackle the complexity of such reasoning problems, we introduce an agentic program synthesis approach where LLM agents collaboratively generate a Pythonic API with new functions to solve common subproblems. Our method overcomes limitations of prior approaches that rely on a static, human-defined API, allowing it to handle a wider range of queries. To assess AI capabilities for 3D understanding, we introduce a new benchmark of queries involving multiple steps of grounding and inference. We show that our method outperforms prior zero-shot models for visual reasoning in 3D and empirically validate the effectiveness of our agentic framework for 3D spatial reasoning tasks. Project website: https://glab-caltech.github.io/vadar/



Abstract:Phase retrieval is the nonlinear inverse problem of recovering a true signal from its Fourier magnitude measurements. It arises in many applications such as astronomical imaging, X-Ray crystallography, microscopy, and more. The problem is highly ill-posed due to the phase-induced ambiguities and the large number of possible images that can fit to the given measurements. Thus, there's a rich history of enforcing structural priors to improve solutions including sparsity priors and deep-learning-based generative models. However, such priors are often limited in their representational capacity or generalizability to slightly different distributions. Recent advancements in using denoisers as regularizers for non-convex optimization algorithms have shown promising performance and generalization. We present a way of leveraging the prior implicitly learned by a denoiser to solve phase retrieval problems by incorporating it in a classical alternating minimization framework. Compared to performant denoising-based algorithms for phase retrieval, we showcase competitive performance with Fourier measurements on in-distribution images and notable improvement on out-of-distribution images.