
Leslie Pack Kaelbling

Rational Inverse Reasoning

Aug 12, 2025

Ben Zandonati, Tomás Lozano-Pérez, Leslie Pack Kaelbling

Abstract:Humans can observe a single, imperfect demonstration and immediately generalize to very different problem settings. Robots, in contrast, often require hundreds of examples and still struggle to generalize beyond the training conditions. We argue that this limitation arises from the inability to recover the latent explanations that underpin intelligent behavior, and that these explanations can take the form of structured programs consisting of high-level goals, sub-task decomposition, and execution constraints. In this work, we introduce Rational Inverse Reasoning (RIR), a framework for inferring these latent programs through a hierarchical generative model of behavior. RIR frames few-shot imitation as Bayesian program induction: a vision-language model iteratively proposes structured symbolic task hypotheses, while a planner-in-the-loop inference scheme scores each by the likelihood of the observed demonstration under that hypothesis. This loop yields a posterior over concise, executable programs. We evaluate RIR on a suite of continuous manipulation tasks designed to test one-shot and few-shot generalization across variations in object pose, count, geometry, and layout. With as little as one demonstration, RIR infers the intended task structure and generalizes to novel settings, outperforming state-of-the-art vision-language model baselines.
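
The scoring loop in this abstract amounts to Bayesian inference over a finite set of candidate programs: each hypothesis is weighted by its prior times the likelihood of the demonstration. The minimal sketch below shows only that normalization step. It assumes the candidate programs, their log-priors (shorter programs favored), and their log-likelihoods are already given. The program strings and numbers are hypothetical, and RIR's VLM proposer and planner-in-the-loop likelihood are not reproduced.

```python
import math

def posterior(log_priors, log_likelihoods):
    """Return p(h | demo) for each hypothesis h from log p(h) and log p(demo | h)."""
    scores = {h: log_priors[h] + log_likelihoods[h] for h in log_priors}
    # Log-sum-exp keeps the normalization numerically stable.
    m = max(scores.values())
    log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return {h: math.exp(s - log_z) for h, s in scores.items()}

# Hypothetical candidate programs with log-priors that favor concise programs.
log_priors = {
    "stack(red, blue)": -1.0,
    "stack(red, blue); align(red, blue)": -2.0,
    "place(red, table)": -1.0,
}
# Hypothetical log-likelihoods of the observed demonstration under each program.
log_likelihoods = {
    "stack(red, blue)": -3.0,
    "stack(red, blue); align(red, blue)": -1.5,
    "place(red, table)": -9.0,
}

post = posterior(log_priors, log_likelihoods)
best = max(post, key=post.get)
```

Here the longer program wins despite its lower prior because it explains the demonstration much better, so the posterior trades off concision against fit.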

Streaming Flow Policy: Simplifying diffusion/flow-matching policies by treating action trajectories as flow trajectories

May 28, 2025

Sunshine Jiang, Xiaolin Fang, Nicholas Roy, Tomás Lozano-Pérez, Leslie Pack Kaelbling, Siddharth Ancha

Abstract:Recent advances in diffusion/flow-matching policies have enabled imitation learning of complex, multi-modal action trajectories. However, they are computationally expensive because they sample a trajectory of trajectories: a diffusion/flow trajectory of action trajectories. They discard intermediate action trajectories, and must wait for the sampling process to complete before any actions can be executed on the robot. We simplify diffusion/flow policies by treating action trajectories as flow trajectories. Instead of starting from pure noise, our algorithm samples from a narrow Gaussian around the last action. Then, it incrementally integrates a velocity field learned via flow matching to produce a sequence of actions that constitute a single trajectory. This enables actions to be streamed to the robot on-the-fly during the flow sampling process, and is well-suited for receding horizon policy execution. Despite streaming, our method retains the ability to model multi-modal behavior. We train flows that stabilize around demonstration trajectories to reduce distribution shift and improve imitation learning performance. Streaming flow policy outperforms prior methods while enabling faster policy execution and tighter sensorimotor loops for learning-based robot control. Project website: https://streaming-flow-policy.github.io/

* ICRA 2025 Beyond Pick and Place Workshop
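
The streaming idea can be sketched as forward-Euler integration of a velocity field in which each intermediate state is emitted as an action. In this minimal sketch the learned velocity field is replaced by a hand-written stand-in that follows a hypothetical 2-D demonstration and pulls deviations back toward it; `demo`, `velocity_field`, the gain `k`, and `sigma` are illustrative assumptions, not the paper's model or hyperparameters.

```python
import math
import random

def demo(t):
    # Hypothetical 2-D demonstration trajectory xi(t) for t in [0, 1].
    return [t, math.sin(math.pi * t)]

def demo_velocity(t):
    # Time derivative of demo(t).
    return [1.0, math.pi * math.cos(math.pi * t)]

def velocity_field(a, t, k=5.0):
    # Toy stand-in for a learned field: follow the demo's velocity while
    # pulling deviations back toward the demo (a stabilizing term).
    x, dx = demo(t), demo_velocity(t)
    return [dx[j] - k * (a[j] - x[j]) for j in range(len(a))]

def stream_actions(last_action, n_steps=50, sigma=0.01, seed=0):
    """Yield actions one at a time while integrating the flow."""
    rng = random.Random(seed)
    # Start from a narrow Gaussian around the last executed action.
    a = [v + rng.gauss(0.0, sigma) for v in last_action]
    dt = 1.0 / n_steps
    for i in range(n_steps):
        v = velocity_field(a, i * dt)
        a = [a[j] + dt * v[j] for j in range(len(a))]  # forward Euler step
        yield list(a)  # each step is an action that can be sent immediately
```

Because `stream_actions` is a generator, a controller can forward each action as soon as it is yielded, for example inside `for action in stream_actions([0.0, 0.0]): ...`, rather than waiting for the whole trajectory to be sampled.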

LLM-Guided Probabilistic Program Induction for POMDP Model Estimation

May 04, 2025

Aidan Curtis, Hao Tang, Thiago Veloso, Kevin Ellis, Tomás Lozano-Pérez, Leslie Pack Kaelbling

Abstract:Partially Observable Markov Decision Processes (POMDPs) model decision making under uncertainty. While there are many approaches to approximately solving POMDPs, we aim to address the problem of learning such models. In particular, we are interested in a subclass of POMDPs wherein the components of the model, including the observation function, reward function, transition function, and initial state distribution function, can be modeled as low-complexity probabilistic graphical models in the form of a short probabilistic program. Our strategy to learn these programs uses an LLM as a prior, generating candidate probabilistic programs that are then tested against the empirical distribution and adjusted through feedback. We experiment on a number of classical toy POMDP problems, simulated MiniGrid domains, and two real mobile-base robotics search domains involving partial observability. Our results show that using an LLM to guide in the construction of a low-complexity POMDP model can be more effective than tabular POMDP learning, behavior cloning, or direct LLM planning.
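
One way to read "tested against the empirical distribution" is to score each candidate program by the log-likelihood of logged data and keep the best one. The sketch below assumes a hypothetical noisy sensor over four grid cells and a handful of made-up (state, observation) pairs; the hand-written candidates stand in for LLM-generated programs, and the feedback-driven revision step is omitted.

```python
import math

def make_sensor(p_correct, n_cells=4):
    # Hypothetical candidate observation program: reports the true cell with
    # probability p_correct, otherwise one of the other cells uniformly.
    def obs_prob(state, obs):
        return p_correct if obs == state else (1.0 - p_correct) / (n_cells - 1)
    return obs_prob

def log_likelihood(model, data):
    # Sum of log P(observation | state) over the logged pairs.
    return sum(math.log(model(s, o)) for s, o in data)

# Hypothetical logged (true_state, observation) pairs: 8 correct, 2 wrong.
data = [(0, 0)] * 8 + [(0, 1), (0, 2)]
candidates = {p: make_sensor(p) for p in (0.5, 0.8, 0.95)}
scores = {p: log_likelihood(m, data) for p, m in candidates.items()}
best_p = max(scores, key=scores.get)
```

With 8 of 10 observations correct, the candidate with `p_correct = 0.8` fits best, while the overconfident 0.95 sensor is penalized for the two errors.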

Seeing is Believing: Belief-Space Planning with Foundation Models as Uncertainty Estimators

Apr 04, 2025

Linfeng Zhao, Willie McClinton, Aidan Curtis, Nishanth Kumar, Tom Silver, Leslie Pack Kaelbling, Lawson L. S. Wong

Abstract:Generalizable robotic mobile manipulation in open-world environments poses significant challenges due to long horizons, complex goals, and partial observability. A promising approach to address these challenges involves planning with a library of parameterized skills, where a task planner sequences these skills to achieve goals specified in structured languages, such as logical expressions over symbolic facts. While vision-language models (VLMs) can be used to ground these expressions, they often assume full observability, leading to suboptimal behavior when the agent lacks sufficient information to evaluate facts with certainty. This paper introduces a novel framework that leverages VLMs as a perception module to estimate uncertainty and facilitate symbolic grounding. Our approach constructs a symbolic belief representation and uses a belief-space planner to generate uncertainty-aware plans that incorporate strategic information gathering. This enables the agent to effectively reason about partial observability and property uncertainty. We demonstrate our system on a range of challenging real-world tasks that require reasoning in partially observable environments. Simulated evaluations show that our approach outperforms both vanilla VLM-based end-to-end planning and VLM-based state estimation baselines by planning for and executing strategic information gathering. This work highlights the potential of VLMs to construct belief-space symbolic scene representations, enabling downstream tasks such as uncertainty-aware planning.
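
A symbolic belief lets uncertain facts be refined by information gathering. The minimal sketch below assumes that the VLM's judgement of a fact such as `In(apple, drawer)` behaves like a noisy binary sensor with hypothetical true- and false-positive rates. It updates the fact's probability with Bayes' rule and keeps gathering information while the belief stays below a confidence threshold. The rates, threshold, and fact are illustrative, not values from the paper.

```python
def update_belief(prior, observed_true, tpr=0.9, fpr=0.2):
    """Posterior P(fact) after one noisy observation, via Bayes' rule.

    tpr: hypothetical P(observe true | fact true)
    fpr: hypothetical P(observe true | fact false)
    """
    like_if_true = tpr if observed_true else 1.0 - tpr
    like_if_false = fpr if observed_true else 1.0 - fpr
    num = like_if_true * prior
    return num / (num + like_if_false * (1.0 - prior))

def needs_information_gathering(p_fact, threshold=0.9):
    # Gather information while neither "true" nor "false" is confident enough.
    return max(p_fact, 1.0 - p_fact) < threshold

# Symbolic belief: each grounded fact maps to a probability.
fact = "In(apple, drawer)"
belief = {fact: 0.5}
for _ in range(2):
    if needs_information_gathering(belief[fact]):
        # e.g. open the drawer and look; the observation is assumed positive here
        belief[fact] = update_belief(belief[fact], observed_true=True)
```

Starting from 0.5, one positive observation gives 9/11 (about 0.82), which still triggers information gathering; a second gives about 0.95, which does not.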

Flow-based Domain Randomization for Learning and Sequencing Robotic Skills

Feb 03, 2025

Aidan Curtis, Eric Li, Michael Noseworthy, Nishad Gothoskar, Sachin Chitta, Hui Li, Leslie Pack Kaelbling, Nicole Carey

Abstract:Domain randomization in reinforcement learning is an established technique for increasing the robustness of control policies trained in simulation. By randomizing environment properties during training, the learned policy can become robust to uncertainties along the randomized dimensions. While the environment distribution is typically specified by hand, in this paper we investigate automatically discovering a sampling distribution via entropy-regularized reward maximization of a normalizing-flow-based neural sampling distribution. We show that this architecture is more flexible and provides greater robustness than existing approaches that learn simpler, parameterized sampling distributions, as demonstrated in six simulated and one real-world robotics domain. Lastly, we explore how these learned sampling distributions, combined with a privileged value function, can be used for out-of-distribution detection in an uncertainty-aware multi-step manipulation planner.
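
The entropy-regularized objective can be illustrated without a normalizing flow. For a discrete distribution q over a few environment settings, the maximizer of E_q[R] + α·H(q) has the closed form q(x) ∝ exp(R(x)/α), and its optimal value is α·log Σ exp(R(x)/α). Larger α therefore spreads probability onto low-reward (harder) settings. The sketch below swaps the paper's flow-based sampler for this discrete closed form; the reward values are hypothetical.

```python
import math

def entropy_regularized_optimum(rewards, alpha):
    # For a discrete q over settings, argmax_q E_q[R] + alpha * H(q)
    # is q(x) proportional to exp(R(x) / alpha). Shifting by max(R) avoids overflow.
    m = max(rewards)
    w = [math.exp((r - m) / alpha) for r in rewards]
    z = sum(w)
    return [wi / z for wi in w]

def objective(q, rewards, alpha):
    # Expected reward plus alpha times the Shannon entropy of q.
    expected = sum(qi * r for qi, r in zip(q, rewards))
    entropy = -sum(qi * math.log(qi) for qi in q if qi > 0)
    return expected + alpha * entropy
```

With `rewards = [0.9, 0.8, 0.2, 0.0]`, α = 0.1 puts nearly all of the mass on the first two settings, while α = 1.0 stays much closer to uniform.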

Predicate Invention from Pixels via Pretrained Vision-Language Models

Dec 31, 2024

Ashay Athalye, Nishanth Kumar, Tom Silver, Yichao Liang, Tomás Lozano-Pérez, Leslie Pack Kaelbling

Abstract:Our aim is to learn to solve long-horizon decision-making problems in highly variable, combinatorially complex robotics domains given raw sensor input in the form of images. Previous work has shown that one way to achieve this aim is to learn a structured abstract transition model in the form of symbolic predicates and operators, and then plan within this model to solve novel tasks at test time. However, these learned models do not ground directly into pixels from just a handful of demonstrations. In this work, we propose to invent predicates that operate directly over input images by leveraging the capabilities of pretrained vision-language models (VLMs). Our key idea is that, given a set of demonstrations, a VLM can be used to propose a set of predicates that are potentially relevant for decision-making and then to determine the truth values of these predicates in both the given demonstrations and new image inputs. We build upon an existing framework for predicate invention, which generates feature-based predicates operating on object-centric states, to also generate visual predicates that operate on images. Experimentally, we show that our approach, pix2pred, is able to invent semantically meaningful predicates that enable generalization to novel, complex, and long-horizon tasks across two simulated robotic environments.

* Workshop on Planning in the Era of LLMs (LM4Plan @ AAAI 2025)

Functional Risk Minimization

Dec 30, 2024

Ferran Alet, Clement Gehring, Tomás Lozano-Pérez, Kenji Kawaguchi, Joshua B. Tenenbaum, Leslie Pack Kaelbling

Abstract:The field of Machine Learning has changed significantly since the 1970s. However, its most basic principle, Empirical Risk Minimization (ERM), remains unchanged. We propose Functional Risk Minimization (FRM), a general framework where losses compare functions rather than outputs. This results in better performance in supervised, unsupervised, and RL experiments. In the FRM paradigm, for each data point $(x_i,y_i)$ there is a function $f_{\theta_i}$ that fits it: $y_i = f_{\theta_i}(x_i)$. This allows FRM to subsume ERM for many common loss functions and to capture more realistic noise processes. We also show that FRM provides an avenue towards understanding generalization in the modern over-parameterized regime, as its objective can be framed as finding the simplest model that fits the training data.
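
A toy case makes the "functions rather than outputs" idea concrete. Assume, purely for illustration, a linear model $f_\theta(x) = w x + b$ whose per-point function $f_{\theta_i}$ differs from the shared one only in its offset, so the function fitting $(x_i, y_i)$ has offset $b_i = y_i - w x_i$. Comparing offsets with a squared distance then recovers the ordinary mean-squared error, a small instance of the abstract's claim that FRM subsumes ERM for common losses. This is not the paper's general FRM objective.

```python
def point_function_offset(w, x_i, y_i):
    # Offset b_i of the per-point function f(x) = w*x + b_i that passes through (x_i, y_i).
    return y_i - w * x_i

def frm_offset_loss(w, b, data):
    # Compare functions: squared distance between the shared offset b and each b_i.
    return sum((point_function_offset(w, x, y) - b) ** 2 for x, y in data) / len(data)

def erm_mse(w, b, data):
    # Compare outputs: ordinary mean-squared error of f(x) = w*x + b.
    return sum((y - (w * x + b)) ** 2 for x, y in data) / len(data)
```

Since $b_i - b = y_i - (w x_i + b)$ is exactly the residual, the two losses agree for any data in this offset-only setting.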

One-Shot Manipulation Strategy Learning by Making Contact Analogies

Nov 14, 2024

Yuyao Liu, Jiayuan Mao, Joshua Tenenbaum, Tomás Lozano-Pérez, Leslie Pack Kaelbling

Abstract:We present a novel approach, MAGIC (manipulation analogies for generalizable intelligent contacts), for one-shot learning of manipulation strategies with fast and extensive generalization to novel objects. By leveraging a reference action trajectory, MAGIC effectively identifies similar contact points and sequences of actions on novel objects to replicate a demonstrated strategy, such as using different hooks to retrieve distant objects of different shapes and sizes. Our method is based on a two-stage contact-point matching process that combines global shape matching using pretrained neural features with local curvature analysis to ensure precise and physically plausible contact points. We experiment with three tasks: scooping, hanging, and hooking objects. MAGIC demonstrates superior performance over existing methods, achieving significant improvements in runtime speed and generalization to different object categories. Website: https://magic-2024.github.io/

* CoRL LEAP Workshop, 2024
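
The two-stage matching can be sketched as a shortlist by global feature similarity followed by a local geometric check. In this minimal sketch the pretrained neural features are hypothetical 2-D vectors, and local geometry is reduced to a single scalar curvature per candidate point; the candidate dictionaries and `k` are illustrative assumptions, not MAGIC's actual features or parameters.

```python
import math

def cosine(u, v):
    # Cosine similarity between two non-zero feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def match_contact(ref_feature, ref_curvature, candidates, k=3):
    # Stage 1: keep the k candidates whose global features best match the reference.
    shortlist = sorted(candidates, key=lambda c: cosine(ref_feature, c["feature"]),
                       reverse=True)[:k]
    # Stage 2: among those, pick the point whose local curvature is closest.
    return min(shortlist, key=lambda c: abs(c["curvature"] - ref_curvature))

# Hypothetical candidate contact points on a novel object.
candidates = [
    {"id": "a", "feature": (1.0, 0.0), "curvature": 0.1},
    {"id": "b", "feature": (0.9, 0.1), "curvature": 2.0},
    {"id": "c", "feature": (0.8, 0.2), "curvature": 0.5},
    {"id": "d", "feature": (0.0, 1.0), "curvature": 2.0},
]
match = match_contact((1.0, 0.0), 1.9, candidates, k=3)
```

Point `a` matches the features best but is rejected in stage 2 for its curvature, and point `d` has the right curvature but never reaches the shortlist, so `b` is selected.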

SceneComplete: Open-World 3D Scene Completion in Complex Real World Environments for Robot Manipulation

Oct 31, 2024

Aditya Agarwal, Gaurav Singh, Bipasha Sen, Tomás Lozano-Pérez, Leslie Pack Kaelbling

Abstract:Careful robot manipulation in everyday cluttered environments requires an accurate understanding of the 3D scene, in order to grasp and place objects stably and reliably and to avoid mistakenly colliding with other objects. In general, we must construct such a 3D interpretation of a complex scene based on limited input, such as a single RGB-D image. We describe SceneComplete, a system for constructing a complete, segmented, 3D model of a scene from a single view. It provides a novel pipeline for composing general-purpose pretrained perception modules (vision-language, segmentation, image-inpainting, image-to-3D, and pose-estimation) to obtain high-accuracy results. We demonstrate its accuracy and effectiveness with respect to ground-truth models in a large benchmark dataset and show that its accurate whole-object reconstruction enables robust grasp proposal generation, including for a dexterous hand.

Keypoint Abstraction using Large Models for Object-Relative Imitation Learning

Oct 30, 2024

Xiaolin Fang, Bo-Ruei Huang, Jiayuan Mao, Jasmine Shone, Joshua B. Tenenbaum, Tomás Lozano-Pérez, Leslie Pack Kaelbling

Abstract:Generalization to novel object configurations and instances across diverse tasks and environments is a critical challenge in robotics. Keypoint-based representations have been proven effective as a succinct representation for capturing essential object features, and for establishing a reference frame in action prediction, enabling data-efficient learning of robot skills. However, their reliance on manual design and additional human labels limits their scalability. In this paper, we propose KALM, a framework that leverages large pre-trained vision-language models (LMs) to automatically generate task-relevant and cross-instance consistent keypoints. KALM distills robust and consistent keypoints across views and objects by generating proposals using LMs and verifying them against a small set of robot demonstration data. Based on the generated keypoints, we can train keypoint-conditioned policy models that predict actions in keypoint-centric frames, enabling robots to generalize effectively across varying object poses, camera views, and object instances with similar functional shapes. Our method demonstrates strong performance in the real world, adapting to different tasks and environments from only a handful of demonstrations while requiring no additional labels. Website: https://kalm-il.github.io/

* CoRL LangRob Workshop, 2024
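
Predicting actions in a keypoint-centric frame means expressing a target relative to a keypoint, then mapping it back through the keypoint detected on a new object. The sketch below shows this for planar positions only (a 2-D rotation by a yaw angle) and assumes the keypoint positions and orientations are already given; the function names and the handle example are hypothetical.

```python
import math

def to_keypoint_frame(p_world, keypoint, yaw):
    # Express a world-frame point in a frame centered at `keypoint`, rotated by `yaw`
    # (applies the inverse rotation R(yaw)^T to the offset).
    dx, dy = p_world[0] - keypoint[0], p_world[1] - keypoint[1]
    c, s = math.cos(yaw), math.sin(yaw)
    return (c * dx + s * dy, -s * dx + c * dy)

def from_keypoint_frame(p_local, keypoint, yaw):
    # Map a keypoint-frame point back to the world frame: keypoint + R(yaw) @ p_local.
    c, s = math.cos(yaw), math.sin(yaw)
    x, y = p_local
    return (keypoint[0] + c * x - s * y, keypoint[1] + s * x + c * y)
```

For example, a demonstrated grasp point 0.1 units along a handle keypoint's x-axis, `to_keypoint_frame((1.1, 0.0), (1.0, 0.0), 0.0)`, maps to about `(2.0, 3.1)` when the handle keypoint on a new object is found at `(2.0, 3.0)` rotated by π/2.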
