Abstract: Learning bimanual manipulation is challenging due to its high dimensionality and the tight coordination required between the two arms. Eye-in-hand imitation learning, which uses wrist-mounted cameras, simplifies perception by focusing on task-relevant views. However, collecting diverse demonstrations remains costly, motivating the need for scalable data augmentation. While prior work has explored visual augmentation in single-arm settings, extending these approaches to bimanual manipulation requires generating viewpoint-consistent observations across both arms and producing corresponding action labels that are both valid and feasible. In this work, we propose Diffusion for COordinated Dual-arm Data Augmentation (D-CODA), an offline data augmentation method tailored to eye-in-hand bimanual imitation learning. D-CODA trains a diffusion model to synthesize novel, viewpoint-consistent wrist-camera images for both arms while simultaneously generating joint-space action labels, and it employs constrained optimization to ensure that augmented states involving gripper-to-object contacts satisfy constraints suitable for bimanual coordination. We evaluate D-CODA on 5 simulated and 3 real-world tasks. Across 2,250 simulation trials and 300 real-world trials, D-CODA outperforms baselines and ablations, demonstrating its potential for scalable data augmentation in eye-in-hand bimanual manipulation. Our project website is at: https://dcodaaug.github.io/D-CODA/.
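To make the described pipeline concrete, below is a minimal sketch of an offline augmentation loop of this kind, assuming each demonstration stores per-timestep wrist-camera poses and a contact flag. All function names (sample_viewpoint_pair, diffusion_synthesize, solve_constrained_labels) are hypothetical placeholders for illustration, not the D-CODA implementation.

```python
# Minimal sketch of an offline augmentation loop in the spirit of D-CODA.
# Every function below is a hypothetical placeholder, not the authors' API.
import numpy as np

def sample_viewpoint_pair(state, max_delta=0.05):
    """Apply a bounded random offset to both wrist-camera poses
    (assumed stored as 6-DoF (x, y, z, roll, pitch, yaw) arrays)."""
    delta = np.random.uniform(-max_delta, max_delta, size=6)
    return state["left_cam_pose"] + delta, state["right_cam_pose"] + delta

def diffusion_synthesize(left_pose, right_pose, obs):
    """Placeholder for a diffusion model that renders viewpoint-consistent
    wrist images for both arms; here it simply echoes the originals."""
    return obs["left_img"], obs["right_img"]

def solve_constrained_labels(left_pose, right_pose, in_contact):
    """Placeholder for the constrained optimization that produces joint-space
    action labels, e.g., keeping the relative end-effector transform fixed
    when the grippers are in contact with an object (assumed constraint)."""
    return np.zeros(14)  # assumed 7-DoF joint targets per arm

def augment_demo(demo):
    """Produce augmented (image, image, action) tuples for one demonstration."""
    augmented = []
    for state, obs in zip(demo["states"], demo["observations"]):
        lp, rp = sample_viewpoint_pair(state)
        left_img, right_img = diffusion_synthesize(lp, rp, obs)
        action = solve_constrained_labels(lp, rp, state["in_contact"])
        augmented.append({"left_img": left_img, "right_img": right_img,
                          "action": action})
    return augmented
```

The augmented tuples would simply be appended to the original dataset before training the imitation policy.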
Abstract: Bimanual manipulation is critical to many robotics applications. In contrast to single-arm manipulation, bimanual manipulation tasks are challenging due to their higher-dimensional action spaces. Prior works leverage large amounts of data and primitive actions to address this problem, but may suffer from sample inefficiency and limited generalization across tasks. To this end, we propose VoxAct-B, a language-conditioned, voxel-based method that leverages Vision Language Models (VLMs) to prioritize key regions within the scene and reconstruct a voxel grid. We provide this voxel grid to our bimanual manipulation policy to learn acting and stabilizing actions. This approach enables more efficient policy learning from voxels and generalizes to different tasks. In simulation, we show that VoxAct-B outperforms strong baselines on fine-grained bimanual manipulation tasks. Furthermore, we demonstrate VoxAct-B on real-world $\texttt{Open Drawer}$ and $\texttt{Open Jar}$ tasks using two UR5s. Code, data, and videos will be available at https://voxact-b.github.io.
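A rough sketch of the VLM-guided voxelization step follows, assuming the VLM returns a 3D center for the task-relevant region; the function names, grid size, and crop extent are illustrative assumptions, not the VoxAct-B code.

```python
# Hedged sketch of VLM-guided cropping and voxelization.
# vlm_locate_region is a hypothetical placeholder for the VLM query.
import numpy as np

def vlm_locate_region(rgb, instruction):
    """Placeholder: a VLM returns the center of the task-relevant region
    (e.g., the drawer handle) given an image and a language instruction."""
    return np.array([0.4, 0.0, 0.2])  # assumed 3D center in the world frame

def voxelize(points, center, extent=0.3, resolution=64):
    """Crop the point cloud around the selected region and bin it into a
    dense occupancy grid that the bimanual policy consumes."""
    grid = np.zeros((resolution,) * 3, dtype=np.float32)
    local = points - (center - extent / 2.0)
    idx = (local / extent * resolution).astype(int)
    mask = np.all((idx >= 0) & (idx < resolution), axis=1)
    grid[tuple(idx[mask].T)] = 1.0
    return grid

# Usage with dummy data; the resulting grid would be passed to a policy
# that predicts acting and stabilizing actions for the two arms.
points = np.random.rand(1000, 3)
grid = voxelize(points, vlm_locate_region(None, "open the drawer"))
print(grid.shape)  # (64, 64, 64)
```

Restricting the voxel grid to the VLM-selected region keeps the resolution budget on the task-relevant geometry rather than the whole scene.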
Abstract: Learning from demonstrations (LfD) methods guide learning agents to a desired solution using demonstrations from a teacher. While some LfD methods can handle small mismatches in the action spaces of the teacher and student, here we address the case where the teacher demonstrates the task in an action space that can be substantially different from that of the student -- thereby inducing a large action space mismatch. We bridge this gap with a framework, Morphological Adaptation in Imitation Learning (MAIL), that allows training an agent from demonstrations by other agents with significantly different morphologies (from the student or from each other). MAIL can learn from suboptimal demonstrations, so long as they provide some guidance towards a desired solution. We demonstrate MAIL on challenging household cloth manipulation tasks and introduce a new DRY CLOTH task -- 3D cloth manipulation with obstacles. In these tasks, we train a visual control policy for a robot with one end-effector using demonstrations from a simulated agent with two end-effectors. MAIL shows up to 27% improvement over LfD and non-LfD baselines. It is deployed on a real Franka Panda robot and handles multiple variations in cloth properties (color, thickness, size, material) and pose (rotation and translation). We further show generalization to n-to-m end-effector transfers in the context of a simple rearrangement task.
Abstract: We present a novel Learning from Demonstration (LfD) method, Deformable Manipulation from Demonstrations (DMfD), to solve deformable manipulation tasks using states or images as inputs, given expert demonstrations. Our method uses demonstrations in three different ways, and balances the trade-off between exploring the environment online and using guidance from experts to explore high-dimensional spaces effectively. We test DMfD on a set of representative manipulation tasks for a 1-dimensional rope and a 2-dimensional cloth from the SoftGym suite of tasks, each with state and image observations. Our method exceeds baseline performance by up to 12.9% on state-based tasks and up to 33.44% on image-based tasks, with comparable or better robustness to randomness. Additionally, we create two challenging environments for folding a 2D cloth using image-based observations, and set a performance benchmark for them. We deploy DMfD on a real robot with a minimal loss in normalized performance during real-world execution compared to simulation (~6%). Source code is available at github.com/uscresl/dmfd.
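One common way to realize the stated trade-off between online exploration and expert guidance, shown here only as an illustrative assumption rather than DMfD's exact objective, is to mix an RL actor loss with a behavioral-cloning term on expert data and anneal the cloning weight over training.

```python
# Illustrative sketch (assumed, not DMfD's exact losses): combine a
# critic-maximizing RL term with a BC regression term on expert data.
import torch
import torch.nn as nn

def mixed_actor_loss(policy, critic, online_obs, expert_obs, expert_act,
                     step, bc_weight0=1.0, decay=1e-4):
    """RL term encourages online exploration; BC term keeps the policy
    near expert behavior, with its weight decaying as training proceeds."""
    rl_loss = -critic(online_obs, policy(online_obs)).mean()
    bc_loss = nn.functional.mse_loss(policy(expert_obs), expert_act)
    w = bc_weight0 / (1.0 + decay * step)  # expert guidance fades over time
    return rl_loss + w * bc_loss
```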
Abstract: Learning complex manipulation tasks in realistic, obstructed environments is challenging due to hard exploration in the presence of obstacles and high-dimensional visual observations. Prior work tackles the exploration problem by integrating motion planning and reinforcement learning. However, the motion planner-augmented policy requires access to state information, which is often unavailable in real-world settings. To this end, we propose to distill a state-based motion planner-augmented policy into a visual control policy via (1) visual behavioral cloning, which removes the motion planner dependency along with its jittery motion, and (2) vision-based reinforcement learning guided by the smoothed trajectories from the behavioral cloning agent. We evaluate our method on three manipulation tasks in obstructed environments and compare it against various reinforcement learning and imitation learning baselines. The results demonstrate that our framework is highly sample-efficient and outperforms state-of-the-art algorithms. Moreover, coupled with domain randomization, our policy is capable of zero-shot transfer to unseen environment settings with distractors. Code and videos are available at https://clvrai.com/mopa-pd
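A hedged sketch of the two-stage structure described above (behavioral cloning of the planner-augmented teacher into a visual policy, then vision-based RL warm-started from it) is given below; the module and function names are illustrative assumptions, not the paper's code.

```python
# Stage 1: behavioral cloning from the state-based, planner-augmented
# teacher into a visual policy. Stage 2 (vision-based RL) is summarized
# in the trailing comment. All names here are assumed for illustration.
import torch
import torch.nn as nn

class VisualPolicy(nn.Module):
    def __init__(self, action_dim=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, action_dim))

    def forward(self, img):
        return self.net(img)

def bc_distill(policy, images, teacher_actions, epochs=10, lr=1e-3):
    """Regress the teacher's (smoothed) actions from image observations."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        loss = nn.functional.mse_loss(policy(images), teacher_actions)
        opt.zero_grad(); loss.backward(); opt.step()
    return policy

# Stage 2 (not shown): run a vision-based RL algorithm initialized from
# `policy` and guided/regularized toward the BC agent's smoothed trajectories.
```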