Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yu She

AgenticLab: A Real-World Robot Agent Platform that Can See, Think, and Act

Feb 02, 2026

Pengyuan Guo, Zhonghao Mai, Zhengtong Xu, Kaidi Zhang, Heng Zhang, Zichen Miao, Arash Ajoudani, Zachary Kingston, Qiang Qiu, Yu She

Abstract:Recent advances in large vision-language models (VLMs) have demonstrated generalizable open-vocabulary perception and reasoning, yet their real-robot manipulation capability remains unclear for long-horizon, closed-loop execution in unstructured, in-the-wild environments. Prior VLM-based manipulation pipelines are difficult to compare across different research groups' setups, and many evaluations rely on simulation, privileged state, or specially designed setups. We present AgenticLab, a model-agnostic robot agent platform and benchmark for open-world manipulation. AgenticLab provides a closed-loop agent pipeline for perception, task decomposition, online verification, and replanning. Using AgenticLab, we benchmark state-of-the-art VLM-based agents on real-robot tasks in unstructured environments. Our benchmark reveals several failure modes that offline vision-language tests (e.g., VQA and static image understanding) fail to capture, including breakdowns in multi-step grounding consistency, object grounding under occlusion and scene changes, and insufficient spatial reasoning for reliable manipulation. We will release the full hardware and software stack to support reproducible evaluation and accelerate research on general-purpose robot agents.

Via

Access Paper or Ask Questions

EquiForm: Noise-Robust SE(3)-Equivariant Policy Learning from 3D Point Clouds

Jan 24, 2026

Zhiyuan Zhang, Yu She

Abstract:Visual imitation learning with 3D point clouds has advanced robotic manipulation by providing geometry-aware, appearance-invariant observations. However, point cloud-based policies remain highly sensitive to sensor noise, pose perturbations, and occlusion-induced artifacts, which distort geometric structure and break the equivariance assumptions required for robust generalization. Existing equivariant approaches primarily encode symmetry constraints into neural architectures, but do not explicitly correct noise-induced geometric deviations or enforce equivariant consistency in learned representations. We introduce EquiForm, a noise-robust SE(3)-equivariant policy learning framework for point cloud-based manipulation. EquiForm formalizes how noise-induced geometric distortions lead to equivariance deviations in observation-to-action mappings, and introduces a geometric denoising module to restore consistent 3D structure under noisy or incomplete observations. In addition, we propose a contrastive equivariant alignment objective that enforces representation consistency under both rigid transformations and noise perturbations. Built upon these components, EquiForm forms a flexible policy learning pipeline that integrates noise-robust geometric reasoning with modern generative models. We evaluate EquiForm on 16 simulated tasks and 4 real-world manipulation tasks across diverse objects and scene layouts. Compared to state-of-the-art point cloud imitation learning methods, EquiForm achieves an average improvement of 17.2% in simulation and 28.1% in real-world experiments, demonstrating strong noise robustness and spatial generalization.

* Project website: https://ZhangZhiyuanZhang.github.io/equiform-website/ Code will be released

Via

Access Paper or Ask Questions

Safe Learning for Contact-Rich Robot Tasks: A Survey from Classical Learning-Based Methods to Safe Foundation Models

Dec 10, 2025

Heng Zhang, Rui Dai, Gokhan Solak, Pokuang Zhou, Yu She, Arash Ajoudani

Abstract:Contact-rich tasks pose significant challenges for robotic systems due to inherent uncertainty, complex dynamics, and the high risk of damage during interaction. Recent advances in learning-based control have shown great potential in enabling robots to acquire and generalize complex manipulation skills in such environments, but ensuring safety, both during exploration and execution, remains a critical bottleneck for reliable real-world deployment. This survey provides a comprehensive overview of safe learning-based methods for robot contact-rich tasks. We categorize existing approaches into two main domains: safe exploration and safe execution. We review key techniques, including constrained reinforcement learning, risk-sensitive optimization, uncertainty-aware modeling, control barrier functions, and model predictive safety shields, and highlight how these methods incorporate prior knowledge, task structure, and online adaptation to balance safety and efficiency. A particular emphasis of this survey is on how these safe learning principles extend to and interact with emerging robotic foundation models, especially vision-language models (VLMs) and vision-language-action models (VLAs), which unify perception, language, and control for contact-rich manipulation. We discuss both the new safety opportunities enabled by VLM/VLA-based methods, such as language-level specification of constraints and multimodal grounding of safety signals, and the amplified risks and evaluation challenges they introduce. Finally, we outline current limitations and promising future directions toward deploying reliable, safety-aligned, and foundation-model-enabled robots in complex contact-rich environments. More details and materials are available at our \href{ https://github.com/jack-sherman01/Awesome-Learning4Safe-Contact-rich-tasks}{Project GitHub Repository}.

Via

Access Paper or Ask Questions

Canonical Policy: Learning Canonical 3D Representation for Equivariant Policy

May 24, 2025

Zhiyuan Zhang, Zhengtong Xu, Jai Nanda Lakamsani, Yu She

Figure 1 for Canonical Policy: Learning Canonical 3D Representation for Equivariant Policy

Figure 2 for Canonical Policy: Learning Canonical 3D Representation for Equivariant Policy

Figure 3 for Canonical Policy: Learning Canonical 3D Representation for Equivariant Policy

Figure 4 for Canonical Policy: Learning Canonical 3D Representation for Equivariant Policy

Abstract:Visual Imitation learning has achieved remarkable progress in robotic manipulation, yet generalization to unseen objects, scene layouts, and camera viewpoints remains a key challenge. Recent advances address this by using 3D point clouds, which provide geometry-aware, appearance-invariant representations, and by incorporating equivariance into policy architectures to exploit spatial symmetries. However, existing equivariant approaches often lack interpretability and rigor due to unstructured integration of equivariant components. We introduce canonical policy, a principled framework for 3D equivariant imitation learning that unifies 3D point cloud observations under a canonical representation. We first establish a theory of 3D canonical representations, enabling equivariant observation-to-action mappings by grouping both in-distribution and out-of-distribution point clouds to a canonical representation. We then propose a flexible policy learning pipeline that leverages geometric symmetries from canonical representation and the expressiveness of modern generative models. We validate canonical policy on 12 diverse simulated tasks and 4 real-world manipulation tasks across 16 configurations, involving variations in object color, shape, camera viewpoint, and robot platform. Compared to state-of-the-art imitation learning policies, canonical policy achieves an average improvement of 18.0% in simulation and 37.6% in real-world experiments, demonstrating superior generalization capability and sample efficiency. For more details, please refer to the project website: https://zhangzhiyuanzhang.github.io/cp-website/.

Via

Access Paper or Ask Questions

ManiFeel: Benchmarking and Understanding Visuotactile Manipulation Policy Learning

May 24, 2025

Quan Khanh Luu, Pokuang Zhou, Zhengtong Xu, Zhiyuan Zhang, Qiang Qiu, Yu She

Abstract:Supervised visuomotor policies have shown strong performance in robotic manipulation but often struggle in tasks with limited visual input, such as operations in confined spaces, dimly lit environments, or scenarios where perceiving the object's properties and state is critical for task success. In such cases, tactile feedback becomes essential for manipulation. While the rapid progress of supervised visuomotor policies has benefited greatly from high-quality, reproducible simulation benchmarks in visual imitation, the visuotactile domain still lacks a similarly comprehensive and reliable benchmark for large-scale and rigorous evaluation. To address this, we introduce ManiFeel, a reproducible and scalable simulation benchmark for studying supervised visuotactile manipulation policies across a diverse set of tasks and scenarios. ManiFeel presents a comprehensive benchmark suite spanning a diverse set of manipulation tasks, evaluating various policies, input modalities, and tactile representation methods. Through extensive experiments, our analysis reveals key factors that influence supervised visuotactile policy learning, identifies the types of tasks where tactile sensing is most beneficial, and highlights promising directions for future research in visuotactile policy learning. ManiFeel aims to establish a reproducible benchmark for supervised visuotactile policy learning, supporting progress in visuotactile manipulation and perception. To facilitate future research and ensure reproducibility, we will release our codebase, datasets, training logs, and pretrained checkpoints. Please visit the project website for more details: https://zhengtongxu.github.io/manifeel-website/

Via

Access Paper or Ask Questions

ChicGrasp: Imitation-Learning based Customized Dual-Jaw Gripper Control for Delicate, Irregular Bio-products Manipulation

May 13, 2025

Amirreza Davar, Zhengtong Xu, Siavash Mahmoudi, Pouya Sohrabipour, Chaitanya Pallerla, Yu She, Wan Shou, Philip Crandall, Dongyi Wang

Abstract:Automated poultry processing lines still rely on humans to lift slippery, easily bruised carcasses onto a shackle conveyor. Deformability, anatomical variance, and strict hygiene rules make conventional suction and scripted motions unreliable. We present ChicGrasp, an end--to--end hardware--software co-design for this task. An independently actuated dual-jaw pneumatic gripper clamps both chicken legs, while a conditional diffusion-policy controller, trained from only 50 multi--view teleoperation demonstrations (RGB + proprioception), plans 5 DoF end--effector motion, which includes jaw commands in one shot. On individually presented raw broiler carcasses, our system achieves a 40.6\% grasp--and--lift success rate and completes the pick to shackle cycle in 38 s, whereas state--of--the--art implicit behaviour cloning (IBC) and LSTM-GMM baselines fail entirely. All CAD, code, and datasets will be open-source. ChicGrasp shows that imitation learning can bridge the gap between rigid hardware and variable bio--products, offering a reproducible benchmark and a public dataset for researchers in agricultural engineering and robot learning.

* Submitted for journal review

Via

Access Paper or Ask Questions

DiffOG: Differentiable Policy Trajectory Optimization with Generalizability

Apr 18, 2025

Zhengtong Xu, Zichen Miao, Qiang Qiu, Zhe Zhang, Yu She

Abstract:Imitation learning-based visuomotor policies excel at manipulation tasks but often produce suboptimal action trajectories compared to model-based methods. Directly mapping camera data to actions via neural networks can result in jerky motions and difficulties in meeting critical constraints, compromising safety and robustness in real-world deployment. For tasks that require high robustness or strict adherence to constraints, ensuring trajectory quality is crucial. However, the lack of interpretability in neural networks makes it challenging to generate constraint-compliant actions in a controlled manner. This paper introduces differentiable policy trajectory optimization with generalizability (DiffOG), a learning-based trajectory optimization framework designed to enhance visuomotor policies. By leveraging the proposed differentiable formulation of trajectory optimization with transformer, DiffOG seamlessly integrates policies with a generalizable optimization layer. Visuomotor policies enhanced by DiffOG generate smoother, constraint-compliant action trajectories in a more interpretable way. DiffOG exhibits strong generalization capabilities and high flexibility. We evaluated DiffOG across 11 simulated tasks and 2 real-world tasks. The results demonstrate that DiffOG significantly enhances the trajectory quality of visuomotor policies while having minimal impact on policy performance, outperforming trajectory processing baselines such as greedy constraint clipping and penalty-based trajectory optimization. Furthermore, DiffOG achieves superior performance compared to existing constrained visuomotor policy.

Via

Access Paper or Ask Questions

VILP: Imitation Learning with Latent Video Planning

Feb 03, 2025

Zhengtong Xu, Qiang Qiu, Yu She

Abstract:In the era of generative AI, integrating video generation models into robotics opens new possibilities for the general-purpose robot agent. This paper introduces imitation learning with latent video planning (VILP). We propose a latent video diffusion model to generate predictive robot videos that adhere to temporal consistency to a good degree. Our method is able to generate highly time-aligned videos from multiple views, which is crucial for robot policy learning. Our video generation model is highly time-efficient. For example, it can generate videos from two distinct perspectives, each consisting of six frames with a resolution of 96x160 pixels, at a rate of 5 Hz. In the experiments, we demonstrate that VILP outperforms the existing video generation robot policy across several metrics: training costs, inference speed, temporal consistency of generated videos, and the performance of the policy. We also compared our method with other imitation learning methods. Our findings indicate that VILP can rely less on extensive high-quality task-specific robot action data while still maintaining robust performance. In addition, VILP possesses robust capabilities in representing multi-modal action distributions. Our paper provides a practical example of how to effectively integrate video generation models into robot policies, potentially offering insights for related fields and directions. For more details, please refer to our open-source repository https://github.com/ZhengtongXu/VILP.

Via

Access Paper or Ask Questions

Feelit: Combining Compliant Shape Displays with Vision-Based Tactile Sensors for Real-Time Teletaction

Aug 28, 2024

Oscar Yu, Yu She

Abstract:Teletaction, the transmission of tactile feedback or touch, is a crucial aspect in the field of teleoperation. High-quality teletaction feedback allows users to remotely manipulate objects and increase the quality of the human-machine interface between the operator and the robot, making complex manipulation tasks possible. Advances in the field of teletaction for teleoperation however, have yet to make full use of the high-resolution 3D data provided by modern vision-based tactile sensors. Existing solutions for teletaction lack in one or more areas of form or function, such as fidelity or hardware footprint. In this paper, we showcase our design for a low-cost teletaction device that can utilize real-time high-resolution tactile information from vision-based tactile sensors, through both physical 3D surface reconstruction and shear displacement. We present our device, the Feelit, which uses a combination of a pin-based shape display and compliant mechanisms to accomplish this task. The pin-based shape display utilizes an array of 24 servomotors with miniature Bowden cables, giving the device a resolution of 6x4 pins in a 15x10 mm display footprint. Each pin can actuate up to 3 mm in 200 ms, while providing 80 N of force and 1.5 um of depth resolution. Shear displacement and rotation is achieved using a compliant mechanism design, allowing a minimum of 1 mm displacement laterally and 10 degrees of rotation. This real-time 3D tactile reconstruction is achieved with the use of a vision-based tactile sensor, the GelSight [1], along with an algorithm that samples the depth data and marker tracking to generate actuator commands. Through a series of experiments including shape recognition and relative weight identification, we show that our device has the potential to expand teletaction capabilities in the teleoperation space.

* IROS 2024

Via

Access Paper or Ask Questions

SeeBelow: Sub-dermal 3D Reconstruction of Tumors with Surgical Robotic Palpation and Tactile Exploration

Aug 25, 2024

Raghava Uppuluri, Abhinaba Bhattacharjee, Sohel Anwar, Yu She

Abstract:Surgical scene understanding in Robot-assisted Minimally Invasive Surgery (RMIS) is highly reliant on visual cues and lacks tactile perception. Force-modulated surgical palpation with tactile feedback is necessary for localization, geometry/depth estimation, and dexterous exploration of abnormal stiff inclusions in subsurface tissue layers. Prior works explored surface-level tissue abnormalities or single layered tissue-tumor embeddings with more than 300 palpations for dense 2D stiffness mapping. Our approach focuses on 3D reconstructions of sub-dermal tumor surface profiles in multi-layered tissue (skin-fat-muscle) using a visually-guided novel tactile navigation policy. A robotic palpation probe with tri-axial force sensing was leveraged for tactile exploration of the phantom. From a surface mesh of the surgical region initialized from a depth camera, the policy explores a surgeon's region of interest through palpation, sampled from bayesian optimization. Each palpation includes contour following using a contact-safe impedance controller to trace the sub-dermal tumor geometry, until the underlying tumor-tissue boundary is reached. Projections of these contour following palpation trajectories allows 3D reconstruction of the subdermal tumor surface profile in less than 100 palpations. Our approach generates high-fidelity 3D surface reconstructions of rigid tumor embeddings in tissue layers with isotropic elasticities, although soft tumor geometries are yet to be explored.

* 8 pages, 6 figures, accepted to IROS 2024

Via

Access Paper or Ask Questions