Abstract: Relative localization is critical for cooperation in autonomous multi-robot systems. Existing approaches either rely on shared environmental features or inertial assumptions, or suffer from non-line-of-sight degradation and outliers in complex environments. Robust and efficient fusion of inter-robot measurements such as bearings, distances, and inertial data for tens of robots remains challenging. We present CREPES-X (Cooperative RElative Pose Estimation System with multiple eXtended features), a hierarchical relative localization framework that improves speed, accuracy, and robustness under challenging conditions without requiring any global information. CREPES-X starts with a compact hardware design: infrared (IR) LEDs, an IR camera, an ultra-wideband (UWB) module, and an IMU housed in a cube no larger than 6 cm on each side. On top of this hardware, CREPES-X implements a two-stage hierarchical estimator that balances speed, accuracy, and robustness. First, we propose a single-frame relative estimator that provides instant relative poses for multi-robot setups through a closed-form solution and robust bearing outlier rejection. Second, a multi-frame relative estimator offers accurate and robust relative states by exploiting IMU pre-integration via robocentric relative kinematics with loosely and tightly coupled optimization. Extensive simulations and real-world experiments validate the effectiveness of CREPES-X, showing robustness to up to 90% bearing outliers, demonstrating resilience in challenging conditions, and achieving an RMSE of 0.073 m and 1.817° on real-world datasets.
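
To make the closed-form single-frame idea concrete, the sketch below recovers the 4-DoF relative pose (position plus yaw) of two robots from mutual bearing observations, a UWB range, and gravity alignment from the IMUs. This is a minimal simplification under assumed conventions, not the paper's implementation; the function and variable names are hypothetical.

```python
import numpy as np

def single_frame_relative_pose(bearing_ab, bearing_ba, dist):
    """Closed-form 4-DoF relative pose (position + yaw) between robots A and B.

    Assumes both bearings are unit vectors already rotated into each robot's
    gravity-aligned frame (roll/pitch compensated via IMU), so only yaw is
    unknown. Hypothetical simplification of a single-frame estimator.
    """
    p_ab = dist * bearing_ab   # position of B in A's frame
    p_ba = dist * bearing_ba   # position of A in B's frame
    # Constraint: p_ab = -R_z(yaw) @ p_ba; solve yaw from horizontal components.
    yaw = np.arctan2(p_ab[1], p_ab[0]) - np.arctan2(-p_ba[1], -p_ba[0])
    yaw = (yaw + np.pi) % (2 * np.pi) - np.pi  # wrap to [-pi, pi]
    return p_ab, yaw
```
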
Abstract: Ultra-wideband (UWB)-vision fusion has found extensive application in multi-agent relative localization. The challenging matching problem between robots and visual detections renders existing methods highly dependent on identity-encoded hardware or delicately tuned algorithms. Overconfident yet erroneous matches can cause irreversible damage to the localization system. To address this issue, we introduce Mr. Virgil, an end-to-end learned multi-robot visual-range relative localization framework consisting of a graph neural network for data association between UWB ranging measurements and visual detections, and a differentiable pose graph optimization (PGO) back-end. The graph-based front-end supplies robust matching results, accurate initial position predictions, and credible uncertainty estimates, which are subsequently integrated into the PGO back-end to improve the accuracy of the final pose estimates. Additionally, a decentralized system is implemented for real-world applications. Experiments spanning varying robot counts, simulated and real-world settings, and occluded and non-occluded conditions demonstrate the stability and accuracy of our method across diverse scenes compared to conventional methods. Our code is available at: https://github.com/HiOnes/Mr-Virgil.
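
To illustrate the data-association problem the learned front-end targets, here is a minimal conventional baseline: Hungarian assignment of UWB ranges to visual detections based on range/depth consistency. This is the kind of brittle heuristic the GNN is designed to replace, not the paper's method; names are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate_ranges_to_detections(uwb_ranges, det_positions):
    """Match UWB ranges to visual detections by distance consistency.

    uwb_ranges: (N,) measured distances to N peer robots.
    det_positions: (M, 3) detected robot positions in the camera frame.
    Returns (range_idx, det_idx) pairs minimizing total range-depth mismatch.
    """
    det_dists = np.linalg.norm(det_positions, axis=1)         # (M,)
    cost = np.abs(uwb_ranges[:, None] - det_dists[None, :])   # (N, M)
    rows, cols = linear_sum_assignment(cost)                  # Hungarian matching
    return list(zip(rows, cols))
```
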




Abstract: Recent advances in diffusion models have opened new avenues for research into embodied AI agents and robotics. Despite significant achievements in complex robotic locomotion and skills, mobile manipulation, a capability that requires coordinating navigation and manipulation, remains a challenge for generative AI techniques. This is primarily due to the high-dimensional action space, extended motion trajectories, and interactions with the surrounding environment. In this paper, we introduce M2Diffuser, a diffusion-based, scene-conditioned generative model that directly generates coordinated and efficient whole-body motion trajectories for mobile manipulation based on robot-centric 3D scans. M2Diffuser first learns trajectory-level distributions from mobile manipulation trajectories provided by an expert planner. Crucially, it incorporates an optimization module that can flexibly accommodate physical constraints and task objectives, modeled as cost and energy functions, during inference. This reduces physical violations and execution errors at each denoising step in a fully differentiable manner. Through benchmarking on three types of mobile manipulation tasks across more than 20 scenes, we demonstrate that M2Diffuser outperforms state-of-the-art neural planners and successfully transfers the generated trajectories to a real-world robot. Our evaluations underscore the potential of generative AI to enhance the generalization of traditional planning- and learning-based robotic methods, while also highlighting the critical role of enforcing physical constraints for safe and robust execution.
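
The sketch below shows the general pattern of cost-guided denoising that such optimization-in-the-loop diffusion samplers follow: nudge each denoised prediction down the gradient of a differentiable physics cost. It is an illustrative sketch under assumed interfaces (`model`, `cost_fn`, and `scale` are hypothetical), not M2Diffuser's implementation.

```python
import torch

def guided_denoise_step(model, x_t, t, cost_fn, scale=0.1):
    """One denoising step with differentiable cost guidance.

    Assumes model(x_t, t) predicts the clean trajectory and cost_fn is a
    differentiable physics cost (e.g., collision or joint-limit penalties).
    """
    with torch.no_grad():
        x0_pred = model(x_t, t)                 # predicted clean trajectory
    x0_pred = x0_pred.requires_grad_(True)
    cost = cost_fn(x0_pred)                     # scalar physical-violation cost
    grad, = torch.autograd.grad(cost, x0_pred)
    # Nudge the prediction down the cost gradient before the sampler
    # forms the next noisy sample x_{t-1} (sampler not shown).
    return (x0_pred - scale * grad).detach()
```
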




Abstract: Perception is necessary for autonomous navigation in an unknown area crowded with obstacles. A robot without any sensors that can perceive the environment, i.e., a $\textit{blind}$ robot, cannot navigate safely on its own, and the problem becomes harder for a group of such robots. However, equipping every robot with an expensive perception or SLAM system can be costly. In this paper, we propose a novel system named $\textbf{ColAG}$ that solves autonomous navigation for a group of $\textit{blind}$ UGVs through cooperation with one UAV, the only robot in the group with full perception capabilities. The UAV uses SLAM for its own odometry and mapping while sharing this information with the UGVs via limited relative pose estimation. The UGVs plan their trajectories in the received map and predict possible failures caused by the uncertainty of their wheel odometry and unknown risky areas. The UAV dynamically schedules waypoints to prevent UGV collisions, formulated as a Vehicle Routing Problem with Time Windows that optimizes the UAV's trajectories and minimizes the time UGVs must wait to guarantee safety. We validate our system through extensive simulations with up to 7 UGVs and real-world experiments with 3 UGVs.
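
As a rough illustration of the scheduling problem, the snippet below uses a greedy earliest-deadline-first visit order as a stand-in for the Vehicle Routing Problem with Time Windows: each UGV whose predicted drift exceeds a safety bound posts a rescue deadline, and the UAV checks which requests it can reach in time. This greedy heuristic is a hypothetical simplification, not the paper's optimizer.

```python
import math

def schedule_waypoints(uav_pos, rescue_requests, speed):
    """Greedy earliest-deadline-first stand-in for the VRPTW scheduler.

    rescue_requests: list of (position, deadline) pairs, one per UGV whose
    odometry uncertainty is predicted to become unsafe by `deadline`.
    Returns the visit order with a flag marking requests the UAV cannot
    reach in time (those UGVs would stop and wait).
    """
    order, t, pos = [], 0.0, uav_pos
    for p, deadline in sorted(rescue_requests, key=lambda r: r[1]):
        t += math.dist(pos, p) / speed          # travel time to next waypoint
        order.append((p, t <= deadline))        # (waypoint, reachable_in_time)
        pos = p
    return order
```
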
Abstract: This paper presents a novel solution for UAV control in cooperative multi-robot systems, applicable to scenarios such as leader-following, landing on a moving base, or specific relative motion with respect to a target. Unlike classical methods that tackle UAV control in the world frame, we control the UAV directly in the target coordinate frame, without making motion assumptions about the target. Specifically, we formulate a nonlinear model predictive controller for a UAV within a non-inertial frame (i.e., the target frame). The system requires the relative states (pose and velocity) as well as the angular velocity and acceleration of the target, which can be obtained from relative localization methods and ubiquitous MEMS IMU sensors, respectively. This framework eliminates dependencies that are vital in classical solutions, such as accurate state estimation for both the agent and the target, prior knowledge of the target's motion model, and continuous trajectory re-planning for some complex tasks. We performed extensive simulations to investigate control performance under varying motion characteristics of the target. Furthermore, we conducted extensive real-robot experiments, using a motion-capture system in the laboratory and relative localization methods outdoors, to validate the applicability and feasibility of the proposed approach.
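
The key ingredient of such a controller is a prediction model written in the rotating, accelerating target frame, which picks up the standard fictitious-force terms. The sketch below gives the textbook relative translational dynamics such an NMPC can build on; it is a minimal illustration with hypothetical names, not the paper's full model.

```python
import numpy as np

def relative_accel(p, v, a_uav, a_tgt, w, w_dot):
    """Relative translational dynamics in a non-inertial target frame.

    p, v: relative position/velocity expressed in the target frame.
    a_uav: UAV acceleration rotated into the target frame.
    a_tgt: target acceleration; w, w_dot: target angular velocity/acceleration
    (both measurable with a MEMS IMU on the target).
    """
    return (a_uav - a_tgt
            - 2.0 * np.cross(w, v)           # Coriolis term
            - np.cross(w, np.cross(w, p))    # centrifugal term
            - np.cross(w_dot, p))            # Euler term
```
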




Abstract: Mutual localization plays a crucial role in multi-robot systems. In this work, we propose a novel system to estimate the 3D relative pose, targeting real-world applications. We design and implement a compact hardware module using active infrared (IR) LEDs, an IR fish-eye camera, an ultra-wideband (UWB) module, and an inertial measurement unit (IMU). By leveraging IR light communication, the system solves the data association between visual detections and UWB ranging. Ranging measurements from the UWB and directional information from the camera provide relative 3D position estimates. Combining the mutual relative positions with neighbors and the gravity constraints provided by the IMUs, we can estimate the 3D relative pose from every single frame of sensor fusion. In addition, we design an estimator based on the error-state Kalman filter (ESKF) to enhance system accuracy and robustness. When multiple neighbors are available, a pose graph optimization (PGO) algorithm is applied to further improve accuracy. We conduct experiments in various environments, and the results show that our system outperforms the state of the art in accuracy and robustness, especially in challenging environments.
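
For readers unfamiliar with the filtering back-end, the snippet below shows a generic error-state Kalman filter measurement update of the kind such an estimator performs for each UWB range or camera bearing. It is a standard textbook update with hypothetical names, not the authors' code.

```python
import numpy as np

def eskf_update(x_nom, P, z, h, H, R):
    """Generic ESKF measurement update.

    x_nom: nominal state; P: error-state covariance; z: measurement
    (e.g., UWB range or camera bearing); h(x): predicted measurement;
    H: measurement Jacobian w.r.t. the error state; R: noise covariance.
    """
    y = z - h(x_nom)                       # innovation
    S = H @ P @ H.T + R                    # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)         # Kalman gain
    dx = K @ y                             # error-state estimate
    P = (np.eye(P.shape[0]) - K @ H) @ P   # covariance update
    return dx, P   # dx is injected into the nominal state, then reset to zero
```
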