Huazhe Xu

Unleashing the Power of Pre-trained Language Models for Offline Reinforcement Learning

Nov 07, 2023
Ruizhe Shi, Yuyao Liu, Yanjie Ze, Simon S. Du, Huazhe Xu

Offline reinforcement learning (RL) aims to find a near-optimal policy using pre-collected datasets. In real-world scenarios, data collection could be costly and risky; therefore, offline RL becomes particularly challenging when the in-domain data is limited. Given recent advances in Large Language Models (LLMs) and their few-shot learning prowess, this paper introduces $\textbf{La}$nguage Models for $\textbf{Mo}$tion Control ($\textbf{LaMo}$), a general framework based on Decision Transformers to effectively use pre-trained Language Models (LMs) for offline RL. Our framework highlights four crucial components: (1) initializing Decision Transformers with sequentially pre-trained LMs, (2) employing the LoRA fine-tuning method, in contrast to full-weight fine-tuning, to effectively combine the pre-trained knowledge from LMs with in-domain knowledge, (3) using non-linear MLP transformations instead of linear projections to generate embeddings, and (4) integrating an auxiliary language prediction loss during fine-tuning to stabilize the LMs and retain their original language abilities. Empirical results indicate $\textbf{LaMo}$ achieves state-of-the-art performance in sparse-reward tasks and closes the gap between value-based offline RL methods and Decision Transformers in dense-reward tasks. In particular, our method demonstrates superior performance in scenarios with limited data samples. Our project website is $\href{https://lamo2023.github.io}{\text{this https URL}}$.
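
As a rough illustration of three of the components named above (a LoRA-style low-rank adapter, a non-linear MLP embedding, and the combined action/language objective), here is a minimal PyTorch sketch. It is an illustration of the ideas, not the authors' implementation; the module shapes, loss weight `lam`, and names are assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable low-rank update (W + BA)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # keep pre-trained weights frozen
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling


class MLPEmbedding(nn.Module):
    """Non-linear embedding of states/returns/actions instead of a single linear projection."""
    def __init__(self, in_dim: int, hidden_dim: int, embed_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, embed_dim)
        )

    def forward(self, x):
        return self.net(x)


def lamo_loss(pred_actions, target_actions, lm_logits, lm_targets, lam=0.1):
    """Action-prediction loss plus an auxiliary language-modeling term."""
    action_loss = nn.functional.mse_loss(pred_actions, target_actions)
    lang_loss = nn.functional.cross_entropy(
        lm_logits.view(-1, lm_logits.size(-1)), lm_targets.view(-1)
    )
    return action_loss + lam * lang_loss
```

In the framework described above, the transformer backbone itself is a sequentially pre-trained LM whose frozen weights are adapted through such low-rank updates rather than full-weight fine-tuning.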

* 19 pages, 9 tables 

Uni-O4: Unifying Online and Offline Deep Reinforcement Learning with Multi-Step On-Policy Optimization

Nov 06, 2023
Kun Lei, Zhengmao He, Chenhao Lu, Kaizhe Hu, Yang Gao, Huazhe Xu

Combining offline and online reinforcement learning (RL) is crucial for efficient and safe learning. However, previous approaches treat offline and online learning as separate procedures, resulting in redundant designs and limited performance. We ask: Can we achieve straightforward yet effective offline and online learning without introducing extra conservatism or regularization? In this study, we propose Uni-o4, which utilizes an on-policy objective for both offline and online learning. Owing to the alignment of objectives in the two phases, the RL agent can transfer between offline and online learning seamlessly. This property enhances the flexibility of the learning paradigm, allowing for arbitrary combinations of pretraining, fine-tuning, offline, and online learning. In the offline phase, specifically, Uni-o4 leverages diverse ensemble policies to address the mismatch issues between the estimated behavior policy and the offline dataset. Through a simple offline policy evaluation (OPE) approach, Uni-o4 can achieve multi-step policy improvement safely. We demonstrate that by employing the method above, the fusion of these two paradigms yields superior offline initialization as well as stable and rapid online fine-tuning capabilities. Through real-world robot tasks, we highlight the benefits of this paradigm for rapid deployment in challenging, previously unseen real-world environments. Additionally, through comprehensive evaluations on numerous simulated benchmarks, we substantiate that our method achieves state-of-the-art performance in both offline and offline-to-online fine-tuning learning. Our website: https://lei-kun.github.io/uni-o4/ .
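
The OPE-gated multi-step improvement mentioned above can be pictured with a small, framework-agnostic sketch. Here `update_step` and `ope_estimate` are placeholder callables standing in for an on-policy-style update and an offline policy evaluation; the acceptance rule is an assumption for illustration, not the paper's exact procedure.

```python
def multi_step_improvement(policy, dataset, ope_estimate, update_step, n_steps=5):
    """Iteratively improve a policy offline, keeping each step only if an
    off-policy evaluation (OPE) estimate says it did not get worse."""
    best_value = ope_estimate(policy, dataset)
    for _ in range(n_steps):
        candidate = update_step(policy, dataset)
        value = ope_estimate(candidate, dataset)
        if value >= best_value:          # accept only non-degrading updates
            policy, best_value = candidate, value
        else:
            break                        # stop the multi-step improvement early
    return policy
```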

* Our website: https://lei-kun.github.io/uni-o4/ 

DrM: Mastering Visual Reinforcement Learning through Dormant Ratio Minimization

Oct 30, 2023
Guowei Xu, Ruijie Zheng, Yongyuan Liang, Xiyao Wang, Zhecheng Yuan, Tianying Ji, Yu Luo, Xiaoyu Liu, Jiaxin Yuan, Pu Hua, Shuzhen Li, Yanjie Ze, Hal Daumé III, Furong Huang, Huazhe Xu

Visual reinforcement learning (RL) has shown promise in continuous control tasks. Despite this progress, current algorithms are still unsatisfactory in virtually every aspect of performance, including sample efficiency, asymptotic performance, and robustness to the choice of random seeds. In this paper, we identify a major shortcoming in existing visual RL methods: agents often exhibit sustained inactivity during early training, thereby limiting their ability to explore effectively. Expanding upon this crucial observation, we additionally unveil a significant correlation between the agents' inclination towards motorically inactive exploration and the absence of neuronal activity within their policy networks. To quantify this inactivity, we adopt the dormant ratio as a metric to measure inactivity in the RL agent's network. Empirically, we also recognize that the dormant ratio can act as a standalone indicator of an agent's activity level, regardless of the received reward signals. Leveraging the aforementioned insights, we introduce DrM, a method that uses three core mechanisms to guide agents' exploration-exploitation trade-offs by actively minimizing the dormant ratio. Experiments demonstrate that DrM achieves significant improvements in sample efficiency and asymptotic performance with no broken seeds (76 seeds in total) across three continuous control benchmark environments, including DeepMind Control Suite, MetaWorld, and Adroit. Most importantly, DrM is the first model-free algorithm that consistently solves tasks in both the Dog and Manipulator domains from the DeepMind Control Suite as well as three dexterous hand manipulation tasks without demonstrations in Adroit, all based on pixel observations.
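
The dormant ratio can be computed directly from a network's activations. Below is a minimal sketch, assuming a feed-forward trunk with ReLU activations and the common definition from the dormant-neuron literature (a unit is dormant if its batch-averaged activation, normalized by the layer mean, falls below a threshold); the threshold value and layer traversal are illustrative rather than DrM's exact implementation.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def dormant_ratio(trunk: nn.Sequential, inputs: torch.Tensor, tau: float = 0.025) -> float:
    """Fraction of dormant neurons in a feed-forward trunk for one input batch."""
    dormant, total = 0, 0
    x = inputs
    for layer in trunk:
        x = layer(x)
        if isinstance(layer, nn.ReLU):            # measure post-activation outputs
            score = x.abs().mean(dim=0)           # per-neuron mean over the batch
            norm = score / (score.mean() + 1e-9)  # normalize by the layer average
            dormant += (norm <= tau).sum().item()
            total += norm.numel()
    return dormant / max(total, 1)

# Example: a small MLP policy trunk evaluated on a random batch of observations.
net = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 256), nn.ReLU())
print(dormant_ratio(net, torch.randn(64, 32)))
```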

H-InDex: Visual Reinforcement Learning with Hand-Informed Representations for Dexterous Manipulation

Oct 13, 2023
Yanjie Ze, Yuyao Liu, Ruizhe Shi, Jiaxin Qin, Zhecheng Yuan, Jiashun Wang, Huazhe Xu

Human hands possess remarkable dexterity and have long served as a source of inspiration for robotic manipulation. In this work, we propose a human $\textbf{H}$and$\textbf{-In}$formed visual representation learning framework to solve difficult $\textbf{Dex}$terous manipulation tasks ($\textbf{H-InDex}$) with reinforcement learning. Our framework consists of three stages: (i) pre-training representations with 3D human hand pose estimation, (ii) offline adapting representations with self-supervised keypoint detection, and (iii) reinforcement learning with exponential moving average BatchNorm. The last two stages modify only $0.36\%$ of the pre-trained representation's parameters in total, ensuring that the knowledge from pre-training is preserved to the fullest extent. We empirically study 12 challenging dexterous manipulation tasks and find that H-InDex largely surpasses strong baseline methods and recent visual foundation models for motor control. Code is available at https://yanjieze.com/H-InDex .
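
One way to picture the exponential moving average BatchNorm in stage (iii) is to freeze the encoder weights while letting the BatchNorm running statistics keep adapting slowly on RL observations. The PyTorch sketch below illustrates that idea; the momentum value is an assumption, not the paper's setting.

```python
import torch.nn as nn

def set_ema_batchnorm(encoder: nn.Module, momentum: float = 0.01):
    """Freeze encoder weights but let BatchNorm running statistics keep
    adapting during RL via a slow exponential moving average."""
    for p in encoder.parameters():
        p.requires_grad = False                     # keep pre-trained weights fixed
    for m in encoder.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d)):
            m.momentum = momentum                   # small momentum = slow EMA update
            m.train()                               # update running stats on RL batches
```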

* NeurIPS 2023. Code and videos: https://yanjieze.com/H-InDex 

COPlanner: Plan to Roll Out Conservatively but to Explore Optimistically for Model-Based RL

Oct 11, 2023
Xiyao Wang, Ruijie Zheng, Yanchao Sun, Ruonan Jia, Wichayaporn Wongkamjan, Huazhe Xu, Furong Huang

Dyna-style model-based reinforcement learning contains two phases: model rollouts, which generate samples for policy learning, and real-environment exploration with the current policy, which collects data for dynamics model learning. However, in complex real-world environments, the learned dynamics model is inevitably imperfect, and its prediction errors can mislead policy learning and result in sub-optimal solutions. In this paper, we propose $\texttt{COPlanner}$, a planning-driven framework for model-based methods to address the inaccurately learned dynamics model problem with conservative model rollouts and optimistic environment exploration. $\texttt{COPlanner}$ leverages an uncertainty-aware policy-guided model predictive control (UP-MPC) component to plan for multi-step uncertainty estimation. This estimated uncertainty then serves as a penalty during model rollouts and as a bonus during real-environment exploration, respectively, when choosing actions. Consequently, $\texttt{COPlanner}$ can avoid model-uncertain regions through conservative model rollouts, thereby alleviating the influence of model error. Simultaneously, it actively explores high-reward model-uncertain regions to reduce model error through optimistic real-environment exploration. $\texttt{COPlanner}$ is a plug-and-play framework that can be applied to any Dyna-style model-based method. Experimental results on a series of proprioceptive and visual continuous control tasks demonstrate that both the sample efficiency and the asymptotic performance of strong model-based methods are significantly improved when combined with $\texttt{COPlanner}$.
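
A hedged sketch of the penalty/bonus mechanism described above: ensemble disagreement stands in for the multi-step uncertainty estimate, and the reward is shaped conservatively during imagined rollouts and optimistically during real exploration. The function names and the scalar penalty form are assumptions for illustration, not the UP-MPC component itself.

```python
import numpy as np

def ensemble_uncertainty(models, state, action):
    """Disagreement among an ensemble of one-step dynamics models, used as a
    proxy for model uncertainty. `models` is a list of callables mapping
    (state, action) -> predicted next state."""
    preds = np.stack([m(state, action) for m in models])   # (ensemble, state_dim)
    return preds.std(axis=0).mean()                         # scalar disagreement

def shaped_reward(reward, uncertainty, kappa=1.0, in_model_rollout=True):
    """Penalize uncertainty during imagined rollouts (conservative) and reward
    it during real-environment exploration (optimistic)."""
    return reward - kappa * uncertainty if in_model_rollout else reward + kappa * uncertainty
```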

* 20 pages, 12 figures 

GenSim: Generating Robotic Simulation Tasks via Large Language Models

Oct 02, 2023
Lirui Wang, Yiyang Ling, Zhecheng Yuan, Mohit Shridhar, Chen Bao, Yuzhe Qin, Bailin Wang, Huazhe Xu, Xiaolong Wang

Collecting large amounts of real-world interaction data to train general robotic policies is often prohibitively expensive, thus motivating the use of simulation data. However, existing methods for data generation have generally focused on scene-level diversity (e.g., object instances and poses) rather than task-level diversity, due to the human effort required to come up with and verify novel tasks. This has made it challenging for policies trained on simulation data to demonstrate significant task-level generalization. In this paper, we propose to automatically generate rich simulation environments and expert demonstrations by exploiting the grounding and coding abilities of large language models (LLMs). Our approach, dubbed GenSim, has two modes: goal-directed generation, wherein a target task is given to the LLM and the LLM proposes a task curriculum to solve the target task, and exploratory generation, wherein the LLM bootstraps from previous tasks and iteratively proposes novel tasks that would be helpful in solving more complex tasks. We use GPT4 to expand the existing benchmark by ten times to over 100 tasks, on which we conduct supervised finetuning and evaluate several LLMs, including finetuned GPTs and Code Llama, on code generation for robotic simulation tasks. Furthermore, we observe that LLM-generated simulation programs can enhance task-level generalization significantly when used for multitask policy training. We further find that with minimal sim-to-real adaptation, the multitask policies pretrained on GPT4-generated simulation tasks exhibit stronger transfer to unseen long-horizon tasks in the real world and outperform baselines by 25%. See the project website (https://liruiw.github.io/gensim) for code, demos, and videos.
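
A minimal sketch of the two generation modes, assuming the LLM is exposed as a simple text-in/text-out callable; the prompt wording is illustrative and not GenSim's actual prompt.

```python
def propose_task(llm, existing_tasks, target_task=None):
    """Ask an LLM to propose a new simulation task: goal-directed mode when a
    target task is given, exploratory mode when bootstrapping from prior tasks.
    `llm` is any callable mapping a prompt string to generated text."""
    if target_task is not None:
        instruction = f"Propose the next task in a curriculum toward solving: {target_task}."
    else:
        instruction = "Propose a novel task that builds on the tasks listed below."
    prompt = (
        instruction
        + "\nExisting tasks:\n- " + "\n- ".join(existing_tasks)
        + "\nReturn a task name, a language description, and Python code "
          "implementing the task in the simulator."
    )
    return llm(prompt)
```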

* See our project website (https://liruiw.github.io/gensim), demo (https://huggingface.co/spaces/Gen-Sim/Gen-Sim), and code (https://github.com/liruiw/GenSim) for visualizations and open-source models and datasets 

A Wearable Robotic Hand for Hand-over-Hand Imitation Learning

Sep 26, 2023
Dehao Wei, Huazhe Xu

Dexterous manipulation through imitation learning has gained significant attention in robotics research. The collection of high-quality expert data holds paramount importance when using imitation learning. Existing approaches for acquiring expert data commonly involve utilizing a data glove to capture hand motion information. However, this method suffers from limitations, as the collected information cannot be directly mapped to the robotic hand due to discrepancies in their degrees of freedom or structures. Furthermore, it fails to accurately capture force feedback information between the hand and objects during the demonstration process. To overcome these challenges, this paper presents a novel solution in the form of a wearable dexterous hand, namely the Hand-over-hand Imitation learning wearable RObotic Hand (HIRO Hand), which integrates expert data collection and enables the implementation of dexterous operations. HIRO Hand empowers the operator to utilize their own tactile feedback to determine appropriate force, position, and actions, resulting in more accurate imitation of the expert's actions. We develop both non-learning and visual behavior cloning based controllers, allowing HIRO Hand to successfully achieve grasping and in-hand manipulation.

* 7 pages 

OTAS: Unsupervised Boundary Detection for Object-Centric Temporal Action Segmentation

Sep 12, 2023
Yuerong Li, Zhengrong Xue, Huazhe Xu

Temporal action segmentation is typically achieved by discovering dramatic variations in global visual descriptors. In this paper, we explore the merits of local features by proposing the unsupervised framework of Object-centric Temporal Action Segmentation (OTAS). Broadly speaking, OTAS consists of self-supervised global and local feature extraction modules as well as a boundary selection module that fuses the features and detects salient boundaries for action segmentation. As a second contribution, we discuss the pros and cons of existing frame-level and boundary-level evaluation metrics. Through extensive experiments, we find that OTAS outperforms the previous state-of-the-art method by $41\%$ on average in terms of our recommended F1 score. Surprisingly, OTAS even outperforms the ground-truth human annotations in the user study. Moreover, OTAS is efficient enough to allow real-time inference.
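
To make the fuse-and-select step concrete, here is a small sketch that turns per-frame global and local features into boundary indices by peak-picking a fused change signal; the cosine-distance signal, fusion weight, and peak criterion are assumptions rather than OTAS's actual boundary selection module.

```python
import numpy as np
from scipy.signal import find_peaks

def detect_boundaries(global_feats, local_feats, weight=0.5, prominence=0.1):
    """Fuse global and local per-frame features (arrays of shape [T, D]) and
    pick salient change points as candidate action boundaries."""
    def change_signal(feats):
        f = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-9)
        sim = (f[1:] * f[:-1]).sum(axis=1)        # cosine similarity of adjacent frames
        return 1.0 - sim                           # high value = large visual change
    signal = weight * change_signal(global_feats) + (1 - weight) * change_signal(local_feats)
    boundaries, _ = find_peaks(signal, prominence=prominence)
    return boundaries + 1                          # boundary index refers to the later frame
```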

* Accepted to WACV 2024 

9DTact: A Compact Vision-Based Tactile Sensor for Accurate 3D Shape Reconstruction and Generalizable 6D Force Estimation

Aug 28, 2023
Changyi Lin, Han Zhang, Jikai Xu, Lei Wu, Huazhe Xu

Advancements in vision-based tactile sensors have boosted the ability of robots to perform contact-rich manipulation, particularly when precise positioning and the contact state of the manipulated objects are crucial for successful execution. In this work, we present 9DTact, a straightforward yet versatile tactile sensor that offers 3D shape reconstruction and 6D force estimation capabilities. Conceptually, 9DTact is designed to be highly compact, robust, and adaptable to various robotic platforms. Moreover, it is low-cost and DIY-friendly, requiring minimal assembly skills. Functionally, 9DTact builds upon the optical principles of DTact and is optimized to achieve 3D shape reconstruction with enhanced accuracy and efficiency. Remarkably, we leverage the optical and deformable properties of the translucent gel so that 9DTact can perform 6D force estimation without relying on auxiliary markers or patterns on the gel surface. More specifically, we collect a dataset consisting of approximately 100,000 image-force pairs from 175 complex objects and train a neural network to regress the 6D force, which can generalize to unseen objects. To promote the development and applications of vision-based tactile sensors, we open-source both the hardware and software of 9DTact as well as present a 1-hour video tutorial.
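
The force-estimation stage amounts to supervised regression from tactile images to a 6D force/torque vector. A minimal PyTorch sketch of such a regressor and one training step is shown below; the architecture, input resolution, and hyperparameters are illustrative assumptions, not the paper's network.

```python
import torch
import torch.nn as nn

class ForceRegressor(nn.Module):
    """Small CNN that regresses a 6D force/torque vector from a tactile image."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, 6)               # (Fx, Fy, Fz, Mx, My, Mz)

    def forward(self, tactile_image):
        return self.head(self.backbone(tactile_image))

# One training step on a batch of (image, force) pairs from the collected dataset.
model = ForceRegressor()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
images, forces = torch.randn(8, 3, 64, 64), torch.randn(8, 6)   # placeholder batch
loss = nn.functional.mse_loss(model(images), forces)
opt.zero_grad(); loss.backward(); opt.step()
```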

* Project Website: https://linchangyi1.github.io/9DTact/ 