Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Rong Xiong

Disambiguate Gripper State in Grasp-Based Tasks: Pseudo-Tactile as Feedback Enables Pure Simulation Learning

Mar 31, 2025

Yifei Yang, Lu Chen, Zherui Song, Yenan Chen, Wentao Sun, Zhongxiang Zhou, Rong Xiong, Yue Wang

Abstract:Grasp-based manipulation tasks are fundamental to robots interacting with their environments, yet gripper state ambiguity significantly reduces the robustness of imitation learning policies for these tasks. Data-driven solutions face the challenge of high real-world data costs, while simulation data, despite its low costs, is limited by the sim-to-real gap. We identify the root cause of gripper state ambiguity as the lack of tactile feedback. To address this, we propose a novel approach employing pseudo-tactile as feedback, inspired by the idea of using a force-controlled gripper as a tactile sensor. This method enhances policy robustness without additional data collection and hardware involvement, while providing a noise-free binary gripper state observation for the policy and thus facilitating pure simulation learning to unleash the power of simulation. Experimental results across three real-world grasp-based tasks demonstrate the necessity, effectiveness, and efficiency of our approach.

* 8 pages, 5 figures, submitted to IROS 2025, project page: https://yifei-y.github.io/project-pages/Pseudo-Tactile-Feedback/

Via

Access Paper or Ask Questions

Efficient Alignment of Unconditioned Action Prior for Language-conditioned Pick and Place in Clutter

Mar 12, 2025

Kechun Xu, Xunlong Xia, Kaixuan Wang, Yifei Yang, Yunxuan Mao, Bing Deng, Rong Xiong, Yue Wang

Abstract:We study the task of language-conditioned pick and place in clutter, where a robot should grasp a target object in open clutter and move it to a specified place. Some approaches learn end-to-end policies with features from vision foundation models, requiring large datasets. Others combine foundation models in a zero-shot setting, suffering from cascading errors. In addition, they primarily leverage vision and language foundation models, focusing less on action priors. In this paper, we aim to develop an effective policy by integrating foundation priors from vision, language, and action. We propose A$^2$, an action prior alignment method that aligns unconditioned action priors with 3D vision-language priors by learning one attention layer. The alignment formulation enables our policy to train with less data and preserve zero-shot generalization capabilities. We show that a shared policy for both pick and place actions enhances the performance for each task, and introduce a policy adaptation scheme to accommodate the multi-modal nature of actions. Extensive experiments in simulation and the real-world show that our policy achieves higher task success rates with fewer steps for both pick and place tasks in clutter, effectively generalizing to unseen objects and language instructions.

Via

Access Paper or Ask Questions

Natural Humanoid Robot Locomotion with Generative Motion Prior

Mar 12, 2025

Haodong Zhang, Liang Zhang, Zhenghan Chen, Lu Chen, Yue Wang, Rong Xiong

Abstract:Natural and lifelike locomotion remains a fundamental challenge for humanoid robots to interact with human society. However, previous methods either neglect motion naturalness or rely on unstable and ambiguous style rewards. In this paper, we propose a novel Generative Motion Prior (GMP) that provides fine-grained motion-level supervision for the task of natural humanoid robot locomotion. To leverage natural human motions, we first employ whole-body motion retargeting to effectively transfer them to the robot. Subsequently, we train a generative model offline to predict future natural reference motions for the robot based on a conditional variational auto-encoder. During policy training, the generative motion prior serves as a frozen online motion generator, delivering precise and comprehensive supervision at the trajectory level, including joint angles and keypoint positions. The generative motion prior significantly enhances training stability and improves interpretability by offering detailed and dense guidance throughout the learning process. Experimental results in both simulation and real-world environments demonstrate that our method achieves superior motion naturalness compared to existing approaches. Project page can be found at https://sites.google.com/view/humanoid-gmp

Via

Access Paper or Ask Questions

CarPlanner: Consistent Auto-regressive Trajectory Planning for Large-scale Reinforcement Learning in Autonomous Driving

Feb 27, 2025

Dongkun Zhang, Jiaming Liang, Ke Guo, Sha Lu, Qi Wang, Rong Xiong, Zhenwei Miao, Yue Wang

Abstract:Trajectory planning is vital for autonomous driving, ensuring safe and efficient navigation in complex environments. While recent learning-based methods, particularly reinforcement learning (RL), have shown promise in specific scenarios, RL planners struggle with training inefficiencies and managing large-scale, real-world driving scenarios. In this paper, we introduce \textbf{CarPlanner}, a \textbf{C}onsistent \textbf{a}uto-\textbf{r}egressive \textbf{Planner} that uses RL to generate multi-modal trajectories. The auto-regressive structure enables efficient large-scale RL training, while the incorporation of consistency ensures stable policy learning by maintaining coherent temporal consistency across time steps. Moreover, CarPlanner employs a generation-selection framework with an expert-guided reward function and an invariant-view module, simplifying RL training and enhancing policy performance. Extensive analysis demonstrates that our proposed RL framework effectively addresses the challenges of training efficiency and performance enhancement, positioning CarPlanner as a promising solution for trajectory planning in autonomous driving. To the best of our knowledge, we are the first to demonstrate that the RL-based planner can surpass both IL- and rule-based state-of-the-arts (SOTAs) on the challenging large-scale real-world dataset nuPlan. Our proposed CarPlanner surpasses RL-, IL-, and rule-based SOTA approaches within this demanding dataset.

* CVPR 2025

Via

Access Paper or Ask Questions

BEV-DWPVO: BEV-based Differentiable Weighted Procrustes for Low Scale-drift Monocular Visual Odometry on Ground

Feb 27, 2025

Yufei Wei, Sha Lu, Wangtao Lu, Rong Xiong, Yue Wang

Abstract:Monocular Visual Odometry (MVO) provides a cost-effective, real-time positioning solution for autonomous vehicles. However, MVO systems face the common issue of lacking inherent scale information from monocular cameras. Traditional methods have good interpretability but can only obtain relative scale and suffer from severe scale drift in long-distance tasks. Learning-based methods under perspective view leverage large amounts of training data to acquire prior knowledge and estimate absolute scale by predicting depth values. However, their generalization ability is limited due to the need to accurately estimate the depth of each point. In contrast, we propose a novel MVO system called BEV-DWPVO. Our approach leverages the common assumption of a ground plane, using Bird's-Eye View (BEV) feature maps to represent the environment in a grid-based structure with a unified scale. This enables us to reduce the complexity of pose estimation from 6 Degrees of Freedom (DoF) to 3-DoF. Keypoints are extracted and matched within the BEV space, followed by pose estimation through a differentiable weighted Procrustes solver. The entire system is fully differentiable, supporting end-to-end training with only pose supervision and no auxiliary tasks. We validate BEV-DWPVO on the challenging long-sequence datasets NCLT, Oxford, and KITTI, achieving superior results over existing MVO methods on most evaluation metrics.

Via

Access Paper or Ask Questions

Compliance while resisting: a shear-thickening fluid controller for physical human-robot interaction

Feb 03, 2025

Lu Chen, Lipeng Chen, Xiangchi Chen, Haojian Lu, Yu Zheng, Jun Wu, Yue Wang, Zhengyou Zhang, Rong Xiong

Figure 1 for Compliance while resisting: a shear-thickening fluid controller for physical human-robot interaction

Figure 2 for Compliance while resisting: a shear-thickening fluid controller for physical human-robot interaction

Figure 3 for Compliance while resisting: a shear-thickening fluid controller for physical human-robot interaction

Figure 4 for Compliance while resisting: a shear-thickening fluid controller for physical human-robot interaction

Abstract:Physical human-robot interaction (pHRI) is widely needed in many fields, such as industrial manipulation, home services, and medical rehabilitation, and puts higher demands on the safety of robots. Due to the uncertainty of the working environment, the pHRI may receive unexpected impact interference, which affects the safety and smoothness of the task execution. The commonly used linear admittance control (L-AC) can cope well with high-frequency small-amplitude noise, but for medium-frequency high-intensity impact, the effect is not as good. Inspired by the solid-liquid phase change nature of shear-thickening fluid, we propose a Shear-thickening Fluid Control (SFC) that can achieve both an easy human-robot collaboration and resistance to impact interference. The SFC's stability, passivity, and phase trajectory are analyzed in detail, the frequency and time domain properties are quantified, and parameter constraints in discrete control and coupled stability conditions are provided. We conducted simulations to compare the frequency and time domain characteristics of L-AC, nonlinear admittance controller (N-AC), and SFC, and validated their dynamic properties. In real-world experiments, we compared the performance of L-AC, N-AC, and SFC in both fixed and mobile manipulators. L-AC exhibits weak resistance to impact. N-AC can resist moderate impacts but not high-intensity ones, and may exhibit self-excited oscillations. In contrast, SFC demonstrated superior impact resistance and maintained stable collaboration, enhancing comfort in cooperative water delivery tasks. Additionally, a case study was conducted in a factory setting, further affirming the SFC's capability in facilitating human-robot collaborative manipulation and underscoring its potential in industrial applications.

Via

Access Paper or Ask Questions

Revisit Mixture Models for Multi-Agent Simulation: Experimental Study within a Unified Framework

Jan 28, 2025

Longzhong Lin, Xuewu Lin, Kechun Xu, Haojian Lu, Lichao Huang, Rong Xiong, Yue Wang

Figure 1 for Revisit Mixture Models for Multi-Agent Simulation: Experimental Study within a Unified Framework

Figure 2 for Revisit Mixture Models for Multi-Agent Simulation: Experimental Study within a Unified Framework

Figure 3 for Revisit Mixture Models for Multi-Agent Simulation: Experimental Study within a Unified Framework

Figure 4 for Revisit Mixture Models for Multi-Agent Simulation: Experimental Study within a Unified Framework

Abstract:Simulation plays a crucial role in assessing autonomous driving systems, where the generation of realistic multi-agent behaviors is a key aspect. In multi-agent simulation, the primary challenges include behavioral multimodality and closed-loop distributional shifts. In this study, we revisit mixture models for generating multimodal agent behaviors, which can cover the mainstream methods including continuous mixture models and GPT-like discrete models. Furthermore, we introduce a closed-loop sample generation approach tailored for mixture models to mitigate distributional shifts. Within the unified mixture model~(UniMM) framework, we recognize critical configurations from both model and data perspectives. We conduct a systematic examination of various model configurations, including positive component matching, continuous regression, prediction horizon, and the number of components. Moreover, our investigation into the data configuration highlights the pivotal role of closed-loop samples in achieving realistic simulations. To extend the benefits of closed-loop samples across a broader range of mixture models, we further address the shortcut learning and off-policy learning issues. Leveraging insights from our exploration, the distinct variants proposed within the UniMM framework, including discrete, anchor-free, and anchor-based models, all achieve state-of-the-art performance on the WOSAC benchmark.

Via

Access Paper or Ask Questions

Leverage Cross-Attention for End-to-End Open-Vocabulary Panoptic Reconstruction

Jan 02, 2025

Xuan Yu, Yuxuan Xie, Yili Liu, Haojian Lu, Rong Xiong, Yiyi Liao, Yue Wang

Figure 1 for Leverage Cross-Attention for End-to-End Open-Vocabulary Panoptic Reconstruction

Figure 2 for Leverage Cross-Attention for End-to-End Open-Vocabulary Panoptic Reconstruction

Figure 3 for Leverage Cross-Attention for End-to-End Open-Vocabulary Panoptic Reconstruction

Figure 4 for Leverage Cross-Attention for End-to-End Open-Vocabulary Panoptic Reconstruction

Abstract:Open-vocabulary panoptic reconstruction offers comprehensive scene understanding, enabling advances in embodied robotics and photorealistic simulation. In this paper, we propose PanopticRecon++, an end-to-end method that formulates panoptic reconstruction through a novel cross-attention perspective. This perspective models the relationship between 3D instances (as queries) and the scene's 3D embedding field (as keys) through their attention map. Unlike existing methods that separate the optimization of queries and keys or overlook spatial proximity, PanopticRecon++ introduces learnable 3D Gaussians as instance queries. This formulation injects 3D spatial priors to preserve proximity while maintaining end-to-end optimizability. Moreover, this query formulation facilitates the alignment of 2D open-vocabulary instance IDs across frames by leveraging optimal linear assignment with instance masks rendered from the queries. Additionally, we ensure semantic-instance segmentation consistency by fusing query-based instance segmentation probabilities with semantic probabilities in a novel panoptic head supervised by a panoptic loss. During training, the number of instance query tokens dynamically adapts to match the number of objects. PanopticRecon++ shows competitive performance in terms of 3D and 2D segmentation and reconstruction performance on both simulation and real-world datasets, and demonstrates a user case as a robot simulator. Our project website is at: https://yuxuan1206.github.io/panopticrecon_pp/

* 18 pages, 10 figures

Via

Access Paper or Ask Questions

Multi-cam Multi-map Visual Inertial Localization: System, Validation and Dataset

Dec 05, 2024

Fuzhang Han, Yufei Wei, Yanmei Jiao, Zhuqing Zhang, Yiyuan Pan, Wenjun Huang, Li Tang, Huan Yin, Xiaqing Ding, Rong Xiong(+1 more)

Figure 1 for Multi-cam Multi-map Visual Inertial Localization: System, Validation and Dataset

Figure 2 for Multi-cam Multi-map Visual Inertial Localization: System, Validation and Dataset

Figure 3 for Multi-cam Multi-map Visual Inertial Localization: System, Validation and Dataset

Figure 4 for Multi-cam Multi-map Visual Inertial Localization: System, Validation and Dataset

Abstract:Map-based localization is crucial for the autonomous movement of robots as it provides real-time positional feedback. However, existing VINS and SLAM systems cannot be directly integrated into the robot's control loop. Although VINS offers high-frequency position estimates, it suffers from drift in long-term operation. And the drift-free trajectory output by SLAM is post-processed with loop correction, which is non-causal. In practical control, it is impossible to update the current pose with future information. Furthermore, existing SLAM evaluation systems measure accuracy after aligning the entire trajectory, which overlooks the transformation error between the odometry start frame and the ground truth frame. To address these issues, we propose a multi-cam multi-map visual inertial localization system, which provides real-time, causal and drift-free position feedback to the robot control loop. Additionally, we analyze the error composition of map-based localization systems and propose a set of evaluation metric suitable for measuring causal localization performance. To validate our system, we design a multi-camera IMU hardware setup and collect a long-term challenging campus dataset. Experimental results demonstrate the higher real-time localization accuracy of the proposed system. To foster community development, both the system and the dataset have been made open source https://github.com/zoeylove/Multi-cam-Multi-map-VILO/tree/main.

Via

Access Paper or Ask Questions

BEV-ODOM: Reducing Scale Drift in Monocular Visual Odometry with BEV Representation

Nov 15, 2024

Yufei Wei, Sha Lu, Fuzhang Han, Rong Xiong, Yue Wang

Figure 1 for BEV-ODOM: Reducing Scale Drift in Monocular Visual Odometry with BEV Representation

Figure 2 for BEV-ODOM: Reducing Scale Drift in Monocular Visual Odometry with BEV Representation

Figure 3 for BEV-ODOM: Reducing Scale Drift in Monocular Visual Odometry with BEV Representation

Figure 4 for BEV-ODOM: Reducing Scale Drift in Monocular Visual Odometry with BEV Representation

Abstract:Monocular visual odometry (MVO) is vital in autonomous navigation and robotics, providing a cost-effective and flexible motion tracking solution, but the inherent scale ambiguity in monocular setups often leads to cumulative errors over time. In this paper, we present BEV-ODOM, a novel MVO framework leveraging the Bird's Eye View (BEV) Representation to address scale drift. Unlike existing approaches, BEV-ODOM integrates a depth-based perspective-view (PV) to BEV encoder, a correlation feature extraction neck, and a CNN-MLP-based decoder, enabling it to estimate motion across three degrees of freedom without the need for depth supervision or complex optimization techniques. Our framework reduces scale drift in long-term sequences and achieves accurate motion estimation across various datasets, including NCLT, Oxford, and KITTI. The results indicate that BEV-ODOM outperforms current MVO methods, demonstrating reduced scale drift and higher accuracy.

Via

Access Paper or Ask Questions