Zhizheng Liu

Data-Efficient Learning from Human Interventions for Mobile Robots

Mar 06, 2025

Zhenghao Peng, Zhizheng Liu, Bolei Zhou

Abstract: Mobile robots are essential in applications such as autonomous delivery and hospitality services. Applying learning-based methods to address mobile robot tasks has gained popularity due to their robustness and generalizability. Traditional methods such as Imitation Learning (IL) and Reinforcement Learning (RL) offer adaptability but require large datasets and carefully crafted reward functions, and they face sim-to-real gaps, making them challenging for efficient and safe real-world deployment. We propose an online human-in-the-loop learning method, PVP4Real, that combines IL and RL to address these issues. PVP4Real enables efficient real-time policy learning from online human intervention and demonstration, without reward or any pretraining, significantly improving data efficiency and training safety. We validate our method by training two different robots -- a legged quadruped and a wheeled delivery robot -- in two mobile robot tasks, one of which even uses raw RGBD images as observations. The training finishes within 15 minutes. Our experiments show the promising future of human-in-the-loop learning in addressing the data efficiency issue in real-world robotic tasks. More information is available at: https://metadriverse.github.io/pvp4real/

* ICRA 2025. Webpage: https://metadriverse.github.io/pvp4real/

Vid2Sim: Realistic and Interactive Simulation from Video for Urban Navigation

Jan 14, 2025

Ziyang Xie, Zhizheng Liu, Zhenghao Peng, Wayne Wu, Bolei Zhou

Abstract: The sim-to-real gap has long posed a significant challenge for robot learning in simulation, preventing the deployment of learned models in the real world. Previous work has primarily focused on domain randomization and system identification to mitigate this gap. However, these methods are often limited by the inherent constraints of the simulation and graphics engines. In this work, we propose Vid2Sim, a novel framework that effectively bridges the sim2real gap through a scalable and cost-efficient real2sim pipeline for neural 3D scene reconstruction and simulation. Given a monocular video as input, Vid2Sim can generate photorealistic and physically interactable 3D simulation environments to enable the reinforcement learning of visual navigation agents in complex urban environments. Extensive experiments demonstrate that Vid2Sim significantly improves the performance of urban navigation in the digital twins and real world by 31.2% and 68.3% in success rate compared with agents trained with prior simulation methods.

* Project page: https://metadriverse.github.io/vid2sim/

Joint Optimization for 4D Human-Scene Reconstruction in the Wild

Jan 04, 2025

Zhizheng Liu, Joe Lin, Wayne Wu, Bolei Zhou

Abstract: Reconstructing human motion and its surrounding environment is crucial for understanding human-scene interaction and predicting human movements in the scene. While much progress has been made in capturing human-scene interaction in constrained environments, those prior methods can hardly reconstruct the natural and diverse human motion and scene context from web videos. In this work, we propose JOSH, a novel optimization-based method for 4D human-scene reconstruction in the wild from monocular videos. JOSH uses techniques in both dense scene reconstruction and human mesh recovery as initialization, and then it leverages the human-scene contact constraints to jointly optimize the scene, the camera poses, and the human motion. Experimental results show that JOSH achieves better results on both global human motion estimation and dense scene reconstruction by joint optimization of scene geometry and human motion. We further design a more efficient model, JOSH3R, and directly train it with pseudo-labels from web videos. JOSH3R outperforms other optimization-free methods by only training with labels predicted from JOSH, further demonstrating its accuracy and generalization ability.

* Project Page: https://genforce.github.io/JOSH/

Learning to Generate Diverse Pedestrian Movements from Web Videos with Noisy Labels

Oct 10, 2024

Zhizheng Liu, Joe Lin, Wayne Wu, Bolei Zhou

Abstract: Understanding and modeling pedestrian movements in the real world is crucial for applications like motion forecasting and scene simulation. Many factors influence pedestrian movements, such as scene context, individual characteristics, and goals, which are often ignored by the existing human generation methods. Web videos contain natural pedestrian behavior and rich motion context, but annotating them with pre-trained predictors leads to noisy labels. In this work, we propose learning diverse pedestrian movements from web videos. We first curate a large-scale dataset called CityWalkers that captures diverse real-world pedestrian movements in urban scenes. Then, based on CityWalkers, we propose a generative model called PedGen for diverse pedestrian movement generation. PedGen introduces automatic label filtering to remove the low-quality labels and a mask embedding to train with partial labels. It also contains a novel context encoder that lifts the 2D scene context to 3D and can incorporate various context factors in generating realistic pedestrian movements in urban scenes. Experiments show that PedGen outperforms existing baseline methods for pedestrian movement generation by learning from noisy labels and incorporating the context factors. In addition, PedGen achieves zero-shot generalization in both real-world and simulated environments. The code, model, and data will be made publicly available at https://genforce.github.io/PedGen/.

* Project Page: https://genforce.github.io/PedGen/

MetaUrban: A Simulation Platform for Embodied AI in Urban Spaces

Jul 11, 2024

Wayne Wu, Honglin He, Yiran Wang, Chenda Duan, Jack He, Zhizheng Liu, Quanyi Li, Bolei Zhou

Abstract: Public urban spaces like streetscapes and plazas serve residents and accommodate social life in all its vibrant variations. Recent advances in Robotics and Embodied AI make public urban spaces no longer exclusive to humans. Food delivery bots and electric wheelchairs have started sharing sidewalks with pedestrians, while diverse robot dogs and humanoids have recently emerged in the street. Ensuring the generalizability and safety of these forthcoming mobile machines is crucial when navigating through the bustling streets in urban spaces. In this work, we present MetaUrban, a compositional simulation platform for Embodied AI research in urban spaces. MetaUrban can construct an infinite number of interactive urban scenes from compositional elements, covering a vast array of ground plans, object placements, pedestrians, vulnerable road users, and other mobile agents' appearances and dynamics. We design point navigation and social navigation tasks as the pilot study using MetaUrban for embodied AI research and establish various baselines of Reinforcement Learning and Imitation Learning. Experiments demonstrate that the compositional nature of the simulated environments can substantially improve the generalizability and safety of the trained mobile agents. MetaUrban will be made publicly available to provide more research opportunities and foster safe and trustworthy embodied AI in urban spaces.

* Technical report. Project page: https://metadriverse.github.io/metaurban/

COOLer: Class-Incremental Learning for Appearance-Based Multiple Object Tracking

Oct 05, 2023

Zhizheng Liu, Mattia Segu, Fisher Yu

Abstract: Continual learning allows a model to learn multiple tasks sequentially while retaining the old knowledge without the training data of the preceding tasks. This paper extends the scope of continual learning research to class-incremental learning for multiple object tracking (MOT), which is desirable to accommodate the continuously evolving needs of autonomous systems. Previous solutions for continual learning of object detectors do not address the data association stage of appearance-based trackers, leading to catastrophic forgetting of previous classes' re-identification features. We introduce COOLer, a COntrastive- and cOntinual-Learning-based tracker, which incrementally learns to track new categories while preserving past knowledge by training on a combination of currently available ground truth labels and pseudo-labels generated by the past tracker. To further strengthen the disentanglement of instance representations, we introduce a novel contrastive class-incremental instance representation learning technique. Finally, we propose a practical evaluation protocol for continual learning for MOT and conduct experiments on the BDD100K and SHIFT datasets. Experimental results demonstrate that COOLer continually learns while effectively addressing catastrophic forgetting of both tracking and detection. The code is available at https://github.com/BoSmallEar/COOLer.

* GCPR 2023 Oral

ScenarioNet: Open-Source Platform for Large-Scale Traffic Scenario Simulation and Modeling

Jul 02, 2023

Quanyi Li, Zhenghao Peng, Lan Feng, Zhizheng Liu, Chenda Duan, Wenjie Mo, Bolei Zhou

Abstract: Large-scale driving datasets such as the Waymo Open Dataset and nuScenes substantially accelerate autonomous driving research, especially for perception tasks such as 3D detection and trajectory forecasting. Since the driving logs in these datasets contain HD maps and detailed object annotations which accurately reflect the real-world complexity of traffic behaviors, we can harvest a massive number of complex traffic scenarios and recreate their digital twins in simulation. Compared to the hand-crafted scenarios often used in existing simulators, data-driven scenarios collected from the real world can facilitate many research opportunities in machine learning and autonomous driving. In this work, we present ScenarioNet, an open-source platform for large-scale traffic scenario modeling and simulation. ScenarioNet defines a unified scenario description format and collects a large-scale repository of real-world traffic scenarios from the heterogeneous data in various driving datasets, including the Waymo, nuScenes, Lyft L5, and nuPlan datasets. These scenarios can be further replayed and interacted with in multiple views, from a bird's-eye-view layout to realistic 3D rendering in the MetaDrive simulator. This provides a benchmark for evaluating the safety of autonomous driving stacks in simulation before their real-world deployment. We further demonstrate the strengths of ScenarioNet on large-scale scenario generation, imitation learning, and reinforcement learning in both single-agent and multi-agent settings. Code, demo videos, and website are available at https://metadriverse.github.io/scenarionet.

DetZero: Rethinking Offboard 3D Object Detection with Long-term Sequential Point Clouds

Jun 09, 2023

Tao Ma, Xuemeng Yang, Hongbin Zhou, Xin Li, Botian Shi, Junjie Liu, Yuchen Yang, Zhizheng Liu, Liang He, Yu Qiao (+2 more)

Abstract: Existing offboard 3D detectors always follow a modular pipeline design to take advantage of unlimited sequential point clouds. We have found that the full potential of offboard 3D detectors is not explored mainly due to two reasons: (1) the onboard multi-object tracker cannot generate sufficiently complete object trajectories, and (2) the motion state of objects poses an inevitable challenge for the object-centric refining stage in leveraging the long-term temporal context representation. To tackle these problems, we propose a novel paradigm of offboard 3D object detection, named DetZero. Concretely, an offline tracker coupled with a multi-frame detector is proposed to focus on the completeness of generated object tracks. An attention-mechanism refining module is proposed to strengthen contextual information interaction across long-term sequential point clouds for object refining with decomposed regression methods. Extensive experiments on the Waymo Open Dataset show that DetZero outperforms all state-of-the-art onboard and offboard 3D detection methods. Notably, DetZero ranks first on the Waymo 3D object detection leaderboard with 85.15 mAPH (L2) detection performance. Further experiments validate that such high-quality results can take the place of human labels. Our empirical study leads to rethinking conventions and interesting findings that can guide future research on offboard 3D object detection.

* 17 pages, 8 figures

Unsupervised Continual Semantic Adaptation through Neural Rendering

Nov 25, 2022

Zhizheng Liu, Francesco Milano, Jonas Frey, Marco Hutter, Roland Siegwart, Hermann Blum, Cesar Cadena

Abstract: An increasing number of applications rely on data-driven models that are deployed for perception tasks across a sequence of scenes. Due to the mismatch between training and deployment data, adapting the model to the new scenes is often crucial to obtain good performance. In this work, we study continual multi-scene adaptation for the task of semantic segmentation, assuming that no ground-truth labels are available during deployment and that performance on the previous scenes should be maintained. We propose training a Semantic-NeRF network for each scene by fusing the predictions of a segmentation model and then using the view-consistent rendered semantic labels as pseudo-labels to adapt the model. Through joint training with the segmentation model, the Semantic-NeRF model effectively enables 2D-3D knowledge transfer. Furthermore, due to its compact size, it can be stored in a long-term memory and subsequently used to render data from arbitrary viewpoints to reduce forgetting. We evaluate our approach on ScanNet, where we outperform both a voxel-based baseline and a state-of-the-art unsupervised domain adaptation method.

* Zhizheng Liu and Francesco Milano share first authorship. Hermann Blum and Cesar Cadena share senior authorship. 18 pages, 7 figures, 10 tables

3D Textured Shape Recovery with Learned Geometric Priors

Sep 07, 2022

Lei Li, Zhizheng Liu, Weining Ren, Liudi Yang, Fangjinhua Wang, Marc Pollefeys, Songyou Peng

Abstract: 3D textured shape recovery from partial scans is crucial for many real-world applications. Existing approaches have demonstrated the efficacy of implicit function representation, but they struggle with partial inputs that have severe occlusions and with varying object types, which greatly hinders their application value in the real world. This technical report presents our approach to address these limitations by incorporating learned geometric priors. To this end, we generate a SMPL model from learned pose prediction and fuse it into the partial input to add prior knowledge of human bodies. We also propose a novel completeness-aware bounding box adaptation for handling partial scans of different scales and degrees of completeness.

* 5 pages, 3 figures, 2 tables
