Jinkun Cao

Universal Humanoid Motion Representations for Physics-Based Control

Oct 06, 2023
Zhengyi Luo, Jinkun Cao, Josh Merel, Alexander Winkler, Jing Huang, Kris Kitani, Weipeng Xu

We present a universal motion representation that encompasses a comprehensive range of motor skills for physics-based humanoid control. Due to the high dimensionality of humanoid control as well as the inherent difficulties in reinforcement learning, prior methods have focused on learning skill embeddings for a narrow range of movement styles (e.g. locomotion, game characters) from specialized motion datasets. This limited scope hampers their applicability to complex tasks. Our work closes this gap, significantly increasing the coverage of the motion representation space. To achieve this, we first learn a motion imitator that can imitate all human motion from a large, unstructured motion dataset. We then create our motion representation by distilling skills directly from the imitator. This is achieved using an encoder-decoder structure with a variational information bottleneck. Additionally, we jointly learn a prior conditioned on proprioception (the humanoid's own pose and velocities) to improve model expressiveness and sampling efficiency for downstream tasks. Sampling from the prior, we can generate long, stable, and diverse human motions. Using this latent space for hierarchical RL, we show that our policies solve tasks using natural and realistic human behavior. We demonstrate the effectiveness of our motion representation by solving generative tasks (e.g. strike, terrain traversal) and motion tracking using VR controllers.

* Project page: https://zhengyiluo.github.io/PULSE/ 
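
To make the variational bottleneck with a proprioception-conditioned prior more concrete, here is a minimal PyTorch sketch. All module names, dimensions, and the loss arrangement are illustrative assumptions, not the paper's actual architecture; see the project page for the official code.

```python
import torch
import torch.nn as nn

class LatentSkillSpace(nn.Module):
    """Illustrative encoder/decoder with a proprioception-conditioned prior."""
    def __init__(self, obs_dim=64, goal_dim=32, latent_dim=16, action_dim=28):
        super().__init__()
        # Encoder: proprioception + motion goal -> posterior over latent z
        self.encoder = nn.Sequential(nn.Linear(obs_dim + goal_dim, 256), nn.ReLU(),
                                     nn.Linear(256, 2 * latent_dim))
        # Prior: proprioception only -> prior over latent z
        self.prior = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                   nn.Linear(256, 2 * latent_dim))
        # Decoder: proprioception + latent -> action
        self.decoder = nn.Sequential(nn.Linear(obs_dim + latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, action_dim))

    def forward(self, obs, goal):
        mu_q, logvar_q = self.encoder(torch.cat([obs, goal], -1)).chunk(2, -1)
        mu_p, logvar_p = self.prior(obs).chunk(2, -1)
        z = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()  # reparameterization
        action = self.decoder(torch.cat([obs, z], -1))
        # KL(posterior || learned prior) acts as the variational information bottleneck
        kl = 0.5 * (logvar_p - logvar_q
                    + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp() - 1).sum(-1)
        return action, kl

    @torch.no_grad()
    def sample(self, obs):
        """Downstream use: sample z from the prior given only proprioception."""
        mu_p, logvar_p = self.prior(obs).chunk(2, -1)
        z = mu_p + torch.randn_like(mu_p) * (0.5 * logvar_p).exp()
        return self.decoder(torch.cat([obs, z], -1))
```

In a hierarchical RL setup, a task policy would output z (or reuse the prior's samples) while the frozen decoder turns z into low-level humanoid actions.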

Unified Human-Scene Interaction via Prompted Chain-of-Contacts

Sep 17, 2023
Zeqi Xiao, Tai Wang, Jingbo Wang, Jinkun Cao, Wenwei Zhang, Bo Dai, Dahua Lin, Jiangmiao Pang

Human-Scene Interaction (HSI) is a vital component of fields like embodied AI and virtual reality. Despite advancements in motion quality and physical plausibility, two pivotal factors, versatile interaction control and a user-friendly interface, require further exploration before the practical application of HSI. This paper presents a unified HSI framework, UniHSI, which supports unified control of diverse interactions through language commands. This framework is built upon the definition of interaction as a Chain of Contacts (CoC): steps of human joint-object part pairs, inspired by the strong correlation between interaction types and human-object contact regions. Based on this definition, UniHSI comprises a Large Language Model (LLM) Planner that translates language prompts into task plans in the form of CoC, and a Unified Controller that turns CoC into uniform task execution. To facilitate training and evaluation, we collect a new dataset named ScenePlan that encompasses thousands of task plans generated by LLMs based on diverse scenarios. Comprehensive experiments demonstrate the effectiveness of our framework in versatile task execution and generalizability to real scanned scenes. The project page is at https://github.com/OpenRobotLab/UniHSI .

* A unified Human-Scene Interaction framework that supports versatile interactions through language commands. Project URL: https://xizaoqu.github.io/unihsi/ . Code: https://github.com/OpenRobotLab/UniHSI 
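
As a rough illustration, a Chain of Contacts could be represented as an ordered sequence of joint-object-part contact steps. The field names and relation labels below are illustrative assumptions, not UniHSI's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ContactPair:
    joint: str        # humanoid joint, e.g. "pelvis"
    object_part: str  # scene object part, e.g. "chair_seat"
    relation: str     # contact relation, e.g. "contact" or "not contact"

@dataclass
class ChainOfContacts:
    """One interaction task plan: an ordered sequence of contact steps."""
    steps: List[List[ContactPair]] = field(default_factory=list)

# Example plan an LLM planner might emit for "sit on the chair" (hypothetical):
plan = ChainOfContacts(steps=[
    [ContactPair("right_hand", "chair_back", "contact")],
    [ContactPair("pelvis", "chair_seat", "contact"),
     ContactPair("left_foot", "floor", "contact")],
])
```

A unified controller would then consume each step as a goal specification, executing the plan step by step.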

Perpetual Humanoid Control for Real-time Simulated Avatars

May 24, 2023
Zhengyi Luo, Jinkun Cao, Alexander Winkler, Kris Kitani, Weipeng Xu

We present a physics-based humanoid controller that achieves high-fidelity motion imitation and fault-tolerant behavior in the presence of noisy input (e.g. pose estimates from video or generated from language) and unexpected falls. Our controller scales up to learning ten thousand motion clips without using any external stabilizing forces and learns to naturally recover from fail-states. Given a reference motion, our controller can perpetually control simulated avatars without requiring resets. At its core, we propose the progressive multiplicative control policy (PMCP), which dynamically allocates new network capacity to learn increasingly difficult motion sequences. PMCP allows efficient scaling for learning from large-scale motion databases and adding new tasks, such as fail-state recovery, without catastrophic forgetting. We demonstrate the effectiveness of our controller by using it to imitate noisy poses from video-based pose estimators and language-based motion generators in a live and real-time multi-person avatar use case.

* Project page: https://zhengyiluo.github.io/PHC/ 
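
The sketch below illustrates the general flavor of multiplicative composition with progressively added primitives: Gaussian primitives are combined via a weighted product, and new primitives can be appended while earlier ones are frozen. It is a generic illustration under assumed shapes, not the paper's actual PMCP implementation.

```python
import torch
import torch.nn as nn

class MultiplicativePolicy(nn.Module):
    """Weighted product of Gaussian primitives; new primitives can be appended."""
    def __init__(self, obs_dim=64, act_dim=28):
        super().__init__()
        self.obs_dim, self.act_dim = obs_dim, act_dim
        self.primitives = nn.ModuleList()
        self.gate = None
        self.add_primitive()  # start with one primitive

    def add_primitive(self):
        # Freeze existing primitives so earlier skills are not forgotten
        for p in self.primitives:
            p.requires_grad_(False)
        self.primitives.append(nn.Sequential(
            nn.Linear(self.obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 2 * self.act_dim)))  # outputs mean and log-std
        # A fresh gate covers all current primitives
        self.gate = nn.Sequential(nn.Linear(self.obs_dim, 256), nn.ReLU(),
                                  nn.Linear(256, len(self.primitives)))

    def forward(self, obs):
        w = torch.softmax(self.gate(obs), dim=-1)          # (B, K) gating weights
        mus, sigmas = [], []
        for p in self.primitives:
            mu, log_std = p(obs).chunk(2, -1)
            mus.append(mu)
            sigmas.append(log_std.exp())
        mu = torch.stack(mus, 1)        # (B, K, A)
        sigma = torch.stack(sigmas, 1)  # (B, K, A)
        # Product of Gaussians: precisions add, means are precision-weighted
        prec = w.unsqueeze(-1) / sigma ** 2
        var = 1.0 / prec.sum(1)
        mean = var * (prec * mu).sum(1)
        return mean, var.sqrt()
```

Appending a primitive for a new batch of harder clips (or a recovery task) leaves the frozen primitives intact, which is one way to avoid catastrophic forgetting at the sketch level.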

MV-JAR: Masked Voxel Jigsaw and Reconstruction for LiDAR-Based Self-Supervised Pre-Training

Mar 23, 2023
Runsen Xu, Tai Wang, Wenwei Zhang, Runjian Chen, Jinkun Cao, Jiangmiao Pang, Dahua Lin

This paper introduces the Masked Voxel Jigsaw and Reconstruction (MV-JAR) method for LiDAR-based self-supervised pre-training and a carefully designed data-efficient 3D object detection benchmark on the Waymo dataset. Inspired by the scene-voxel-point hierarchy in downstream 3D object detectors, we design masking and reconstruction strategies accounting for voxel distributions in the scene and local point distributions within the voxel. We employ a Reversed-Furthest-Voxel-Sampling strategy to address the uneven distribution of LiDAR points and propose MV-JAR, which combines two techniques for modeling the aforementioned distributions, resulting in superior performance. Our experiments reveal limitations in previous data-efficient experiments, which uniformly sample fine-tuning splits with varying data proportions from each LiDAR sequence, leading to similar data diversity across splits. To address this, we propose a new benchmark that samples scene sequences for diverse fine-tuning splits, ensuring adequate model convergence and providing a more accurate evaluation of pre-training methods. Experiments on our Waymo benchmark and the KITTI dataset demonstrate that MV-JAR consistently and significantly improves 3D detection performance across various data scales, achieving up to a 6.3% increase in mAPH compared to training from scratch. Codes and the benchmark will be available at https://github.com/SmartBot-PJLab/MV-JAR .

* Accepted by CVPR 2023 with a carefully designed benchmark on Waymo. Codes and the benchmark will be available at https://github.com/SmartBot-PJLab/MV-JAR 
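
To make the voxel-level masking concrete, here is a generic farthest-point-style sampling sketch over voxel centers in NumPy. It only illustrates selecting a spatially spread subset of voxels to keep while masking the rest; it is not the paper's exact Reversed-Furthest-Voxel-Sampling procedure.

```python
import numpy as np

def farthest_voxel_sampling(voxel_centers: np.ndarray, num_samples: int) -> np.ndarray:
    """Greedy farthest-point sampling over voxel centers (N, 3); returns kept indices."""
    n = voxel_centers.shape[0]
    selected = np.zeros(num_samples, dtype=np.int64)
    dist = np.full(n, np.inf)
    selected[0] = np.random.randint(n)
    for i in range(1, num_samples):
        diff = voxel_centers - voxel_centers[selected[i - 1]]
        dist = np.minimum(dist, (diff ** 2).sum(-1))  # distance to nearest selected voxel
        selected[i] = int(dist.argmax())              # pick the farthest remaining voxel
    return selected

# Example: keep a spatially spread 10% of voxels and mask the rest for reconstruction.
centers = np.random.rand(2000, 3) * 50.0          # fake voxel centers in meters
keep = farthest_voxel_sampling(centers, num_samples=200)
mask = np.ones(len(centers), dtype=bool)
mask[keep] = False                                 # True = masked voxel
```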

Deep OC-SORT: Multi-Pedestrian Tracking by Adaptive Re-Identification

Feb 23, 2023
Gerard Maggiolino, Adnan Ahmad, Jinkun Cao, Kris Kitani

Motion-based association for Multi-Object Tracking (MOT) has recently re-achieved prominence with the rise of powerful object detectors. Despite this, little work has been done to incorporate appearance cues beyond simple heuristic models that lack robustness to feature degradation. In this paper, we propose a novel way to leverage objects' appearances to adaptively integrate appearance matching into existing high-performance motion-based methods. Building upon the pure motion-based method OC-SORT, we achieve 1st place on MOT20 and 2nd place on MOT17 with 63.9 and 64.9 HOTA, respectively. We also achieve 61.3 HOTA on the challenging DanceTrack benchmark as a new state of the art even compared to more heavily designed methods. The code and models are available at https://github.com/GerardMaggiolino/Deep-OC-SORT .

* Ranks 1st among published methods on MOT17, MOT20 and DanceTrack benchmarks; 5 pages 
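
A small NumPy sketch of the general idea of adaptively blending appearance with motion cues: a track's appearance embedding is updated with a confidence-dependent EMA, and the association cost mixes IoU with cosine similarity. The specific weighting formulas here are illustrative assumptions, not the exact ones from the paper.

```python
import numpy as np

def update_track_embedding(track_emb, det_emb, det_conf, base_alpha=0.95):
    """EMA update that trusts high-confidence detections more (illustrative rule)."""
    alpha = base_alpha + (1.0 - base_alpha) * (1.0 - det_conf)  # low conf -> keep old embedding
    emb = alpha * track_emb + (1.0 - alpha) * det_emb
    return emb / np.linalg.norm(emb)

def association_cost(iou, track_emb, det_emb, app_weight=0.3):
    """Combine motion (IoU) and appearance (cosine similarity) into one cost."""
    cos_sim = float(track_emb @ det_emb)
    return -(iou + app_weight * cos_sim)  # lower cost = better match

# Usage with dummy values
rng = np.random.default_rng(0)
t_emb = rng.normal(size=128); t_emb /= np.linalg.norm(t_emb)
d_emb = rng.normal(size=128); d_emb /= np.linalg.norm(d_emb)
t_emb = update_track_embedding(t_emb, d_emb, det_conf=0.9)
print(association_cost(iou=0.6, track_emb=t_emb, det_emb=d_emb))
```

The combined costs for all track-detection pairs would then be fed to a standard assignment solver (e.g. the Hungarian algorithm).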

Track Targets by Dense Spatio-Temporal Position Encoding

Oct 17, 2022
Jinkun Cao, Hao Wu, Kris Kitani

In this work, we propose a novel paradigm to encode the position of targets for target tracking in videos using transformers. The proposed paradigm, Dense Spatio-Temporal (DST) position encoding, encodes spatio-temporal position information in a pixel-wise dense fashion. This encoding provides location information to associate targets across frames beyond appearance matching by comparing objects in two bounding boxes. Compared to the typical transformer positional encoding, our proposed encoding is applied to the 2D CNN features instead of the projected feature vectors to avoid losing positional information. Moreover, the designed DST encoding can uniformly represent the location of an object in a single frame and the evolution of the trajectory's location across frames. Integrated with the DST encoding, we build a transformer-based multi-object tracking model. The model takes a video clip as input and performs target association within the clip. It can also perform online inference by associating existing trajectories with objects from newly arriving frames. Experiments on video multi-object tracking (MOT) and multi-object tracking and segmentation (MOTS) datasets demonstrate the effectiveness of the proposed DST position encoding.

* 10 pages, 3 figures, accepted by BMVC 2022 (oral) 
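
To illustrate what a pixel-wise dense spatio-temporal encoding might look like on 2D CNN feature maps, here is a hedged PyTorch sketch using sinusoidal functions over (x, y, t). The construction is a generic assumption for illustration, not the paper's exact DST formulation.

```python
import math
import torch

def dense_st_encoding(feat: torch.Tensor, frame_idx: int, num_freqs: int = 4) -> torch.Tensor:
    """feat: (C, H, W) feature map of one frame; returns feat + a dense (x, y, t) encoding."""
    c, h, w = feat.shape
    y = torch.linspace(0, 1, h).view(h, 1).expand(h, w)
    x = torch.linspace(0, 1, w).view(1, w).expand(h, w)
    t = torch.full((h, w), float(frame_idx))
    enc = []
    for k in range(num_freqs):
        freq = (2.0 ** k) * math.pi
        for coord in (x, y, t):
            enc.append(torch.sin(freq * coord))
            enc.append(torch.cos(freq * coord))
    enc = torch.stack(enc, dim=0)  # (6 * num_freqs, H, W): dense per-pixel, per-frame encoding
    # Match the channel count of the feature map (truncate or zero-pad)
    enc = enc[:c] if enc.shape[0] >= c else torch.cat([enc, enc.new_zeros(c - enc.shape[0], h, w)])
    return feat + enc              # added to CNN features, preserving the spatial layout

# Usage: encode each frame of a clip before feeding the transformer
clip_feats = [torch.randn(32, 24, 40) for _ in range(4)]
encoded = [dense_st_encoding(f, frame_idx=i) for i, f in enumerate(clip_feats)]
```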

An Empirical Study on Disentanglement of Negative-free Contrastive Learning

Jun 09, 2022
Jinkun Cao, Ruiqian Nai, Qing Yang, Jialei Huang, Yang Gao

Negative-free contrastive learning has attracted a lot of attention for its simplicity and impressive performance in large-scale pretraining, but its disentanglement property remains unexplored. In this paper, we use different negative-free contrastive learning methods to empirically study the disentanglement property of this genre of self-supervised methods. We find that existing disentanglement metrics fail to make meaningful measurements for high-dimensional representation models, so we propose a new disentanglement metric based on the mutual information between representations and data factors. With the proposed metric, we benchmark the disentanglement property of negative-free contrastive learning for the first time, on both popular synthetic datasets and the real-world dataset CelebA. Our study shows that the investigated methods can learn a well-disentangled subset of the representation. We extend the study of disentangled representation learning to high-dimensional representation spaces and negative-free contrastive learning for the first time. The implementation of the proposed metric is available at https://github.com/noahcao/disentanglement_lib_med .

* Implementation available at https://github.com/noahcao/disentanglement_lib_med 
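
As a rough illustration of a mutual-information-based disentanglement measurement, the sketch below discretizes each representation dimension and estimates MI against each ground-truth factor with scikit-learn. This is a generic MI probe, not the paper's proposed metric; see the linked implementation for that.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mi_matrix(codes: np.ndarray, factors: np.ndarray, num_bins: int = 20) -> np.ndarray:
    """codes: (N, D) continuous representations; factors: (N, K) discrete ground-truth factors.
    Returns a (D, K) matrix of mutual information estimates."""
    n, d = codes.shape
    k = factors.shape[1]
    # Discretize each code dimension into equal-width bins
    binned = np.stack(
        [np.digitize(codes[:, j], np.histogram_bin_edges(codes[:, j], num_bins)[1:-1])
         for j in range(d)], axis=1)
    mi = np.zeros((d, k))
    for j in range(d):
        for f in range(k):
            mi[j, f] = mutual_info_score(binned[:, j], factors[:, f])
    return mi

# Toy usage: 1000 samples, 8-dim codes, 3 discrete factors
rng = np.random.default_rng(0)
factors = rng.integers(0, 5, size=(1000, 3))
codes = np.concatenate([factors + 0.1 * rng.normal(size=(1000, 3)),
                        rng.normal(size=(1000, 5))], axis=1)
print(mi_matrix(codes, factors).round(2))
```

A sharp MI value for one code dimension and one factor, with low MI everywhere else in that row, is the kind of pattern a disentanglement score would reward.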

Observation-Centric SORT: Rethinking SORT for Robust Multi-Object Tracking

Mar 27, 2022
Jinkun Cao, Xinshuo Weng, Rawal Khirodkar, Jiangmiao Pang, Kris Kitani

Multi-Object Tracking (MOT) has rapidly progressed with the development of object detection and re-identification. However, motion modeling, which facilitates object association by forecasting short-term trajectories from past observations, has been relatively under-explored in recent years. Current motion models in MOT typically assume that object motion is linear within a small time window and require continuous observations, so these methods are sensitive to occlusions and non-linear motion and require high frame-rate videos. In this work, we show that a simple motion model can obtain state-of-the-art tracking performance without other cues such as appearance. We emphasize the role of "observation" when recovering tracks from being lost and in reducing the error accumulated by linear motion models during the lost period. We thus name the proposed method Observation-Centric SORT, or OC-SORT for short. It remains simple, online, and real-time, yet improves robustness to occlusion and non-linear motion. It achieves 63.2 and 62.1 HOTA on MOT17 and MOT20, respectively, surpassing all published methods. It also sets a new state of the art on KITTI Pedestrian Tracking and DanceTrack, where the object motion is highly non-linear. The code and model are available at https://github.com/noahcao/OC_SORT.

* 10 pages + 6 pages of appendix. 6 figures 
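
A tiny NumPy sketch of the observation-centric idea: when a lost track is re-associated, velocity is estimated from the two observations bracketing the gap (rather than from accumulated filter predictions), and virtual observations are linearly interpolated across the gap to re-update the motion model. This illustrates the idea only; it is not the paper's Kalman-filter implementation.

```python
import numpy as np

def observation_centric_reupdate(last_obs, last_t, new_obs, new_t):
    """last_obs/new_obs: (4,) box states observed before/after the lost period.
    Returns interpolated virtual observations and the observation-based velocity."""
    gap = new_t - last_t
    velocity = (new_obs - last_obs) / gap          # direction from observations, not predictions
    virtual = [last_obs + velocity * (t - last_t)  # one virtual observation per missing frame
               for t in range(last_t + 1, new_t)]
    return np.stack(virtual) if virtual else np.empty((0, 4)), velocity

# A track seen at frame 10, lost, then re-found at frame 14
last = np.array([100.0, 50.0, 30.0, 60.0])
new = np.array([120.0, 58.0, 30.0, 60.0])
virtual_obs, vel = observation_centric_reupdate(last, 10, new, 14)
print(virtual_obs)   # 3 interpolated boxes for frames 11-13
print(vel)           # per-frame displacement estimated from observations
```

Feeding the interpolated observations back into the motion model keeps the accumulated prediction error from the lost period from corrupting future associations.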