Hang Zhao

SEPT: Towards Efficient Scene Representation Learning for Motion Prediction

Oct 03, 2023

Zhiqian Lan, Yuxuan Jiang, Yao Mu, Chen Chen, Shengbo Eben Li, Hang Zhao, Keqiang Li

Abstract: Motion prediction is crucial for autonomous vehicles to operate safely in complex traffic environments. Extracting effective spatiotemporal relationships among traffic elements is key to accurate forecasting. Inspired by the successful practice of pretrained large language models, this paper presents SEPT, a modeling framework that leverages self-supervised learning to develop powerful spatiotemporal understanding for complex traffic scenes. Specifically, our approach involves three masking-reconstruction modeling tasks on scene inputs, including agents' trajectories and the road network, pretraining the scene encoder to capture kinematics within trajectories, the spatial structure of the road network, and interactions among roads and agents. The pretrained encoder is then finetuned on the downstream forecasting task. Extensive experiments demonstrate that SEPT, without elaborate architectural design or manual feature engineering, achieves state-of-the-art performance on the Argoverse 1 and Argoverse 2 motion forecasting benchmarks, outperforming previous methods on all main metrics by a large margin.
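
The abstract does not spell out the three masking-reconstruction tasks, so the following is only a minimal sketch of the general idea on a single 2D trajectory: hide a random subset of timesteps, then score a reconstruction only on the hidden steps. The function names, the masking ratio, and the mean-squared-error objective are all assumptions made for illustration, not details taken from SEPT.

```python
import random
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def mask_trajectory(
    traj: List[Point], mask_ratio: float, seed: int = 0
) -> Tuple[List[Optional[Point]], List[int]]:
    """Hide a random subset of timesteps (hypothetical helper).

    Returns the trajectory with hidden steps replaced by None,
    together with the sorted indices of the hidden steps.
    """
    if not traj:
        raise ValueError("trajectory must be non-empty")
    if not 0.0 < mask_ratio < 1.0:
        raise ValueError("mask_ratio must lie strictly between 0 and 1")
    n_mask = max(1, round(len(traj) * mask_ratio))
    rng = random.Random(seed)
    masked_idx = sorted(rng.sample(range(len(traj)), n_mask))
    hidden = set(masked_idx)
    masked = [None if i in hidden else p for i, p in enumerate(traj)]
    return masked, masked_idx


def masked_mse(pred: List[Point], target: List[Point], masked_idx: List[int]) -> float:
    """Mean squared position error, evaluated on the hidden steps only."""
    total = 0.0
    for i in masked_idx:
        dx = pred[i][0] - target[i][0]
        dy = pred[i][1] - target[i][1]
        total += dx * dx + dy * dy
    return total / len(masked_idx)
```

In a pretraining loop, an encoder would receive the masked trajectory and be trained to minimize a loss of this kind; the encoder itself is outside the scope of this sketch.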

Uncertainty-Aware Decision Transformer for Stochastic Driving Environments

Sep 28, 2023

Zenan Li, Fan Nie, Qiao Sun, Fang Da, Hang Zhao

Abstract: Offline Reinforcement Learning (RL) has emerged as a promising framework for learning policies without active interactions, making it especially appealing for autonomous driving tasks. Recent successes of Transformers inspire casting offline RL as sequence modeling, which performs well in long-horizon tasks. However, such methods are overly optimistic in stochastic environments, incorrectly assuming that the same goal can be consistently achieved by identical actions. In this paper, we introduce an UNcertainty-awaRE deciSion Transformer (UNREST) for planning in stochastic driving environments without introducing additional transition or complex generative models. Specifically, UNREST estimates state uncertainties by the conditional mutual information between transitions and returns, and segments sequences accordingly. Discovering the 'uncertainty accumulation' and 'temporal locality' properties of driving environments, UNREST replaces the global returns in decision transformers with less uncertain truncated returns, to learn from true outcomes of agent actions rather than environment transitions. We also dynamically evaluate environmental uncertainty during inference for cautious planning. Extensive experimental results demonstrate UNREST's superior performance in various driving scenarios and the power of our uncertainty estimation strategy.
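
To make the contrast between global and truncated returns concrete, here is a small sketch. It assumes a fixed truncation horizon for simplicity; according to the abstract, UNREST instead segments sequences based on estimated uncertainty, which is not reproduced here. The function names and the `horizon` parameter are hypothetical.

```python
from typing import List


def returns_to_go(rewards: List[float]) -> List[float]:
    """Global return-to-go: sum of rewards from step t to the end of the episode."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running += rewards[t]
        out[t] = running
    return out


def truncated_returns_to_go(rewards: List[float], horizon: int) -> List[float]:
    """Truncated return: sum of rewards over steps t .. t + horizon - 1 only.

    The fixed horizon is an illustrative assumption, not the paper's
    uncertainty-based segmentation.
    """
    if horizon < 1:
        raise ValueError("horizon must be at least 1")
    # Prefix sums let each window sum be read off in constant time.
    prefix = [0.0]
    for r in rewards:
        prefix.append(prefix[-1] + r)
    n = len(rewards)
    return [prefix[min(t + horizon, n)] - prefix[t] for t in range(n)]


rewards = [1.0, 2.0, 3.0, 4.0]
print(returns_to_go(rewards))               # [10.0, 9.0, 7.0, 4.0]
print(truncated_returns_to_go(rewards, 2))  # [3.0, 5.0, 7.0, 4.0]
```

The truncated target at step 0 depends only on the next two rewards, so it is less exposed to randomness that accumulates later in the episode.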

AutoEncoding Tree for City Generation and Applications

Sep 27, 2023

Wenyu Han, Congcong Wen, Lazarus Chok, Yan Liang Tan, Sheung Lung Chan, Hang Zhao, Chen Feng

Abstract: City modeling and generation have attracted increased interest in various applications, including gaming, urban planning, and autonomous driving. Unlike previous works focused on the generation of single objects or indoor scenes, the huge volumes of spatial data in cities pose a challenge to generative models. Furthermore, the scarcity of publicly available 3D real-world city datasets also hinders the development of methods for city generation. In this paper, we first collect over 3,000,000 geo-referenced objects for New York, Zurich, Tokyo, Berlin, Boston, and several other large cities. Based on this dataset, we propose AETree, a tree-structured auto-encoder neural network, for city generation. Specifically, we first propose a novel Spatial-Geometric Distance (SGD) metric to measure the similarity between building layouts and then construct a binary tree over the raw geometric data of buildings based on the SGD metric. Next, we present a tree-structured network whose encoder learns to extract and merge spatial information iteratively from the bottom up. The resulting global representation is reversely decoded for reconstruction or generation. To address the issue of long-range dependency as the level of the tree increases, a Long Short-Term Memory (LSTM) cell is employed as a basic network element of the proposed AETree. Moreover, we introduce a novel metric, Overlapping Area Ratio (OAR), to quantitatively evaluate the generation results. Experiments on the collected dataset demonstrate the effectiveness of the proposed model on 2D and 3D city generation. Furthermore, the latent features learned by AETree can serve downstream urban planning applications.

Boosting Offline Reinforcement Learning for Autonomous Driving with Hierarchical Latent Skills

Sep 24, 2023

Zenan Li, Fan Nie, Qiao Sun, Fang Da, Hang Zhao

Abstract: Learning-based vehicle planning is receiving increasing attention with the emergence of diverse driving simulators and large-scale driving datasets. While offline reinforcement learning (RL) is well suited for these safety-critical tasks, it still struggles to plan over extended periods. In this work, we present a skill-based framework that enhances offline RL to overcome the long-horizon vehicle planning challenge. Specifically, we design a variational autoencoder (VAE) to learn skills from offline demonstrations. To mitigate posterior collapse of common VAEs, we introduce a two-branch sequence encoder to capture both discrete options and continuous variations of the complex driving skills. The final policy treats learned skills as actions and can be trained by any off-the-shelf offline RL algorithm. This facilitates a shift in focus from per-step actions to temporally extended skills, thereby enabling long-term reasoning into the future. Extensive results on CARLA show that our model consistently outperforms strong baselines in both training and new scenarios. Additional visualizations and experiments demonstrate the interpretability and transferability of extracted skills.

Robot Parkour Learning

Sep 12, 2023

Ziwen Zhuang, Zipeng Fu, Jianren Wang, Christopher Atkeson, Soeren Schwertfeger, Chelsea Finn, Hang Zhao

Abstract: Parkour is a grand challenge for legged locomotion that requires robots to overcome various obstacles rapidly in complex environments. Existing methods can generate either diverse but blind locomotion skills or vision-based but specialized skills by using reference animal data or complex rewards. However, autonomous parkour requires robots to learn generalizable skills that are both vision-based and diverse to perceive and react to various scenarios. In this work, we propose a system for learning a single end-to-end vision-based parkour policy of diverse parkour skills using a simple reward without any reference motion data. We develop a reinforcement learning method inspired by direct collocation to generate parkour skills, including climbing over high obstacles, leaping over large gaps, crawling beneath low barriers, squeezing through thin slits, and running. We distill these skills into a single vision-based parkour policy and transfer it to a quadrupedal robot using its egocentric depth camera. We demonstrate that our system can empower two different low-cost robots to autonomously select and execute appropriate parkour skills to traverse challenging real-world environments.

* CoRL 2023 (Oral). Project website at https://robot-parkour.github.io

StreamMapNet: Streaming Mapping Network for Vectorized Online HD Map Construction

Aug 27, 2023

Tianyuan Yuan, Yicheng Liu, Yue Wang, Yilun Wang, Hang Zhao

Abstract: High-Definition (HD) maps are essential for the safety of autonomous driving systems. While existing techniques employ camera images and onboard sensors to generate vectorized high-precision maps, they are constrained by their reliance on single-frame input. This approach limits their stability and performance in complex scenarios such as occlusions, largely due to the absence of temporal information. Moreover, their performance diminishes when applied to broader perception ranges. In this paper, we present StreamMapNet, a novel online mapping pipeline adept at long-sequence temporal modeling of videos. StreamMapNet employs multi-point attention and temporal information, which empowers the construction of large-range local HD maps with high stability and further addresses the limitations of existing methods. Furthermore, we critically examine the widely used online HD map construction benchmarks and datasets, Argoverse2 and nuScenes, revealing significant bias in the existing evaluation protocols. We propose to resplit the benchmarks according to geographical spans, promoting fair and precise evaluations. Experimental results validate that StreamMapNet significantly outperforms existing methods across all settings while maintaining an online inference speed of 14.2 FPS. Our code is available at https://github.com/yuantianyuan01/StreamMapNet.

Radio2Text: Streaming Speech Recognition Using mmWave Radio Signals

Aug 16, 2023

Running Zhao, Jiangtao Yu, Hang Zhao, Edith C. H. Ngai

Abstract: Millimeter wave (mmWave) based speech recognition opens up more possibilities for audio-related applications, such as conference speech transcription and eavesdropping. However, considering the practicality in real scenarios, latency and recognizable vocabulary size are two critical factors that cannot be overlooked. In this paper, we propose Radio2Text, the first mmWave-based system for streaming automatic speech recognition (ASR) with a vocabulary size exceeding 13,000 words. Radio2Text is based on a tailored streaming Transformer that is capable of effectively learning representations of speech-related features, paving the way for streaming ASR with a large vocabulary. To compensate for the inability of streaming networks to access entire future inputs, we propose the Guidance Initialization, which facilitates the transfer of feature knowledge related to the global context from the non-streaming Transformer to the tailored streaming Transformer through weight inheritance. Further, we propose a cross-modal structure based on knowledge distillation (KD), named cross-modal KD, to mitigate the negative effect of low-quality mmWave signals on recognition performance. In the cross-modal KD, the audio streaming Transformer provides feature and response guidance that inherit fruitful and accurate speech information to supervise the training of the tailored radio streaming Transformer. The experimental results show that our Radio2Text can achieve a character error rate of 5.7% and a word error rate of 9.4% for the recognition of a vocabulary consisting of over 13,000 words.

* Accepted by Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (ACM IMWUT/UbiComp 2023)
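
For readers unfamiliar with the two metrics reported in the abstract, character error rate (CER) and word error rate (WER) are conventionally computed as the Levenshtein edit distance between the reference and the hypothesis, divided by the reference length in characters or words. The sketch below follows that standard definition; it is not the authors' evaluation script, and details such as text normalization are assumptions left out here.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference: str, hypothesis: str) -> float:
    """Character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, `word_error_rate("the cat sat", "the cat sit")` counts one substituted word out of three reference words.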

Learning-based Control for PMSM Using Distributed Gaussian Processes with Optimal Aggregation Strategy

Jul 26, 2023

Zhenxiao Yin, Xiaobing Dai, Zewen Yang, Yang Shen, Georges Hattab, Hang Zhao

Abstract: The growing demand for accurate control in varying and unknown environments has sparked a corresponding increase in the requirements for power supply components, including permanent magnet synchronous motors (PMSMs). To infer the unknown part of the system, machine learning techniques are widely employed, especially Gaussian process regression (GPR) due to its flexibility in continuous system modeling and its guaranteed performance. For practical implementation, distributed GPR is adopted to alleviate the high computational complexity. However, the study of distributed GPR from a control perspective remains an open problem. In this paper, a control-aware optimal aggregation strategy of distributed GPR for PMSMs is proposed based on the Lyapunov stability theory. This strategy exclusively leverages the posterior mean, thereby obviating the need for computationally intensive calculations associated with posterior variance in alternative approaches. Moreover, the straightforward calculation process of our proposed strategy lends itself to seamless implementation in high-frequency PMSM control. The effectiveness of the proposed strategy is demonstrated in the simulations.

Reconstructing Three-decade Global Fine-Grained Nighttime Light Observations by a New Super-Resolution Framework

Jul 14, 2023

Jinyu Guo, Feng Zhang, Hang Zhao, Baoxiang Pan, Linlu Mei

Abstract: Satellite-collected nighttime light provides a unique perspective on human activities, including urbanization, population growth, and epidemics. Yet, long-term and fine-grained nighttime light observations are lacking, leaving the analysis and applications of decades of light changes in urban facilities undeveloped. To fill this gap, we developed an innovative framework and used it to design a new super-resolution model that reconstructs low-resolution nighttime light data into high resolution. The validation of one billion data points shows that the correlation coefficient of our model at the global scale reaches 0.873, which is significantly higher than that of other existing models (maximum = 0.713). Our model also outperforms existing models at the national and urban scales. Furthermore, through an inspection of airports and roads, only our model's image details can reveal the historical development of these facilities. We provide the long-term and fine-grained nighttime light observations to promote research on human activities. The dataset is available at https://doi.org/10.5281/zenodo.7859205.
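
The abstract reports a "correlation coefficient" without naming the variant. Assuming it refers to Pearson's r between reconstructed and reference pixel values, which is a guess rather than something the abstract states, the statistic can be sketched as follows. The function name and the flat-list input format are illustrative only.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long samples.

    Assumed here to be the 'correlation coefficient' of the abstract.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 indicates that the reconstructed brightness varies almost linearly with the reference; it says nothing about a constant offset or scale error between the two.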

Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models

Jun 29, 2023

Simian Luo, Chuanhao Yan, Chenxu Hu, Hang Zhao

Abstract: The Video-to-Audio (V2A) model has recently gained attention for its practical application in generating audio directly from silent videos, particularly in video/film production. However, previous V2A methods have limited generation quality in terms of temporal synchronization and audio-visual relevance. We present Diff-Foley, a synchronized Video-to-Audio synthesis method with a latent diffusion model (LDM) that generates high-quality audio with improved synchronization and audio-visual relevance. We adopt contrastive audio-visual pretraining (CAVP) to learn more temporally and semantically aligned features, then train an LDM with CAVP-aligned visual features on the spectrogram latent space. The CAVP-aligned features enable the LDM to capture subtler audio-visual correlations via a cross-attention module. We further significantly improve sample quality with 'double guidance'. Diff-Foley achieves state-of-the-art V2A performance on the current large-scale V2A dataset. Furthermore, we demonstrate Diff-Foley's practical applicability and generalization capabilities via downstream finetuning. Project page: https://diff-foley.github.io/
