Raquel Urtasun

MP3: A Unified Model to Map, Perceive, Predict and Plan

Jan 18, 2021
Sergio Casas, Abbas Sadat, Raquel Urtasun

High-definition maps (HD maps) are a key component of most modern self-driving systems due to their valuable semantic and geometric information. Unfortunately, building HD maps has proven hard to scale due to their cost as well as the requirements they impose on the localization system, which has to work everywhere with centimeter-level accuracy. Being able to drive without an HD map would be very beneficial to scale self-driving solutions as well as to increase the failure tolerance of existing ones (e.g., if localization fails or the map is not up-to-date). Towards this goal, we propose MP3, an end-to-end approach to mapless driving where the input is raw sensor data and a high-level command (e.g., turn left at the intersection). MP3 predicts intermediate representations in the form of an online map and the current and future state of dynamic agents, and exploits them in a novel neural motion planner to make interpretable decisions taking into account uncertainty. We show that our approach is significantly safer, more comfortable, and can follow commands better than the baselines in challenging long-term closed-loop simulations, as well as when compared to an expert driver in a large-scale real-world dataset.

Exploring Adversarial Robustness of Multi-Sensor Perception Systems in Self Driving

Jan 17, 2021
James Tu, Huichen Li, Xinchen Yan, Mengye Ren, Yun Chen, Ming Liang, Eilyan Bitar, Ersin Yumer, Raquel Urtasun

Modern self-driving perception systems have been shown to improve upon processing complementary inputs such as LiDAR with images. In isolation, 2D images have been found to be extremely vulnerable to adversarial attacks. Yet, there have been limited studies on the adversarial robustness of multi-modal models that fuse LiDAR features with image features. Furthermore, existing works do not consider physically realizable perturbations that are consistent across the input modalities. In this paper, we showcase practical susceptibilities of multi-sensor detection by placing an adversarial object on top of a host vehicle. We focus on physically realizable and input-agnostic attacks as they are feasible to execute in practice, and show that a single universal adversary can hide different host vehicles from state-of-the-art multi-modal detectors. Our experiments demonstrate that successful attacks are primarily caused by easily corrupted image features. Furthermore, we find that in modern sensor fusion methods which project image features into 3D, adversarial attacks can exploit the projection process to generate false positives across distant regions in 3D. Towards more robust multi-modal perception systems, we show that adversarial training with feature denoising can boost robustness to such attacks significantly. However, we find that standard adversarial defenses still struggle to prevent false positives which are also caused by inaccurate associations between 3D LiDAR points and 2D pixels.

Deep Parametric Continuous Convolutional Neural Networks

Jan 17, 2021
Shenlong Wang, Simon Suo, Wei-Chiu Ma, Andrei Pokrovsky, Raquel Urtasun

Standard convolutional neural networks assume a grid structured input is available and exploit discrete convolutions as their fundamental building blocks. This limits their applicability to many real-world applications. In this paper we propose Parametric Continuous Convolution, a new learnable operator that operates over non-grid structured data. The key idea is to exploit parameterized kernel functions that span the full continuous vector space. This generalization allows us to learn over arbitrary data structures as long as their support relationship is computable. Our experiments show significant improvement over the state-of-the-art in point cloud segmentation of indoor and outdoor scenes, and lidar motion estimation of driving scenes.

* Accepted by CVPR 2018
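
The idea of a kernel defined over continuous offsets can be illustrated with a minimal sketch. It is not the authors' implementation: the one-hidden-layer MLP kernel, the k-nearest-neighbour support, the scalar features, and all function names (`mlp_kernel`, `k_nearest`, `continuous_conv`) are assumptions made for illustration only.

```python
import math

def mlp_kernel(offset, w1, b1, w2, b2):
    """Hypothetical kernel g(offset; theta): one hidden ReLU layer, scalar output."""
    hidden = [max(0.0, sum(w * o for w, o in zip(row, offset)) + b)
              for row, b in zip(w1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2

def k_nearest(points, query, k):
    """Indices of the k input points closest to the query (the assumed support)."""
    order = sorted(range(len(points)), key=lambda j: math.dist(points[j], query))
    return order[:k]

def continuous_conv(points, feats, queries, k, params):
    """Output at each query: sum over supporting points of g(x_j - q) * f_j."""
    out = []
    for q in queries:
        total = 0.0
        for j in k_nearest(points, q, k):
            offset = [pj - qj for pj, qj in zip(points[j], q)]
            total += mlp_kernel(offset, *params) * feats[j]
        out.append(total)
    return out

# Toy, non-grid input: four 2D points with one scalar feature each.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
feats = [1.0, 2.0, 3.0, 10.0]
# Kernel parameters chosen so that g is constant 1 (sanity check only).
params = ([[0.0, 0.0]], [1.0], [1.0], 0.0)
result = continuous_conv(points, feats, [(0.1, 0.1)], k=3, params=params)
print(result)
```

With a constant kernel the operator reduces to summing the features of the three nearest points; in a learned setting the kernel parameters would be trained by backpropagation like any other layer weights.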

End-to-end Interpretable Neural Motion Planner

Jan 17, 2021
Wenyuan Zeng, Wenjie Luo, Simon Suo, Abbas Sadat, Bin Yang, Sergio Casas, Raquel Urtasun

In this paper, we propose a neural motion planner (NMP) for learning to drive autonomously in complex urban scenarios that include traffic-light handling, yielding, and interactions with multiple road-users. Towards this goal, we design a holistic model that takes as input raw LiDAR data and an HD map and produces interpretable intermediate representations in the form of 3D detections and their future trajectories, as well as a cost volume defining the goodness of each position that the self-driving car can take within the planning horizon. We then sample a set of diverse physically possible trajectories and choose the one with the minimum learned cost. Importantly, our cost volume is able to naturally capture multi-modality. We demonstrate the effectiveness of our approach in real-world driving data captured in several cities in North America. Our experiments show that the learned cost volume can generate safer planning than all the baselines.

* CVPR 2019 (Oral)
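
The selection step described in the abstract (score sampled trajectories against a cost volume, keep the cheapest) can be sketched as follows. This is an illustrative toy, not the paper's code: the cost volume is hand-written rather than learned, the candidates are listed by hand rather than sampled, and the names `trajectory_cost` and `select_trajectory` are hypothetical.

```python
def trajectory_cost(cost_volume, trajectory):
    """Sum the cost at each waypoint; trajectory[t] = (row, col) at timestep t."""
    return sum(cost_volume[t][r][c] for t, (r, c) in enumerate(trajectory))

def select_trajectory(cost_volume, candidates):
    """Return the index of the minimum-cost candidate and all candidate costs."""
    costs = [trajectory_cost(cost_volume, traj) for traj in candidates]
    best = min(range(len(candidates)), key=costs.__getitem__)
    return best, costs

# Toy cost volume indexed as [timestep][row][col] (assumed layout).
cost_volume = [
    [[0.1, 0.5, 0.9], [0.2, 0.4, 0.8]],  # t = 0
    [[0.3, 0.7, 0.2], [0.6, 0.1, 0.5]],  # t = 1
]
# Hand-picked stand-ins for the sampled, physically feasible trajectories.
candidates = [
    [(0, 0), (0, 1)],
    [(0, 0), (1, 1)],
    [(1, 0), (0, 2)],
]
best, costs = select_trajectory(cost_volume, candidates)
print(best, costs)
```

Because each candidate is scored independently against the same volume, several distinct low-cost regions can coexist, which is one way to read the abstract's remark about multi-modality.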

LaneRCNN: Distributed Representations for Graph-Centric Motion Forecasting

Jan 17, 2021
Wenyuan Zeng, Ming Liang, Renjie Liao, Raquel Urtasun

Forecasting the future behaviors of dynamic actors is an important task in many robotics applications such as self-driving. It is extremely challenging as actors have latent intentions and their trajectories are governed by complex interactions between the other actors, themselves, and the maps. In this paper, we propose LaneRCNN, a graph-centric motion forecasting model. Importantly, relying on a specially designed graph encoder, we learn a local lane graph representation per actor (LaneRoI) to encode its past motions and the local map topology. We further develop an interaction module which permits efficient message passing among local graph representations within a shared global lane graph. Moreover, we parameterize the output trajectories based on lane graphs, a more amenable prediction parameterization. Our LaneRCNN captures the actor-to-actor and the actor-to-map relations in a distributed and map-aware manner. We demonstrate the effectiveness of our approach on the large-scale Argoverse Motion Forecasting Benchmark. We achieve 1st place on the leaderboard and significantly outperform previous best results.

Network Automatic Pruning: Start NAP and Take a Nap

Jan 17, 2021
Wenyuan Zeng, Yuwen Xiong, Raquel Urtasun

Network pruning can significantly reduce the computation and memory footprint of large neural networks. To achieve a good trade-off between model size and performance, popular pruning techniques usually rely on hand-crafted heuristics and require manually setting the compression ratio for each layer. This process is typically time-consuming and requires expert knowledge to achieve good results. In this paper, we propose NAP, a unified and automatic pruning framework for both fine-grained and structured pruning. It can identify unimportant components of a network and automatically decide appropriate compression ratios for different layers, based on a theoretically sound criterion. Towards this goal, NAP uses an efficient approximation of the Hessian for evaluating the importance of components, based on a Kronecker-factored Approximate Curvature method. Despite its ease of use, NAP outperforms previous pruning methods by large margins. For fine-grained pruning, NAP can compress AlexNet and VGG16 by 25x, and ResNet-50 by 6.7x without loss in accuracy on ImageNet. For structured pruning (e.g., channel pruning), it can reduce the FLOPs of VGG16 by 5.4x and ResNet-50 by 2.3x with only a 1% accuracy drop. More importantly, this method is almost free from hyper-parameter tuning and requires no expert knowledge. You can start NAP and then take a nap!

* An updated version of 'MLPrune: Multi-Layer Pruning for Automated Neural Network Compression'
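
One way to see how per-layer compression ratios can fall out automatically is to rank all weights by a single second-order importance score and prune globally. The sketch below is a simplification and not NAP itself: it swaps in a diagonal Hessian with the classic saliency 0.5 * h * w^2 instead of the Kronecker-factored approximation named in the abstract, and the layer names, Hessian values, and the functions `saliency` and `global_prune` are made up for illustration.

```python
def saliency(weight, hess_diag):
    """Second-order importance under a diagonal-Hessian assumption."""
    return 0.5 * hess_diag * weight * weight

def global_prune(layers, hessians, keep_fraction):
    """Keep the globally most important weights; per-layer ratios emerge from the ranking."""
    scored = [(saliency(w, h), name, i)
              for name, ws in layers.items()
              for i, (w, h) in enumerate(zip(ws, hessians[name]))]
    scored.sort(reverse=True)
    n_keep = max(1, round(keep_fraction * len(scored)))
    keep = {(name, i) for _, name, i in scored[:n_keep]}
    masks = {name: [(name, i) in keep for i in range(len(ws))]
             for name, ws in layers.items()}
    ratios = {name: sum(m) / len(m) for name, m in masks.items()}
    return masks, ratios

# Hypothetical two-layer network with made-up weights and Hessian diagonals.
layers = {"conv1": [1.0, -2.0], "fc": [0.1, 0.2, 3.0, -0.05]}
hessians = {"conv1": [1.0, 1.0], "fc": [1.0, 1.0, 1.0, 1.0]}
masks, ratios = global_prune(layers, hessians, keep_fraction=0.5)
print(masks)
print(ratios)
```

Only the global `keep_fraction` is set by hand here; how many weights each layer retains is decided by the shared ranking rather than by a per-layer ratio.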

PLUME: Efficient 3D Object Detection from Stereo Images

Jan 17, 2021
Yan Wang, Bin Yang, Rui Hu, Ming Liang, Raquel Urtasun

3D object detection plays a significant role in various robotic applications including self-driving. While many approaches rely on expensive 3D sensors like LiDAR to produce accurate 3D estimates, stereo-based methods have recently shown promising results at a lower cost. Existing methods tackle the problem in two steps: first depth estimation is performed, a pseudo LiDAR point cloud representation is computed from the depth estimates, and then object detection is performed in 3D space. However, because the two separate tasks are optimized in different metric spaces, the depth estimation is biased towards big objects and may cause sub-optimal performance of 3D detection. In this paper we propose a model that unifies these two tasks in the same metric space for the first time. Specifically, our model directly constructs a pseudo LiDAR feature volume (PLUME) in 3D space, which is used to solve both occupancy estimation and object detection tasks. PLUME achieves state-of-the-art performance on the challenging KITTI benchmark, with significantly reduced inference time compared with existing methods.
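
For context, the existing two-step pipeline that the abstract contrasts against first turns stereo disparity into depth and then back-projects each pixel into a pseudo-LiDAR point. The sketch below shows only that standard pinhole geometry, not PLUME's feature volume; the intrinsics, baseline, and the function names `disparity_to_depth` and `backproject` are illustrative assumptions.

```python
def disparity_to_depth(disparity, focal_px, baseline_m):
    """Rectified stereo relation: depth = focal length * baseline / disparity."""
    return focal_px * baseline_m / disparity

def backproject(depth, fx, fy, cx, cy):
    """Lift a depth map (rows of depths in metres) to 3D camera-frame points.

    Pixel (u, v) with depth z maps to ((u - cx) * z / fx, (v - cy) * z / fy, z).
    Pixels with non-positive depth are treated as invalid and skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points

# Toy 2x2 depth map and made-up intrinsics, for illustration only.
depth = [[2.0, 0.0],
         [4.0, 2.0]]
cloud = backproject(depth, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
print(cloud)
print(disparity_to_depth(10.0, focal_px=700.0, baseline_m=0.5))
```

Since depth is inversely proportional to disparity, a fixed disparity error produces a larger metric error for distant points than for nearby ones, which gives some intuition for why the abstract stresses optimizing both tasks in the same metric space.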

Cost-Efficient Online Hyperparameter Optimization

Jan 17, 2021
Jingkang Wang, Mengye Ren, Ilija Bogunovic, Yuwen Xiong, Raquel Urtasun

Recent work on hyperparameter optimization (HPO) has shown the possibility of training certain hyperparameters together with regular parameters. However, these online HPO algorithms still require running evaluation on a set of validation examples at each training step, steeply increasing the training cost. To decide when to query the validation loss, we model online HPO as a time-varying Bayesian optimization problem, on top of which we propose a novel "costly feedback" setting to capture the concept of the query cost. Under this setting, standard algorithms are cost-inefficient as they evaluate on the validation set at every round. In contrast, the cost-efficient GP-UCB algorithm proposed in this paper queries the unknown function only when the model is less confident about current decisions. We evaluate our proposed algorithm by tuning hyperparameters online for VGG and ResNet on CIFAR-10 and ImageNet100. Our proposed online HPO algorithm reaches human expert-level performance within a single run of the experiment, while incurring only modest computational overhead compared to regular training.

Auto4D: Learning to Label 4D Objects from Sequential Point Clouds

Jan 17, 2021
Bin Yang, Min Bai, Ming Liang, Wenyuan Zeng, Raquel Urtasun

In the past few years we have seen great advances in 3D object detection thanks to deep learning methods. However, they typically rely on large amounts of high-quality labels to achieve good performance, which often require time-consuming and expensive work by human annotators. To address this we propose an automatic annotation pipeline that generates accurate object trajectories in 3D (i.e., 4D labels) from LiDAR point clouds. Unlike previous works that consider single frames at a time, our approach directly operates on sequential point clouds to combine richer object observations. The key idea is to decompose the 4D label into two parts: the 3D size of the object, and its motion path describing the evolution of the object's pose through time. More specifically, given a noisy but easy-to-get object track as initialization, our model first estimates the object size from temporally aggregated observations, and then refines its motion path by considering both frame-wise observations as well as temporal motion cues. We validate the proposed method on a large-scale driving dataset and show that our approach achieves significant improvements over the baselines. We also showcase the benefits of our approach under the annotator-in-the-loop setting.

S3: Neural Shape, Skeleton, and Skinning Fields for 3D Human Modeling

Jan 17, 2021
Ze Yang, Shenlong Wang, Sivabalan Manivasagam, Zeng Huang, Wei-Chiu Ma, Xinchen Yan, Ersin Yumer, Raquel Urtasun

Constructing and animating humans is an important component for building virtual worlds in a wide variety of applications such as virtual reality or robotics testing in simulation. As there are exponentially many variations of humans with different shape, pose and clothing, it is critical to develop methods that can automatically reconstruct and animate humans at scale from real world data. Towards this goal, we represent the pedestrian's shape, pose and skinning weights as neural implicit functions that are directly learned from data. This representation enables us to handle a wide variety of different pedestrian shapes and poses without explicitly fitting a human parametric body model, allowing us to handle a wider range of human geometries and topologies. We demonstrate the effectiveness of our approach on various datasets and show that our reconstructions outperform existing state-of-the-art methods. Furthermore, our re-animation experiments show that we can generate 3D human animations at scale from a single RGB image (and/or an optional LiDAR sweep) as input.
