Yin Zhou

NeRDi: Single-View NeRF Synthesis with Language-Guided Diffusion as General Image Priors

Dec 06, 2022

Congyue Deng, Chiyu "Max" Jiang, Charles R. Qi, Xinchen Yan, Yin Zhou, Leonidas Guibas, Dragomir Anguelov

Abstract: 2D-to-3D reconstruction is an ill-posed problem, yet humans are good at solving this problem due to their prior knowledge of the 3D world developed over years. Driven by this observation, we propose NeRDi, a single-view NeRF synthesis framework with general image priors from 2D diffusion models. Formulating single-view reconstruction as an image-conditioned 3D generation problem, we optimize the NeRF representations by minimizing a diffusion loss on their arbitrary-view renderings with a pretrained image diffusion model under the input-view constraint. We leverage off-the-shelf vision-language models and introduce a two-section language guidance as conditioning inputs to the diffusion model. This is essentially helpful for improving multiview content coherence as it narrows down the general image prior conditioned on the semantic and visual features of the single-view input image. Additionally, we introduce a geometric loss based on estimated depth maps to regularize the underlying 3D geometry of the NeRF. Experimental results on the DTU MVS dataset show that our method can synthesize novel views with higher quality even compared to existing methods trained on this dataset. We also demonstrate our generalizability in zero-shot NeRF synthesis for in-the-wild images.

Improving the Intra-class Long-tail in 3D Detection via Rare Example Mining

Oct 15, 2022

Chiyu Max Jiang, Mahyar Najibi, Charles R. Qi, Yin Zhou, Dragomir Anguelov

Abstract: Continued improvements in deep learning architectures have steadily advanced the overall performance of 3D object detectors to levels on par with humans for certain tasks and datasets, where the overall performance is mostly driven by common examples. However, even the best performing models suffer from the most naive mistakes when it comes to rare examples that do not appear frequently in the training data, such as vehicles with irregular geometries. Most studies in the long-tail literature focus on class-imbalanced classification problems with known imbalanced label counts per class, but they are not directly applicable to the intra-class long-tail examples in problems with large intra-class variations such as 3D object detection, where instances with the same class label can have drastically varied properties such as shapes and sizes. Other works propose to mitigate this problem using active learning based on the criteria of uncertainty, difficulty, or diversity. In this study, we identify a new conceptual dimension - rareness - to mine new data for improving the long-tail performance of models. We show that rareness, as opposed to difficulty, is the key to data-centric improvements for 3D detectors, since rareness is the result of a lack in data support while difficulty is related to the fundamental ambiguity in the problem. We propose a general and effective method to identify the rareness of objects based on density estimation in the feature space using flow models, and propose a principled cost-aware formulation for mining rare object tracks, which improves overall model performance, but more importantly - significantly improves the performance for rare objects (by 30.97%

* Accepted to European Conference on Computer Vision (ECCV) 2022
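
The idea of scoring rareness by density in feature space can be illustrated with a small sketch. This is not the paper's method: the abstract uses flow models, while the code below swaps in a diagonal Gaussian as the density estimator to stay self-contained. The feature vectors, the names `fit_diag_gaussian` and `rareness`, and the candidate labels are all hypothetical.

```python
import math

def fit_diag_gaussian(features):
    """Fit per-dimension mean and variance (a stand-in for a flow model)."""
    n = len(features)
    dim = len(features[0])
    means = [sum(f[d] for f in features) / n for d in range(dim)]
    # A small constant keeps the variance strictly positive.
    variances = [
        sum((f[d] - means[d]) ** 2 for f in features) / n + 1e-6
        for d in range(dim)
    ]
    return means, variances

def rareness(feature, means, variances):
    """Negative log-likelihood under the fitted density; higher means rarer."""
    return sum(
        0.5 * math.log(2 * math.pi * v) + (x - m) ** 2 / (2 * v)
        for x, m, v in zip(feature, means, variances)
    )

# Hypothetical 2D embeddings of objects seen during training.
train = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 1.2), (1.2, 1.0)]
means, variances = fit_diag_gaussian(train)

# Hypothetical unlabeled candidates to rank for mining.
candidates = {"common": (1.0, 1.0), "rare": (3.0, -1.0)}
scores = {name: rareness(f, means, variances) for name, f in candidates.items()}
most_rare = max(scores, key=scores.get)
print(most_rare, scores)
```

Under these assumptions, a candidate far from the training features receives a low estimated density and therefore a high rareness score, independent of how difficult a detector finds it.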

LESS: Label-Efficient Semantic Segmentation for LiDAR Point Clouds

Oct 14, 2022

Minghua Liu, Yin Zhou, Charles R. Qi, Boqing Gong, Hao Su, Dragomir Anguelov

Abstract: Semantic segmentation of LiDAR point clouds is an important task in autonomous driving. However, training deep models via conventional supervised methods requires large datasets which are costly to label. It is critical to have label-efficient segmentation approaches to scale up the model to new operational domains or to improve performance on rare cases. While most prior works focus on indoor scenes, we are one of the first to propose a label-efficient semantic segmentation pipeline for outdoor scenes with LiDAR point clouds. Our method co-designs an efficient labeling process with semi/weakly supervised learning and is applicable to nearly any 3D semantic segmentation backbone. Specifically, we leverage geometry patterns in outdoor scenes to perform a heuristic pre-segmentation that reduces the manual labeling, and we jointly design the learning targets with the labeling process. In the learning step, we leverage prototype learning to get more descriptive point embeddings and use multi-scan distillation to exploit richer semantics from temporally aggregated point clouds to boost the performance of single-scan models. Evaluated on the SemanticKITTI and the nuScenes datasets, we show that our proposed method outperforms existing label-efficient methods. With extremely limited human annotations (e.g., 0.1% point labels), our proposed method is even highly competitive compared to the fully supervised counterpart with 100% labels.

Motion Inspired Unsupervised Perception and Prediction in Autonomous Driving

Oct 14, 2022

Mahyar Najibi, Jingwei Ji, Yin Zhou, Charles R. Qi, Xinchen Yan, Scott Ettinger, Dragomir Anguelov

Abstract: Learning-based perception and prediction modules in modern autonomous driving systems typically rely on expensive human annotation and are designed to perceive only a handful of predefined object categories. This closed-set paradigm is insufficient for the safety-critical autonomous driving task, where the autonomous vehicle needs to process arbitrarily many types of traffic participants and their motion behaviors in a highly dynamic world. To address this difficulty, this paper pioneers a novel and challenging direction, i.e., training perception and prediction models to understand open-set moving objects, with no human supervision. Our proposed framework uses self-learned flow to trigger an automated meta labeling pipeline to achieve automatic supervision. 3D detection experiments on the Waymo Open Dataset show that our method significantly outperforms classical unsupervised approaches and is even competitive with the counterpart with supervised scene flow. We further show that our approach generates highly promising results in open-set 3D detection and trajectory prediction, confirming its potential in closing the safety gap of fully supervised systems.

* ECCV 2022

LidarNAS: Unifying and Searching Neural Architectures for 3D Point Clouds

Oct 10, 2022

Chenxi Liu, Zhaoqi Leng, Pei Sun, Shuyang Cheng, Charles R. Qi, Yin Zhou, Mingxing Tan, Dragomir Anguelov

Abstract: Developing neural models that accurately understand objects in 3D point clouds is essential for the success of robotics and autonomous driving. However, arguably due to the higher-dimensional nature of the data (as compared to images), existing neural architectures exhibit a large variety in their designs, including but not limited to the views considered, the format of the neural features, and the neural operations used. Lack of a unified framework and interpretation makes it hard to put these designs in perspective, as well as systematically explore new ones. In this paper, we begin by proposing such a unified framework, whose key idea is to factorize the neural networks into a series of view transforms and neural layers. We demonstrate that this modular framework can reproduce a variety of existing works while allowing a fair comparison of backbone designs. Then, we show how this framework can easily materialize into a concrete neural architecture search (NAS) space, allowing a principled NAS-for-3D exploration. In performing evolutionary NAS on the 3D object detection task on the Waymo Open Dataset, not only do we outperform the state-of-the-art models, but also report the interesting finding that NAS tends to discover the same macro-level architecture concept for both the vehicle and pedestrian classes.

* ECCV 2022

RIDDLE: Lidar Data Compression with Range Image Deep Delta Encoding

Jun 02, 2022

Xuanyu Zhou, Charles R. Qi, Yin Zhou, Dragomir Anguelov

Abstract: Lidars are depth measuring sensors widely used in autonomous driving and augmented reality. However, the large volume of data produced by lidars can lead to high costs in data storage and transmission. While lidar data can be represented in two interchangeable forms, 3D point clouds and range images, most previous work focuses on compressing the generic 3D point clouds. In this work, we show that directly compressing the range images can leverage the lidar scanning pattern, compared to compressing the unprojected point clouds. We propose a novel data-driven range image compression algorithm, named RIDDLE (Range Image Deep DeLta Encoding). At its core is a deep model that predicts the next pixel value in a raster scanning order, based on contextual laser shots from both the current and past scans (represented as a 4D point cloud of spherical coordinates and time). The deltas between predictions and original values can then be compressed by entropy encoding. Evaluated on the Waymo Open Dataset and KITTI, our method demonstrates significant improvement in the compression rate (under the same distortion) compared to widely used point cloud and range image compression algorithms as well as recent deep methods.

* 14 pages, 10 figures; CVPR 2022
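
The predict-then-encode-the-delta idea can be sketched in a few lines. This is only an illustration under strong assumptions: the deep predictor from the abstract is replaced by a trivial "previous pixel" predictor, the range values are assumed to be already quantized to integers, and the entropy coder is not implemented (the empirical entropy is reported as a rough proxy for the achievable bits per symbol). All function names and the sample data are hypothetical.

```python
import math
from collections import Counter

def previous_pixel(context):
    """Stand-in predictor: guess that the next value equals the last one seen."""
    return context[-1] if context else 0

def delta_encode(pixels, predict):
    """Residuals between each pixel and its prediction from earlier pixels."""
    return [value - predict(pixels[:i]) for i, value in enumerate(pixels)]

def delta_decode(deltas, predict):
    """Invert delta_encode by replaying the same predictor."""
    pixels = []
    for delta in deltas:
        pixels.append(predict(pixels) + delta)
    return pixels

def empirical_entropy_bits(symbols):
    """Shannon entropy of the symbol histogram, in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Hypothetical quantized ranges along one raster-scan row.
range_image = [100, 101, 102, 103, 104, 105, 106, 107]
deltas = delta_encode(range_image, previous_pixel)
restored = delta_decode(deltas, previous_pixel)

print(deltas)
print(empirical_entropy_bits(range_image), empirical_entropy_bits(deltas))
```

Because the decoder replays the same predictor on already-decoded pixels, the scheme is lossless for integer inputs, and a better predictor concentrates the deltas on fewer values, which is what an entropy coder exploits.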

Multi-modal 3D Human Pose Estimation with 2D Weak Supervision in Autonomous Driving

Dec 22, 2021

Jingxiao Zheng, Xinwei Shi, Alexander Gorban, Junhua Mao, Yang Song, Charles R. Qi, Ting Liu, Visesh Chari, Andre Cornman, Yin Zhou (+2 more)

Abstract: 3D human pose estimation (HPE) in autonomous vehicles (AV) differs from other use cases in many factors, including the 3D resolution and range of data, absence of dense depth maps, failure modes for LiDAR, relative location between the camera and LiDAR, and a high bar for estimation accuracy. Data collected for other use cases (such as virtual reality, gaming, and animation) may therefore not be usable for AV applications. This necessitates the collection and annotation of a large amount of 3D data for HPE in AV, which is time-consuming and expensive. In this paper, we propose one of the first approaches to alleviate this problem in the AV setting. Specifically, we propose a multi-modal approach which uses 2D labels on RGB images as weak supervision to perform 3D HPE. The proposed multi-modal architecture incorporates LiDAR and camera inputs with an auxiliary segmentation branch. On the Waymo Open Dataset, our approach achieves a 22% relative improvement over a camera-only 2D HPE baseline, and a 6% improvement over a LiDAR-only model. Finally, careful ablation studies and part-based analysis illustrate the advantages of each of our contributions.

Revisiting 3D Object Detection From an Egocentric Perspective

Dec 14, 2021

Boyang Deng, Charles R. Qi, Mahyar Najibi, Thomas Funkhouser, Yin Zhou, Dragomir Anguelov

Abstract: 3D object detection is a key module for safety-critical robotics applications such as autonomous driving. For these applications, we care most about how the detections affect the ego-agent's behavior and safety (the egocentric perspective). Intuitively, we seek more accurate descriptions of object geometry when it's more likely to interfere with the ego-agent's motion trajectory. However, current detection metrics, based on box Intersection-over-Union (IoU), are object-centric and aren't designed to capture the spatio-temporal relationship between objects and the ego-agent. To address this issue, we propose a new egocentric measure to evaluate 3D object detection, namely Support Distance Error (SDE). Our analysis based on SDE reveals that the egocentric detection quality is bounded by the coarse geometry of the bounding boxes. Given the insight that SDE would benefit from more accurate geometry descriptions, we propose to represent objects as amodal contours, specifically amodal star-shaped polygons, and devise a simple model, StarPoly, to predict such contours. Our experiments on the large-scale Waymo Open Dataset show that SDE better reflects the impact of detection quality on the ego-agent's safety than IoU does, and that the estimated contours from StarPoly consistently improve the egocentric detection quality over recent 3D object detectors.

* Published in NeurIPS 2021
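
A toy example can show why an object-centric metric such as IoU misses the egocentric view. The sketch below uses axis-aligned 2D boxes `(xmin, ymin, xmax, ymax)` with the ego-agent at the origin, and compares IoU against the distance from the ego to the nearest point of each box. That nearest-point distance is a hypothetical stand-in chosen for illustration; it is not the paper's definition of SDE, which the abstract does not spell out. The boxes and function names are likewise assumptions.

```python
import math

def area(box):
    xmin, ymin, xmax, ymax = box
    return max(0.0, xmax - xmin) * max(0.0, ymax - ymin)

def iou(a, b):
    """Intersection-over-Union of two axis-aligned boxes."""
    overlap = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    inter = area(overlap)
    return inter / (area(a) + area(b) - inter)

def nearest_distance_to_origin(box):
    """Distance from the ego (at the origin) to the closest point of the box."""
    xmin, ymin, xmax, ymax = box
    dx = max(xmin, 0.0, -xmax)
    dy = max(ymin, 0.0, -ymax)
    return math.hypot(dx, dy)

ground_truth = (4.0, -1.0, 6.0, 1.0)
pred_away = (4.5, -1.0, 6.5, 1.0)    # shifted 0.5 m away from the ego
pred_toward = (3.5, -1.0, 5.5, 1.0)  # shifted 0.5 m toward the ego

iou_away = iou(ground_truth, pred_away)
iou_toward = iou(ground_truth, pred_toward)
dist_gt = nearest_distance_to_origin(ground_truth)
dist_away = nearest_distance_to_origin(pred_away)
dist_toward = nearest_distance_to_origin(pred_toward)

print(iou_away, iou_toward)
print(dist_gt, dist_away, dist_toward)
```

Both predictions score the same IoU, yet one overestimates the free space in front of the ego and the other underestimates it, a distinction that only an ego-relative measure can register.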

SPG: Unsupervised Domain Adaptation for 3D Object Detection via Semantic Point Generation

Aug 15, 2021

Qiangeng Xu, Yin Zhou, Weiyue Wang, Charles R. Qi, Dragomir Anguelov

Abstract: In autonomous driving, a LiDAR-based object detector should perform reliably at different geographic locations and under various weather conditions. While recent 3D detection research focuses on improving performance within a single domain, our study reveals that the performance of modern detectors can drop drastically across domains. In this paper, we investigate unsupervised domain adaptation (UDA) for LiDAR-based 3D object detection. On the Waymo Domain Adaptation dataset, we identify the deteriorating point cloud quality as the root cause of the performance drop. To address this issue, we present Semantic Point Generation (SPG), a general approach to enhance the reliability of LiDAR detectors against domain shifts. Specifically, SPG generates semantic points at the predicted foreground regions and faithfully recovers missing parts of the foreground objects, which are caused by phenomena such as occlusions, low reflectance or weather interference. By merging the semantic points with the original points, we obtain an augmented point cloud, which can be directly consumed by modern LiDAR-based detectors. To validate the wide applicability of SPG, we experiment with two representative detectors, PointPillars and PV-RCNN. On the UDA task, SPG significantly improves both detectors across all object categories of interest and at all difficulty levels. SPG can also benefit object detection in the original domain. On the Waymo Open Dataset and KITTI, SPG improves 3D detection results of these two methods across all categories. Combined with PV-RCNN, SPG achieves state-of-the-art 3D detection results on KITTI.

Large Scale Interactive Motion Forecasting for Autonomous Driving: The Waymo Open Motion Dataset

Apr 20, 2021

Scott Ettinger, Shuyang Cheng, Benjamin Caine, Chenxi Liu, Hang Zhao, Sabeek Pradhan, Yuning Chai, Ben Sapp, Charles Qi, Yin Zhou (+8 more)

Abstract: As autonomous driving systems mature, motion forecasting has received increasing attention as a critical requirement for planning. Of particular importance are interactive situations such as merges, unprotected turns, etc., where predicting individual object motion is not sufficient. Joint predictions of multiple objects are required for effective route planning. There has been a critical need for high-quality motion data that is rich in both interactions and annotation to develop motion planning models. In this work, we introduce the most diverse interactive motion dataset to our knowledge, and provide specific labels for interacting objects suitable for developing joint prediction models. With over 100,000 scenes, each 20 seconds long at 10 Hz, our new dataset contains more than 570 hours of unique data over 1750 km of roadways. It was collected by mining for interesting interactions between vehicles, pedestrians, and cyclists across six cities within the United States. We use a high-accuracy 3D auto-labeling system to generate high-quality 3D bounding boxes for each road agent, and provide corresponding high-definition 3D maps for each scene. Furthermore, we introduce a new set of metrics that provides a comprehensive evaluation of both single agent and joint agent interaction motion forecasting models. Finally, we provide strong baseline models for individual-agent prediction and joint-prediction. We hope that this new large-scale interactive motion dataset will provide new opportunities for advancing motion forecasting models.

* 15 pages, 10 figures
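
The dataset figures quoted in the abstract can be cross-checked with simple arithmetic. The sketch below uses only the numbers stated there (20-second scenes at 10 Hz, over 100,000 scenes, more than 570 hours) and assumes that scenes do not overlap, so each one contributes a full 20 seconds of unique data; the variable names are illustrative.

```python
SCENE_SECONDS = 20     # stated scene length
SAMPLE_RATE_HZ = 10    # stated sampling rate
MIN_SCENES = 100_000   # "over 100,000 scenes"
MIN_HOURS = 570        # "more than 570 hours"

# Samples per scene, assuming one sample per 0.1 s interval.
samples_per_scene = SCENE_SECONDS * SAMPLE_RATE_HZ

# Hours covered by exactly 100,000 non-overlapping scenes.
hours_at_min_scenes = MIN_SCENES * SCENE_SECONDS / 3600

# Scenes needed to reach 570 hours under the same assumption.
scenes_for_min_hours = MIN_HOURS * 3600 / SCENE_SECONDS

print(samples_per_scene)
print(round(hours_at_min_scenes, 1))
print(scenes_for_min_hours)
```

Exactly 100,000 scenes would cover only about 555.6 hours, so the 570-hour figure implies at least 102,600 scenes under this assumption, which is consistent with the "over 100,000" wording.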
