3D Object Detectors (3D-OD) are crucial for understanding the environment in many robotic tasks, especially autonomous driving. Including 3D information via Lidar sensors improves accuracy greatly. However, such detectors perform poorly on domains they were not trained on, i.e. different locations, sensors, weather, etc., limiting their reliability in safety-critical applications. There exist methods to adapt 3D-ODs to these domains; however, these methods treat 3D-ODs as a black box, neglecting underlying architectural decisions and source-domain training strategies. Instead, we dive deep into the details of 3D-ODs, focusing our efforts on fundamental factors that influence robustness prior to domain adaptation. We systematically investigate four design choices (and the interplay between them) often overlooked in 3D-OD robustness and domain adaptation: architecture, voxel encoding, data augmentations, and anchor strategies. We assess their impact on the robustness of nine state-of-the-art 3D-ODs across six benchmarks encompassing three types of domain gaps - sensor type, weather, and location. Our main findings are: (1) transformer backbones with local point features are more robust than 3D CNNs, (2) test-time anchor size adjustment is crucial for adaptation across geographical locations, significantly boosting scores without retraining, (3) source-domain augmentations allow the model to generalize to low-resolution sensors, and (4) surprisingly, robustness to bad weather is improved when training directly on more clean weather data than on training with bad weather data. We outline our main conclusions and findings to provide practical guidance on developing more robust 3D-ODs.
Currently, High-Definition (HD) maps are a prerequisite for the stable operation of autonomous vehicles. Such maps contain information about all static road objects for the vehicle to consider during navigation, such as road edges, road lanes, crosswalks, and etc. To generate such an HD map, current approaches need to process pre-recorded environment data obtained from onboard sensors. However, recording such a dataset often requires a lot of time and effort. In addition, every time actual road environments are changed, a new dataset should be recorded to generate a relevant HD map. This paper addresses a novel approach that allows to continuously generate or update the HD map using onboard sensor data. When there is no need to pre-record the dataset, updating the HD map can be run in parallel with the main autonomous vehicle navigation pipeline. The proposed approach utilizes the VectorMapNet framework to generate vector road object instances from a sensor data scan. The PolyMerge technique is aimed to merge new instances into previous ones, mitigating detection errors and, therefore, generating or updating the HD map. The performance of the algorithm was confirmed by comparison with ground truth on the NuScenes dataset. Experimental results showed that the mean error for different levels of environment complexity was comparable to the VectorMapNet single instance error.
For augmented reality (AR), it is important that virtual assets appear to `sit among' real world objects. The virtual element should variously occlude and be occluded by real matter, based on a plausible depth ordering. This occlusion should be consistent over time as the viewer's camera moves. Unfortunately, small mistakes in the estimated scene depth can ruin the downstream occlusion mask, and thereby the AR illusion. Especially in real-time settings, depths inferred near boundaries or across time can be inconsistent. In this paper, we challenge the need for depth-regression as an intermediate step. We instead propose an implicit model for depth and use that to predict the occlusion mask directly. The inputs to our network are one or more color images, plus the known depths of any virtual geometry. We show how our occlusion predictions are more accurate and more temporally stable than predictions derived from traditional depth-estimation models. We obtain state-of-the-art occlusion results on the challenging ScanNetv2 dataset and superior qualitative results on real scenes.
It can be easy and even fun to sketch humans in different poses. In contrast, creating those same poses on a 3D graphics "mannequin" is comparatively tedious. Yet 3D body poses are necessary for various downstream applications. We seek to preserve the convenience of 2D sketching while giving users of different skill levels the flexibility to accurately and more quickly pose\slash refine a 3D mannequin. At the core of the interactive system, we propose a machine-learning model for inferring the 3D pose of a CG mannequin from sketches of humans drawn in a cylinder-person style. Training such a model is challenging because of artist variability, a lack of sketch training data with corresponding ground truth 3D poses, and the high dimensionality of human pose-space. Our unique approach to synthesizing vector graphics training data underpins our integrated ML-and-kinematics system. We validate the system by tightly coupling it with a user interface, and by performing a user study, in addition to quantitative comparisons.
Traditionally, 3D indoor scene reconstruction from posed images happens in two phases: per-image depth estimation, followed by depth merging and surface reconstruction. Recently, a family of methods have emerged that perform reconstruction directly in final 3D volumetric feature space. While these methods have shown impressive reconstruction results, they rely on expensive 3D convolutional layers, limiting their application in resource-constrained environments. In this work, we instead go back to the traditional route, and show how focusing on high quality multi-view depth prediction leads to highly accurate 3D reconstructions using simple off-the-shelf depth fusion. We propose a simple state-of-the-art multi-view depth estimator with two main contributions: 1) a carefully-designed 2D CNN which utilizes strong image priors alongside a plane-sweep feature volume and geometric losses, combined with 2) the integration of keyframe and geometric metadata into the cost volume which allows informed depth plane scoring. Our method achieves a significant lead over the current state-of-the-art for depth estimation and close or better for 3D reconstruction on ScanNet and 7-Scenes, yet still allows for online real-time low-memory reconstruction. Code, models and results are available at https://nianticlabs.github.io/simplerecon
The job of a camera operator is more challenging, and potentially dangerous, when filming long moving camera shots. Broadly, the operator must keep the actors in-frame while safely navigating around obstacles, and while fulfilling an artistic vision. We propose a unified hardware and software system that distributes some of the camera operator's burden, freeing them up to focus on safety and aesthetics during a take. Our real-time system provides a solo operator with end-to-end control, so they can balance on-set responsiveness to action vs planned storyboards and framing, while looking where they're going. By default, we film without a field monitor. Our LookOut system is built around a lightweight commodity camera gimbal mechanism, with heavy modifications to the controller, which would normally just provide active stabilization. Our control algorithm reacts to speech commands, video, and a pre-made script. Specifically, our automatic monitoring of the live video feed saves the operator from distractions. In pre-production, an artist uses our GUI to design a sequence of high-level camera "behaviors." Those can be specific, based on a storyboard, or looser objectives, such as "frame both actors." Then during filming, a machine-readable script, exported from the GUI, ties together with the sensor readings to drive the gimbal. To validate our algorithm, we compared tracking strategies, interfaces, and hardware protocols, and collected impressions from a) film-makers who used all aspects of our system, and b) film-makers who watched footage filmed using LookOut.
We wish to detect specific categories of objects, for online vision systems that will run in the real world. Object detection is already very challenging. It is even harder when the images are blurred, from the camera being in a car or a hand-held phone. Most existing efforts either focused on sharp images, with easy to label ground truth, or they have treated motion blur as one of many generic corruptions. Instead, we focus especially on the details of egomotion induced blur. We explore five classes of remedies, where each targets different potential causes for the performance gap between sharp and blurred images. For example, first deblurring an image changes its human interpretability, but at present, only partly improves object detection. The other four classes of remedies address multi-scale texture, out-of-distribution testing, label generation, and conditioning by blur-type. Surprisingly, we discover that custom label generation aimed at resolving spatial ambiguity, ahead of all others, markedly improves object detection. Also, in contrast to findings from classification, we see a noteworthy boost by conditioning our model on bespoke categories of motion blur. We validate and cross-breed the different remedies experimentally on blurred COCO images, producing an easy and practical favorite model with superior detection rates.