CAOR
Abstract: Within a perception framework for autonomous mobile and robotic systems, semantic analysis of 3D point clouds, typically generated by LiDARs, is key to numerous applications such as object detection and recognition, and scene reconstruction. Scene semantic segmentation can be achieved by directly integrating 3D spatial data with specialized deep neural networks. Although this type of data provides rich geometric information about the surrounding environment, it also presents numerous challenges: its unstructured and sparse nature, its unpredictable size, and its demanding computational requirements. These characteristics hinder real-time semantic analysis, particularly on the resource-constrained hardware architectures that constitute the main computational components of numerous robotic applications. Therefore, in this paper, we investigate various 3D semantic segmentation methodologies and analyze their performance and capabilities for resource-constrained inference on embedded NVIDIA Jetson platforms. For a fair comparison, we evaluate them under a standardized training protocol and data augmentation scheme, providing benchmark results on the Jetson AGX Orin and AGX Xavier series for two large-scale outdoor datasets: SemanticKITTI and nuScenes.
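A minimal sketch of the kind of latency measurement such an embedded benchmark relies on, assuming a PyTorch model running on a CUDA-capable Jetson; the model, input, and iteration counts are illustrative placeholders, not the exact protocol of the paper:

```python
import time
import torch

def measure_latency(model, example_input, warmup=10, iters=100):
    """Average single-scan inference time on a CUDA device (e.g. a Jetson)."""
    model.eval().cuda()
    example_input = example_input.cuda()
    with torch.no_grad():
        for _ in range(warmup):           # warm-up to stabilize clocks and caches
            model(example_input)
        torch.cuda.synchronize()          # wait for queued kernels before timing
        start = time.perf_counter()
        for _ in range(iters):
            model(example_input)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters
```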
Abstract: Differentiable volumetric rendering-based methods have made significant progress in novel view synthesis. On one hand, innovative methods have replaced the Neural Radiance Fields (NeRF) network with locally parameterized structures, enabling high-quality renderings in a reasonable time. On the other hand, approaches have used differentiable splatting instead of NeRF's ray casting to optimize radiance fields rapidly using Gaussian kernels, allowing for fine adaptation to the scene. However, differentiable ray casting of irregularly spaced kernels has been scarcely explored, while splatting, despite enabling fast rendering times, is susceptible to clearly visible artifacts. Our work closes this gap by providing a physically consistent formulation of the emitted radiance c and density σ, decomposed with Gaussian functions associated with Spherical Gaussians/Harmonics for all-frequency colorimetric representation. We also introduce a method enabling differentiable ray casting of irregularly distributed Gaussians using an algorithm that integrates radiance fields slab by slab and leverages a BVH structure. This allows our approach to finely adapt to the scene while avoiding splatting artifacts. As a result, we achieve superior rendering quality compared to the state-of-the-art while maintaining reasonable training times and achieving inference speeds of 25 FPS on the Blender dataset. Project page with videos and code: https://raygauss.github.io/
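For context, the emission-absorption volume rendering model that such ray casting integrates along a ray r(t) = o + t d is recalled below, together with one plausible generic form of a Gaussian decomposition of σ and c; this is a sketch of the idea, not the paper's exact parameterization:

```latex
% Classical emission-absorption volume rendering along a ray r(t) = o + t d
C(\mathbf{o},\mathbf{d}) = \int_{t_n}^{t_f} T(t)\,\sigma\!\big(\mathbf{r}(t)\big)\,
      c\!\big(\mathbf{r}(t),\mathbf{d}\big)\,\mathrm{d}t,
\qquad
T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma\!\big(\mathbf{r}(s)\big)\,\mathrm{d}s\right)

% Generic decomposition over Gaussian kernels G_k with per-kernel directional
% colors c_k(d) (e.g. Spherical Harmonics/Gaussians); illustrative form only
\sigma(\mathbf{x}) = \sum_k \sigma_k\, G_k(\mathbf{x}),
\qquad
c(\mathbf{x},\mathbf{d}) =
  \frac{\sum_k \sigma_k\, G_k(\mathbf{x})\, c_k(\mathbf{d})}
       {\sum_k \sigma_k\, G_k(\mathbf{x})}
```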
Abstract: LiDAR semantic segmentation for autonomous driving has been a growing field of interest in the past few years. Datasets and methods have appeared and expanded very quickly, but methods have not been updated to exploit this new availability of data and continue to rely on the same classical datasets. The different ways of performing LiDAR semantic segmentation training and inference can be divided into several subfields, which include the following: domain generalization, the ability to segment data coming from unseen domains; source-to-source segmentation, the ability to segment data coming from the training domain; and pre-training, the ability to create reusable geometric primitives. In this work, we aim to improve results in all of these subfields with the novel approach of multi-source training. Multi-source training relies on the availability of various datasets at training time and uses them together rather than relying on only one dataset. To overcome the common obstacles encountered in multi-source training, we introduce coarse labels and call the newly created multi-source dataset COLA. We propose three applications of this new dataset that display systematic improvement over single-source strategies: COLA-DG for domain generalization (up to +10%), COLA-S2S for source-to-source segmentation (up to +5.3%), and COLA-PT for pre-training (up to +12%).
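A minimal sketch of the fine-to-coarse label remapping that such a shared label set enables, with hypothetical mappings for illustration (the actual COLA taxonomy is defined in the paper):

```python
# Hypothetical fine-to-coarse mappings; the real COLA coarse classes differ.
COARSE_MAP = {
    "semantickitti": {"car": "vehicle", "truck": "vehicle",
                      "road": "ground", "sidewalk": "ground", "person": "human"},
    "nuscenes": {"vehicle.car": "vehicle", "vehicle.truck": "vehicle",
                 "flat.driveable_surface": "ground",
                 "human.pedestrian.adult": "human"},
}

def to_coarse(dataset_name, fine_labels):
    """Remap per-point fine labels of one dataset to the shared coarse label set."""
    mapping = COARSE_MAP[dataset_name]
    return [mapping.get(label, "ignore") for label in fine_labels]
```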
Abstract: LiDAR is a sensor system that supports autonomous driving by gathering precise geometric information about the scene. Exploiting this information for perception is interesting as the amount of available data increases. As the quantitative performance of various perception tasks has improved, the focus has shifted from source-to-source perception to domain adaptation and domain generalization for perception. These new goals require access to a large variety of domains for evaluation. Unfortunately, the various annotation strategies of data providers complicate the computation of cross-domain performance based on the available data. This paper provides a novel dataset, specifically designed for cross-domain evaluation, to make it easier to evaluate the performance of various source datasets. Alongside the dataset, a flexible online benchmark is provided to ensure a fair comparison across methods.
Abstract: Supervised 3D object detection models have shown increasingly strong performance in single-domain cases where the training data comes from the same environment and sensor as the testing data. However, in real-world scenarios, data from the target domain may not be available for fine-tuning or for domain adaptation methods. Indeed, 3D object detection models trained on a source dataset with a specific point distribution have shown difficulties in generalizing to unseen datasets. Therefore, we leverage the information available from several annotated source datasets with our Multi-Dataset Training for 3D Object Detection (MDT3D) method to increase the robustness of 3D object detection models when tested in a new environment with a different sensor configuration. To tackle the labelling gap between datasets, we use a new label mapping based on coarse labels. Furthermore, we show how we manage the mix of datasets during training and introduce a new cross-dataset augmentation method: cross-dataset object injection. We demonstrate that this training paradigm yields improvements for different types of 3D object detection models. The source code and additional results for this research project will be publicly available on GitHub: https://github.com/LouisSF/MDT3D
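An illustrative sketch of what cross-dataset object injection could look like, pasting object point clusters cropped from annotated boxes of one dataset into scans of another; placement is random here, and the collision checks or ground alignment needed in practice are omitted:

```python
import numpy as np

def inject_objects(scene_points, object_bank, rng=None):
    """Paste object point clusters (centered at the origin) into a target scan.

    scene_points : (N, 3) array, the target LiDAR scan.
    object_bank  : list of (M_i, 3) arrays cropped from source-dataset boxes.
    """
    rng = rng or np.random.default_rng()
    augmented = [scene_points]
    for obj in object_bank:
        yaw = rng.uniform(0.0, 2.0 * np.pi)
        rot = np.array([[np.cos(yaw), -np.sin(yaw), 0.0],
                        [np.sin(yaw),  np.cos(yaw), 0.0],
                        [0.0, 0.0, 1.0]])
        offset = np.array([rng.uniform(-30, 30), rng.uniform(-30, 30), 0.0])
        augmented.append(obj @ rot.T + offset)   # rotate, then translate
    return np.concatenate(augmented, axis=0)
```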
Abstract: Algorithms for state estimation of humanoid robots usually assume that the feet remain flat and in a constant position while in contact with the ground. However, this hypothesis is easily violated while walking, especially for human-like gaits with heel-toe motion. This reduces the time during which the contact assumption can be used, or requires higher variances to account for errors. In this paper, we present a novel state estimator based on the extended Kalman filter that can properly handle any contact configuration. We consider multiple inertial measurement units (IMUs) distributed throughout the robot's structure, including on both feet, which are used to track multiple bodies of the robot. This multi-IMU instrumentation setup also has the advantage of allowing the deformations in the robot's structure to be estimated, improving the kinematic model used in the filter. The proposed approach is validated experimentally on the exoskeleton Atalante and is shown to present low drift, performing better than similar single-IMU filters. The obtained trajectory estimates are accurate enough to construct elevation maps that have little distortion with respect to the ground truth.
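For reference, the generic extended Kalman filter recursion that underlies such estimators, as a minimal numpy sketch; the state, process, and measurement models of the actual multi-IMU, multi-contact filter are far richer than this placeholder:

```python
import numpy as np

def ekf_step(x, P, u, z, f, h, F, H, Q, R):
    """One generic EKF iteration (illustrative, not the paper's filter).

    f, h : process and measurement models; F, H : their Jacobians.
    Q, R : process and measurement noise covariances.
    """
    # Prediction
    x_pred = f(x, u)
    F_k = F(x, u)
    P_pred = F_k @ P @ F_k.T + Q
    # Update
    H_k = H(x_pred)
    y = z - h(x_pred)                        # innovation
    S = H_k @ P_pred @ H_k.T + R             # innovation covariance
    K = P_pred @ H_k.T @ np.linalg.inv(S)    # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(len(x)) - K @ H_k) @ P_pred
    return x_new, P_new
```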
Abstract: 3D autonomous driving semantic segmentation using deep learning has become a well-studied subject, providing methods that can reach very high performance. Nonetheless, because of the limited size of the training datasets, these models cannot see every type of object and scene found in real-world applications. The ability to be reliable in these various unknown environments is called domain generalization. Despite its importance, domain generalization is relatively unexplored in the case of 3D autonomous driving semantic segmentation. To fill this gap, this paper presents the first benchmark for this application by testing state-of-the-art methods and discussing the difficulty of tackling LiDAR domain shifts. We also propose the first method designed to address this domain generalization, which we call 3DLabelProp. This method relies on leveraging the geometry and sequentiality of the LiDAR data to enhance its generalization performance by working on partially accumulated point clouds. It reaches a mIoU of 52.6% on SemanticPOSS while being trained only on SemanticKITTI, making it the state-of-the-art method for generalization (+7.4% better than the second-best method). The code for this method will be available on GitHub.
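A minimal sketch of the accumulation step on which such a method can operate, expressing a few consecutive scans in the frame of a reference scan using known sensor poses; this only illustrates partial accumulation, not 3DLabelProp itself:

```python
import numpy as np

def accumulate_scans(scans, poses, ref_idx=-1):
    """Express consecutive LiDAR scans in the frame of a reference scan.

    scans : list of (N_i, 3) point arrays in their own sensor frames.
    poses : list of 4x4 world-from-sensor homogeneous transforms.
    """
    ref_from_world = np.linalg.inv(poses[ref_idx])
    accumulated = []
    for pts, world_from_sensor in zip(scans, poses):
        ref_from_sensor = ref_from_world @ world_from_sensor
        homog = np.hstack([pts, np.ones((pts.shape[0], 1))])
        accumulated.append((homog @ ref_from_sensor.T)[:, :3])
    return np.concatenate(accumulated, axis=0)
```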
Abstract: Scene Completion is the task of completing missing geometry from a partial scan of a scene. The majority of previous methods compute an implicit representation from range data using a Truncated Signed Distance Function (TSDF) on a 3D grid as input to neural networks. The truncation limits but does not remove the ambiguous cases introduced by the sign for non-closed surfaces. As an alternative, we present an Unsigned Distance Function (UDF) called Unsigned Weighted Euclidean Distance (UWED) as input to the scene completion neural networks. UWED is simple and efficient as a surface representation, and can be computed on any noisy point cloud without normals. To obtain the explicit geometry, we present a method for extracting a point cloud from discretized UDF values on a regular grid. We compare different SDFs and UDFs for the scene completion task on indoor and outdoor point clouds collected from RGB-D and LiDAR sensors and show improved completion using the proposed UWED function.
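To fix ideas, a minimal sketch of a plain nearest-neighbor unsigned distance field sampled on a regular grid; UWED additionally weights the contributions of neighboring points, which is not reproduced here:

```python
import numpy as np
from scipy.spatial import cKDTree

def unsigned_distance_grid(points, grid_min, grid_max, resolution):
    """Plain nearest-neighbor UDF on a regular 3D grid (not UWED itself)."""
    axes = [np.arange(lo, hi, resolution) for lo, hi in zip(grid_min, grid_max)]
    gx, gy, gz = np.meshgrid(*axes, indexing="ij")
    queries = np.stack([gx.ravel(), gy.ravel(), gz.ravel()], axis=1)
    dists, _ = cKDTree(points).query(queries)   # distance to the closest point
    return dists.reshape(gx.shape)
```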
Abstract: LiDAR sensors provide rich 3D information about surrounding scenes and are becoming increasingly important for autonomous vehicle tasks such as semantic segmentation, object detection, and tracking. Being able to simulate a LiDAR sensor will accelerate the testing, validation, and deployment of autonomous vehicles while reducing the cost and eliminating the risks of testing in real-world scenarios. To tackle the issue of simulating LiDAR data with high fidelity, we present a pipeline that leverages real-world point clouds acquired by mobile mapping systems. Point-based geometry representations, more specifically splats, have proven their ability to accurately model the underlying surface in very large point clouds. Showing the limits of basic splatting, we introduce an adaptive splat generation method that accurately models the underlying 3D geometry, especially for thin structures. We have also developed a LiDAR simulation that runs 200 times faster than real time by ray casting on the GPU while efficiently handling large point clouds. We test our LiDAR simulation in real-world conditions, showing qualitative and quantitative results against basic splatting and meshing, and demonstrating the superiority of our modeling technique.
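The core geometric primitive of ray casting on splats is the ray-splat intersection; a minimal sketch treating each splat as an oriented disc is given below (the adaptive splat generation and GPU ray casting of the paper go well beyond this):

```python
import numpy as np

def ray_splat_hit(origin, direction, center, normal, radius, eps=1e-9):
    """Intersect a ray with one circular splat (oriented disc).

    Returns the hit distance t along the ray, or None if the ray misses.
    """
    denom = np.dot(direction, normal)
    if abs(denom) < eps:            # ray is parallel to the splat plane
        return None
    t = np.dot(center - origin, normal) / denom
    if t < 0.0:                     # splat lies behind the ray origin
        return None
    hit = origin + t * direction
    if np.linalg.norm(hit - center) > radius:
        return None                 # intersection falls outside the disc
    return t
```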
Abstract: Transfer learning is a proven technique in 2D computer vision to leverage the large amount of data available and achieve high performance with datasets limited in size due to the cost of acquisition or annotation. In 3D, annotation is known to be a costly task; nevertheless, transfer learning methods have only recently been investigated. Unsupervised pre-training has been heavily favored, as no very large annotated datasets are available. In this work, we tackle the case of real-time 3D semantic segmentation of sparse outdoor LiDAR scans. Such datasets have been on the rise, but with different label sets even for the same task. We propose an intermediate-level label set called the coarse labels, which allows all the available data to be leveraged without any manual labeling. This way, we have access to a larger dataset, alongside a simpler semantic segmentation task. With it, we introduce a new pre-training task: coarse label pre-training, also called COLA. We thoroughly analyze the impact of COLA on various datasets and architectures and show that it yields a noticeable performance improvement, especially when the fine-tuning task has access only to a small dataset.
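A minimal sketch of the pre-train-then-fine-tune pattern this enables, assuming a PyTorch model that exposes a linear classification head named `head` (an illustrative assumption, not the paper's architecture):

```python
import torch

def prepare_for_finetuning(model, checkpoint_path, num_fine_classes):
    """Reuse pre-trained backbone weights and reset the head
    for the fine label set of the target dataset."""
    state = torch.load(checkpoint_path, map_location="cpu")
    # keep backbone weights only; drop the coarse-label classification head
    backbone_state = {k: v for k, v in state.items() if not k.startswith("head")}
    model.load_state_dict(backbone_state, strict=False)
    # assumes `model.head` is a torch.nn.Linear layer (hypothetical)
    model.head = torch.nn.Linear(model.head.in_features, num_fine_classes)
    return model
```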