Renaud Marlet

Revisiting the Distillation of Image Representations into Point Clouds for Autonomous Driving

Oct 26, 2023
Gilles Puy, Spyros Gidaris, Alexandre Boulch, Oriane Siméoni, Corentin Sautier, Patrick Pérez, Andrei Bursuc, Renaud Marlet

Self-supervised image networks can be used to address complex 2D tasks (e.g., semantic segmentation, object discovery) very efficiently and with little or no downstream supervision. However, self-supervised 3D networks on lidar data do not yet perform as well. A few methods therefore propose to distill high-quality self-supervised 2D features into 3D networks. The most recent ones, applied to autonomous driving data, show promising results. Yet, a performance gap persists between these distilled features and fully-supervised ones. In this work, we revisit 2D-to-3D distillation. First, we propose, for semantic segmentation, a simple approach that leads to a significant improvement over prior 3D distillation methods. Second, we show that distillation into high-capacity 3D networks is key to reaching high-quality 3D features. This allows us to significantly close the gap between unsupervised distilled 3D features and fully-supervised ones. Last, we show that our high-quality distilled representations can also be used for open-vocabulary segmentation and background/foreground discovery.
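
As a rough illustration of the 2D-to-3D distillation principle discussed above (not the paper's exact pipeline), the sketch below pairs each lidar point with the frozen 2D feature of the pixel it projects onto and pulls the 3D network's point features toward it with a cosine-similarity loss; the tensor shapes, the precomputed pixel indices, and the choice of loss are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def distillation_loss(feats_3d, feats_2d, pixel_idx):
    """Cosine distillation loss (illustrative sketch, not the paper's exact recipe).

    feats_3d : (N, C) features from the 3D (lidar) backbone, one per point.
    feats_2d : (H*W, C) frozen features from the self-supervised image network.
    pixel_idx: (N,) index of the pixel each lidar point projects onto,
               assumed precomputed from camera calibration.
    """
    target = feats_2d[pixel_idx]                   # (N, C) teacher feature per point
    feats_3d = F.normalize(feats_3d, dim=-1)
    target = F.normalize(target, dim=-1).detach()  # the 2D teacher stays frozen
    # Maximize cosine similarity between 3D student and 2D teacher features.
    return (1.0 - (feats_3d * target).sum(dim=-1)).mean()

# Toy usage with random tensors.
loss = distillation_loss(torch.randn(1000, 64), torch.randn(224 * 224, 64),
                         torch.randint(0, 224 * 224, (1000,)))
```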

BEVContrast: Self-Supervision in BEV Space for Automotive Lidar Point Clouds

Oct 26, 2023
Corentin Sautier, Gilles Puy, Alexandre Boulch, Renaud Marlet, Vincent Lepetit

We present a surprisingly simple and efficient method for self-supervision of a 3D backbone on automotive Lidar point clouds. We design a contrastive loss between features of Lidar scans captured in the same scene. Several such approaches have been proposed in the literature, from PointContrast, which contrasts at the level of points, to the state-of-the-art TARL, which contrasts at the level of segments, roughly corresponding to objects. While the former enjoys great simplicity of implementation, it is surpassed by the latter, which however requires costly pre-processing. In BEVContrast, we define our contrast at the level of 2D cells in the Bird's Eye View plane. The resulting cell-level representations offer a good trade-off between the point-level representations exploited in PointContrast and the segment-level representations exploited in TARL: we retain the simplicity of PointContrast (cell representations are cheap to compute) while surpassing the performance of TARL in downstream semantic segmentation.
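
To make the cell-level idea concrete, here is a minimal sketch assuming a simple average pooling of point features into BEV cells and a standard InfoNCE loss between cells of two scans of the same scene; the cell size, grid extent, and the assumption that the two inputs are already restricted to cells visible in both scans (aligned row by row) are illustrative choices, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def bev_pool(points_xy, feats, cell=1.0, grid=128):
    """Average point features falling into the same BEV cell (illustrative)."""
    ij = (points_xy / cell).long().clamp(0, grid - 1)          # (N, 2) cell coordinates
    flat = ij[:, 0] * grid + ij[:, 1]                           # (N,) flattened cell index
    pooled = torch.zeros(grid * grid, feats.shape[1]).index_add_(0, flat, feats)
    count = torch.zeros(grid * grid).index_add_(0, flat, torch.ones(len(flat)))
    occupied = count > 0
    return pooled[occupied] / count[occupied, None], occupied   # cell features, occupancy mask

def cell_contrast(cells_a, cells_b, temperature=0.07):
    """InfoNCE between features of the same BEV cells seen in two registered scans."""
    a, b = F.normalize(cells_a, dim=-1), F.normalize(cells_b, dim=-1)
    logits = a @ b.t() / temperature          # (M, M) cell-to-cell similarities
    labels = torch.arange(len(a))             # matching cells lie on the diagonal
    return F.cross_entropy(logits, labels)
```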

* Accepted to 3DV 2024 

DiffHPE: Robust, Coherent 3D Human Pose Lifting with Diffusion

Sep 04, 2023
Cédric Rommel, Eduardo Valle, Mickaël Chen, Souhaiel Khalfaoui, Renaud Marlet, Matthieu Cord, Patrick Pérez

We present an innovative approach to 3D Human Pose Estimation (3D-HPE) by integrating cutting-edge diffusion models, which have revolutionized diverse fields, but are relatively unexplored in 3D-HPE. We show that diffusion models enhance the accuracy, robustness, and coherence of human pose estimations. We introduce DiffHPE, a novel strategy for harnessing diffusion models in 3D-HPE, and demonstrate its ability to refine standard supervised 3D-HPE. We also show how diffusion models lead to more robust estimations in the face of occlusions, and improve the time-coherence and the sagittal symmetry of predictions. Using the Human3.6M dataset, we illustrate the effectiveness of our approach and its superiority over existing models, even under adverse situations where the occlusion patterns in training do not match those in inference. Our findings indicate that while standalone diffusion models provide commendable performance, their accuracy is even better in combination with supervised models, opening exciting new avenues for 3D-HPE research.
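
As a rough, hedged sketch of how a diffusion model can be trained to lift 2D keypoints to 3D poses, the snippet below noises the ground-truth 3D pose and trains a small MLP to predict that noise, conditioned on the 2D pose and the timestep (a standard DDPM-style objective); the number of joints, the denoiser architecture, and the conditioning scheme are assumptions, not the actual DiffHPE design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

J = 17                                   # number of joints (assumed)
denoiser = nn.Sequential(                # predicts the noise added to the 3D pose
    nn.Linear(J * 3 + J * 2 + 1, 256), nn.ReLU(),
    nn.Linear(256, J * 3),
)
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def training_step(pose_3d, pose_2d):
    """One DDPM-style step: noise the 3D pose, predict the noise given the 2D pose."""
    t = torch.randint(0, 1000, (pose_3d.shape[0],))
    a = alphas_bar[t].unsqueeze(1)                        # (B, 1) cumulative noise schedule
    noise = torch.randn_like(pose_3d)
    noisy = a.sqrt() * pose_3d + (1 - a).sqrt() * noise   # forward diffusion
    inp = torch.cat([noisy, pose_2d, t.float().unsqueeze(1) / 1000], dim=1)
    return F.mse_loss(denoiser(inp), noise)

# Toy usage with a batch of 8 random poses.
loss = training_step(torch.randn(8, J * 3), torch.randn(8, J * 2))
```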

* Accepted to 2023 International Conference on Computer Vision Workshop (Analysis and Modeling of Faces and Gestures) 

You Never Get a Second Chance To Make a Good First Impression: Seeding Active Learning for 3D Semantic Segmentation

Apr 23, 2023
Nermin Samet, Oriane Siméoni, Gilles Puy, Georgy Ponimatkin, Renaud Marlet, Vincent Lepetit

We propose SeedAL, a method to seed active learning for efficient annotation of 3D point clouds for semantic segmentation. Active Learning (AL) iteratively selects relevant data fractions to annotate within a given budget, but requires a first fraction of the dataset (a 'seed') to be already annotated to estimate the benefit of annotating other data fractions. We first show that the choice of the seed can significantly affect the performance of many AL methods. We then propose a method for automatically constructing a seed that will ensure good performance for AL. Assuming that images of the point clouds are available, which is common, our method relies on powerful unsupervised image features to measure the diversity of the point clouds. It selects the point clouds for the seed by optimizing the diversity under an annotation budget, which can be done by solving a linear optimization problem. Our experiments demonstrate the effectiveness of our approach compared to random seeding and existing methods on both the S3DIS and SemanticKITTI datasets. Code is available at https://github.com/nerminsamet/seedal.
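
The sketch below conveys the general flavour of seeding under a budget from per-scan image features: it uses a simple greedy farthest-first selection as a stand-in for the linear optimization actually solved by SeedAL, and assumes that one aggregated unsupervised image feature per point cloud has been computed beforehand.

```python
import torch

def greedy_diverse_seed(scan_feats, costs, budget):
    """Greedy farthest-first seed selection (a stand-in for SeedAL's linear program).

    scan_feats: (S, C) one aggregated image feature per point cloud / scan (assumed).
    costs     : (S,) annotation cost of each scan (e.g., its number of points).
    budget    : total annotation budget.
    """
    feats = torch.nn.functional.normalize(scan_feats, dim=-1)
    selected, spent = [0], costs[0].item()            # start from an arbitrary scan
    while True:
        # Distance of every scan to its closest already-selected scan.
        d = 1.0 - (feats @ feats[selected].t()).max(dim=1).values
        d[selected] = -1.0                            # never reselect a scan
        d[costs + spent > budget] = -1.0              # respect the annotation budget
        best = int(d.argmax())
        if d[best] < 0:                               # nothing affordable or new remains
            break
        selected.append(best)
        spent += costs[best].item()
    return selected

# Toy usage: pick a diverse seed among 50 scans of random features.
seed = greedy_diverse_seed(torch.randn(50, 32), torch.randint(1000, 5000, (50,)).float(), 20000)
```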

SALUDA: Surface-based Automotive Lidar Unsupervised Domain Adaptation

Apr 06, 2023
Bjoern Michele, Alexandre Boulch, Gilles Puy, Tuan-Hung Vu, Renaud Marlet, Nicolas Courty

Learning models on one labeled dataset that generalize well on another domain is a difficult task, as several shifts may occur between the data domains. This is notably the case for lidar data, for which models can exhibit large performance discrepancies due, for instance, to different lidar patterns or changes in acquisition conditions. This paper addresses the corresponding Unsupervised Domain Adaptation (UDA) task for semantic segmentation. To mitigate this problem, we introduce an unsupervised auxiliary task of learning an implicit underlying surface representation simultaneously on source and target data. As both domains share the same latent representation, the model is forced to accommodate discrepancies between the two sources of data. This novel strategy differs from classical minimization of statistical divergences or lidar-specific state-of-the-art domain adaptation techniques. Our experiments demonstrate that our method achieves better performance than the current state of the art in synthetic-to-real and real-to-real scenarios.
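
A minimal sketch of the shared auxiliary objective could look like the function below, assuming the shared backbone feeds both a segmentation head (supervised on source labels only) and an implicit-surface head trained with an occupancy-style reconstruction loss on source and target; the heads, the loss choices, and the weighting are illustrative assumptions rather than the exact SALUDA formulation.

```python
import torch.nn.functional as F

def uda_loss(backbone, seg_head, surf_head, src, tgt, lam=1.0):
    """Supervised segmentation on source + unsupervised surface reconstruction on both domains.

    src/tgt are dicts with 'points', 'queries', 'occ' (and 'labels' for src), all
    assumed precomputed; surf_head is assumed to predict the occupancy of query
    points from the shared latent representation.
    """
    z_src = backbone(src["points"])
    z_tgt = backbone(tgt["points"])
    loss_seg = F.cross_entropy(seg_head(z_src), src["labels"])     # labels on source only
    loss_surf = (
        F.binary_cross_entropy_with_logits(surf_head(z_src, src["queries"]), src["occ"])
        + F.binary_cross_entropy_with_logits(surf_head(z_tgt, tgt["queries"]), tgt["occ"])
    )                                                              # shared auxiliary task
    return loss_seg + lam * loss_surf
```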

* Project repository: github.com/valeoai/SALUDA 

A Survey and Benchmark of Automatic Surface Reconstruction from Point Clouds

Jan 31, 2023
Raphael Sulzer, Loic Landrieu, Renaud Marlet, Bruno Vallet

We survey and benchmark traditional and novel learning-based algorithms that address the problem of surface reconstruction from point clouds. Surface reconstruction from point clouds is particularly challenging when applied to real-world acquisitions, due to noise, outliers, non-uniform sampling and missing data. Traditionally, different handcrafted priors of the input points or the output surface have been proposed to make the problem more tractable. However, hyperparameter tuning for adjusting priors to different acquisition defects can be a tedious task. To this end, the deep learning community has recently addressed the surface reconstruction problem. In contrast to traditional approaches, deep surface reconstruction methods can learn priors directly from a training set of point clouds and corresponding true surfaces. In our survey, we detail how different handcrafted and learned priors affect the robustness of methods to defect-laden input and their capability to generate geometrically and topologically accurate reconstructions. In our benchmark, we evaluate the reconstructions of several traditional and learning-based methods on the same grounds. We show that learning-based methods can generalize to unseen shape categories, but their training and test sets must share the same point cloud characteristics. We also provide the code and data to compete in our benchmark and to further stimulate the development of learning-based surface reconstruction: https://github.com/raphaelsulzer/dsr-benchmark.

RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving

Jan 24, 2023
Angelika Ando, Spyros Gidaris, Andrei Bursuc, Gilles Puy, Alexandre Boulch, Renaud Marlet

Casting semantic segmentation of outdoor LiDAR point clouds as a 2D problem, e.g., via range projection, is an effective and popular approach. These projection-based methods usually benefit from fast computations and, when combined with techniques which use other point cloud representations, achieve state-of-the-art results. Today, projection-based methods leverage 2D CNNs but recent advances in computer vision show that vision transformers (ViTs) have achieved state-of-the-art results in many image-based benchmarks. In this work, we question whether projection-based methods for 3D semantic segmentation can benefit from these latest improvements on ViTs. We answer positively but only after combining them with three key ingredients: (a) ViTs are notoriously hard to train and require a lot of training data to learn powerful representations. By preserving the same backbone architecture as for RGB images, we can exploit the knowledge from long training on large image collections that are much cheaper to acquire and annotate than point clouds. We reach our best results with ViTs pre-trained on large image datasets. (b) We compensate for ViTs' lack of inductive bias by substituting a tailored convolutional stem for the classical linear embedding layer. (c) We refine pixel-wise predictions with a convolutional decoder and a skip connection from the convolutional stem to combine low-level but fine-grained features of the convolutional stem with the high-level but coarse predictions of the ViT encoder. With these ingredients, we show that our method, called RangeViT, outperforms existing projection-based methods on nuScenes and SemanticKITTI. We provide the implementation code at https://github.com/valeoai/rangevit.
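
Ingredient (b) can be illustrated with a small convolutional stem that replaces a ViT's linear patch embedding on the range image; the channel sizes, strides, and input resolution below are assumptions for illustration, and the ViT encoder consuming the tokens is assumed to be a standard pre-trained image model.

```python
import torch
import torch.nn as nn

class ConvStem(nn.Module):
    """Convolutional stem producing ViT tokens from a range image (illustrative)."""
    def __init__(self, in_ch=5, embed_dim=384, patch=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.GELU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.BatchNorm2d(128), nn.GELU(),
            nn.Conv2d(128, embed_dim, kernel_size=patch // 4, stride=patch // 4),
        )

    def forward(self, x):                       # x: (B, in_ch, H, W) range image
        x = self.net(x)                         # (B, embed_dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)     # (B, num_tokens, embed_dim) ViT tokens

# Toy usage with an assumed 32x384 range projection and 5 input channels.
tokens = ConvStem()(torch.randn(1, 5, 32, 384))
```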

* Code at https://github.com/valeoai/rangevit 

Using a Waffle Iron for Automotive Point Cloud Semantic Segmentation

Jan 24, 2023
Gilles Puy, Alexandre Boulch, Renaud Marlet

Semantic segmentation of point clouds in autonomous driving datasets requires techniques that can process large numbers of points over large fields of view. Today, most deep networks designed for this task exploit 3D sparse convolutions to reduce memory and computational loads. The best methods then exploit specificities of rotating lidar sampling patterns to further improve performance, e.g., cylindrical voxels, or range images (for feature fusion from multiple point cloud representations). In contrast, we show that one can build a well-performing point-based backbone free of these specialized tools. This backbone, WaffleIron, relies heavily on generic MLPs and dense 2D convolutions, making it easy to implement, and has just a few parameters that are easy to tune. Despite its simplicity, our experiments on SemanticKITTI and nuScenes show that WaffleIron competes with the best methods designed specifically for these autonomous driving datasets. Hence, WaffleIron is a strong, easy-to-implement baseline for semantic segmentation of sparse outdoor point clouds.
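
The following sketch gives the rough flavour of such a block: point features are flattened onto a dense 2D grid, mixed with ordinary 2D convolutions, gathered back to the points, and passed through a per-point MLP with a residual connection; the grid size, the projection (assumed precomputed as a flattened cell index per point), and the layer sizes are assumptions rather than the exact WaffleIron architecture.

```python
import torch
import torch.nn as nn

class WaffleLikeBlock(nn.Module):
    """Project points to a 2D grid, mix with dense convs, scatter back, then MLP (illustrative)."""
    def __init__(self, channels=128, grid=64):
        super().__init__()
        self.grid = grid
        self.conv = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(channels, channels, 3, padding=1))
        self.mlp = nn.Sequential(nn.Linear(channels, channels), nn.ReLU(),
                                 nn.Linear(channels, channels))

    def forward(self, feats, cell_idx):
        # feats: (N, C) point features; cell_idx: (N,) flattened 2D cell of each point.
        C, G = feats.shape[1], self.grid
        plane = torch.zeros(G * G, C).index_add_(0, cell_idx, feats)   # sum points into cells
        plane = self.conv(plane.t().reshape(1, C, G, G))               # dense 2D mixing
        back = plane.reshape(C, G * G).t()[cell_idx]                   # gather back to points
        return feats + self.mlp(back)                                  # residual + per-point MLP

# Toy usage with 1000 random points assigned to random cells.
out = WaffleLikeBlock()(torch.randn(1000, 128), torch.randint(0, 64 * 64, (1000,)))
```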

ALSO: Automotive Lidar Self-supervision by Occupancy estimation

Dec 13, 2022
Alexandre Boulch, Corentin Sautier, Björn Michele, Gilles Puy, Renaud Marlet

We propose a new self-supervised method for pre-training the backbone of deep perception models operating on point clouds. The core idea is to train the model on a pretext task, the reconstruction of the surface on which the 3D points are sampled, and to use the underlying latent vectors as input to the perception head. The intuition is that if the network is able to reconstruct the scene surface, given only sparse input points, then it probably also captures some fragments of semantic information that can be used to boost an actual perception task. This principle has a very simple formulation, which makes it both easy to implement and widely applicable to a large range of 3D sensors and deep networks performing semantic segmentation or object detection. In fact, it supports a single-stream pipeline, as opposed to most contrastive learning approaches, allowing training on limited resources. We conducted extensive experiments on various autonomous driving datasets, involving very different kinds of lidars, for both semantic segmentation and object detection. The results show the effectiveness of our method in learning useful representations without any annotation, compared to existing approaches. Code is available at https://github.com/valeoai/ALSO.
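
A minimal sketch of the occupancy pretext could look as follows: the backbone's latent point features feed a small head that predicts whether query points sampled around the scan are occupied, trained with a binary cross-entropy loss; how queries and occupancy labels are generated, and the nearest-point conditioning used here, are simplifying assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OccupancyHead(nn.Module):
    """Predicts occupancy of a query point from the nearest point's latent feature (illustrative)."""
    def __init__(self, channels=96):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(channels + 3, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, point_feats, points, queries):
        # Associate each query with its nearest lidar point (a simplistic assumption).
        nn_idx = torch.cdist(queries, points).argmin(dim=1)          # (Q,)
        offset = queries - points[nn_idx]                            # query position relative to that point
        return self.net(torch.cat([point_feats[nn_idx], offset], dim=1)).squeeze(1)

def pretext_loss(head, point_feats, points, queries, occ_labels):
    """Binary occupancy loss used as the self-supervised pre-training objective."""
    return F.binary_cross_entropy_with_logits(head(point_feats, points, queries), occ_labels)

# Toy usage with random points, features, queries and labels.
loss = pretext_loss(OccupancyHead(), torch.randn(500, 96), torch.randn(500, 3),
                    torch.randn(200, 3), torch.randint(0, 2, (200,)).float())
```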

A Simple and Powerful Global Optimization for Unsupervised Video Object Segmentation

Sep 19, 2022
Georgy Ponimatkin, Nermin Samet, Yang Xiao, Yuming Du, Renaud Marlet, Vincent Lepetit

We propose a simple, yet powerful approach for unsupervised object segmentation in videos. We introduce an objective function whose minimum represents the mask of the main salient object over the input sequence. It only relies on independent image features and optical flows, which can be obtained using off-the-shelf self-supervised methods. It scales with the length of the sequence with no need for superpixels or sparsification, and it generalizes to different datasets without any specific training. This objective function can actually be derived from a form of spectral clustering applied to the entire video. Our method achieves on-par performance with the state of the art on standard benchmarks (DAVIS2016, SegTrack-v2, FBMS59), while being conceptually and practically much simpler. Code is available at https://ponimatkin.github.io/ssl-vos.
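
As a hedged sketch of the spectral-clustering view, the snippet below builds an affinity matrix from per-patch features (assumed to already combine appearance and optical-flow information for the whole video) and uses the second-smallest eigenvector of the graph Laplacian to split foreground from background; the affinity construction and the median thresholding are illustrative simplifications of the paper's actual objective.

```python
import torch

def spectral_fg_mask(feats):
    """Foreground/background split from the Fiedler vector of a feature-affinity graph.

    feats: (T*P, C) features of all patches of the whole video, assumed to already
    combine appearance and optical-flow information.
    """
    f = torch.nn.functional.normalize(feats, dim=-1)
    W = (f @ f.t()).clamp(min=0)                       # non-negative pairwise affinities
    D = torch.diag(W.sum(dim=1))
    L = D - W                                          # unnormalized graph Laplacian
    eigvals, eigvecs = torch.linalg.eigh(L)
    fiedler = eigvecs[:, 1]                            # second-smallest eigenvector
    return fiedler > fiedler.median()                  # one side = salient object (up to sign)

# Toy example with random features for 300 patches.
mask = spectral_fg_mask(torch.randn(300, 64))
```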
