Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Vincent Lepetit

Learning to Better Segment Objects from Unseen Classes with Unlabeled Videos

Apr 25, 2021

Yuming Du, Yang Xiao, Vincent Lepetit

Figure 1 for Learning to Better Segment Objects from Unseen Classes with Unlabeled Videos

Figure 2 for Learning to Better Segment Objects from Unseen Classes with Unlabeled Videos

Figure 3 for Learning to Better Segment Objects from Unseen Classes with Unlabeled Videos

Figure 4 for Learning to Better Segment Objects from Unseen Classes with Unlabeled Videos

Abstract:The ability to localize and segment objects from unseen classes would open the door to new applications, such as autonomous object learning in active vision. Nonetheless, improving the performance on unseen classes requires additional training data, while manually annotating the objects of the unseen classes can be labor-extensive and expensive. In this paper, we explore the use of unlabeled video sequences to automatically generate training data for objects of unseen classes. It is in principle possible to apply existing video segmentation methods to unlabeled videos and automatically obtain object masks, which can then be used as a training set even for classes with no manual labels available. However, our experiments show that these methods do not perform well enough for this purpose. We therefore introduce a Bayesian method that is specifically designed to automatically create such a training set: Our method starts from a set of object proposals and relies on (non-realistic) analysis-by-synthesis to select the correct ones by performing an efficient optimization over all the frames simultaneously. Through extensive experiments, we show that our method can generate a high-quality training set which significantly boosts the performance of segmenting objects of unseen classes. We thus believe that our method could open the door for open-world instance segmentation using abundant Internet videos.

* 18 pages, 14 figures. See project page https://dulucas.github.io/Homepage/gbopt/

Via

Access Paper or Ask Questions

Back to the Feature: Learning Robust Camera Localization from Pixels to Pose

Apr 07, 2021

Paul-Edouard Sarlin, Ajaykumar Unagar, Måns Larsson, Hugo Germain, Carl Toft, Viktor Larsson, Marc Pollefeys, Vincent Lepetit, Lars Hammarstrand, Fredrik Kahl(+1 more)

Figure 1 for Back to the Feature: Learning Robust Camera Localization from Pixels to Pose

Figure 2 for Back to the Feature: Learning Robust Camera Localization from Pixels to Pose

Figure 3 for Back to the Feature: Learning Robust Camera Localization from Pixels to Pose

Figure 4 for Back to the Feature: Learning Robust Camera Localization from Pixels to Pose

Abstract:Camera pose estimation in known scenes is a 3D geometry task recently tackled by multiple learning algorithms. Many regress precise geometric quantities, like poses or 3D points, from an input image. This either fails to generalize to new viewpoints or ties the model parameters to a specific scene. In this paper, we go Back to the Feature: we argue that deep networks should focus on learning robust and invariant visual features, while the geometric estimation should be left to principled algorithms. We introduce PixLoc, a scene-agnostic neural network that estimates an accurate 6-DoF pose from an image and a 3D model. Our approach is based on the direct alignment of multiscale deep features, casting camera localization as metric learning. PixLoc learns strong data priors by end-to-end training from pixels to pose and exhibits exceptional generalization to new scenes by separating model parameters and scene geometry. The system can localize in large environments given coarse pose priors but also improve the accuracy of sparse feature matching by jointly refining keypoints and poses with little overhead. The code will be publicly available at https://github.com/cvg/pixloc.

* Accepted to CVPR 2021

Via

Access Paper or Ask Questions

Monte Carlo Scene Search for 3D Scene Understanding

Mar 30, 2021

Shreyas Hampali, Sinisa Stekovic, Sayan Deb Sarkar, Chetan Srinivasa Kumar, Friedrich Fraundorfer, Vincent Lepetit

Figure 1 for Monte Carlo Scene Search for 3D Scene Understanding

Figure 2 for Monte Carlo Scene Search for 3D Scene Understanding

Figure 3 for Monte Carlo Scene Search for 3D Scene Understanding

Figure 4 for Monte Carlo Scene Search for 3D Scene Understanding

Abstract:We explore how a general AI algorithm can be used for 3D scene understanding to reduce the need for training data. More exactly, we propose a modification of the Monte Carlo Tree Search (MCTS) algorithm to retrieve objects and room layouts from noisy RGB-D scans. While MCTS was developed as a game-playing algorithm, we show it can also be used for complex perception problems. Our adapted MCTS algorithm has few easy-to-tune hyperparameters and can optimise general losses. We use it to optimise the posterior probability of objects and room layout hypotheses given the RGB-D data. This results in an analysis-by-synthesis approach that explores the solution space by rendering the current solution and comparing it to the RGB-D observations. To perform this exploration even more efficiently, we propose simple changes to the standard MCTS' tree construction and exploration policy. We demonstrate our approach on the ScanNet dataset. Our method often retrieves configurations that are better than some manual annotations, especially on layouts.

* To be presented at CVPR 2021

Via

Access Paper or Ask Questions

MonteFloor: Extending MCTS for Reconstructing Accurate Large-Scale Floor Plans

Mar 20, 2021

Sinisa Stekovic, Mahdi Rad, Friedrich Fraundorfer, Vincent Lepetit

Figure 1 for MonteFloor: Extending MCTS for Reconstructing Accurate Large-Scale Floor Plans

Figure 2 for MonteFloor: Extending MCTS for Reconstructing Accurate Large-Scale Floor Plans

Figure 3 for MonteFloor: Extending MCTS for Reconstructing Accurate Large-Scale Floor Plans

Figure 4 for MonteFloor: Extending MCTS for Reconstructing Accurate Large-Scale Floor Plans

Abstract:We propose a novel method for reconstructing floor plans from noisy 3D point clouds. Our main contribution is a principled approach that relies on the Monte Carlo Tree Search (MCTS) algorithm to maximize a suitable objective function efficiently despite the complexity of the problem. Like previous work, we first project the input point cloud to a top view to create a density map and extract room proposals from it. Our method selects and optimizes the polygonal shapes of these room proposals jointly to fit the density map and outputs an accurate vectorized floor map even for large complex scenes. To do this, we adapted MCTS, an algorithm originally designed to learn to play games, to select the room proposals by maximizing an objective function combining the fitness with the density map as predicted by a deep network and regularizing terms on the room shapes. We also introduce a refinement step to MCTS that adjusts the shape of the room proposals. For this step, we propose a novel differentiable method for rendering the polygonal shapes of these proposals. We evaluate our method on the recent and challenging Structured3D and Floor-SP datasets and show a significant improvement over the state-of-the-art, without imposing any hard constraints nor assumptions on the floor plan configurations.

Via

Access Paper or Ask Questions

Neural Reprojection Error: Merging Feature Learning and Camera Pose Estimation

Mar 12, 2021

Hugo Germain, Vincent Lepetit, Guillaume Bourmaud

Figure 1 for Neural Reprojection Error: Merging Feature Learning and Camera Pose Estimation

Figure 2 for Neural Reprojection Error: Merging Feature Learning and Camera Pose Estimation

Figure 3 for Neural Reprojection Error: Merging Feature Learning and Camera Pose Estimation

Figure 4 for Neural Reprojection Error: Merging Feature Learning and Camera Pose Estimation

Abstract:Absolute camera pose estimation is usually addressed by sequentially solving two distinct subproblems: First a feature matching problem that seeks to establish putative 2D-3D correspondences, and then a Perspective-n-Point problem that minimizes, with respect to the camera pose, the sum of so-called Reprojection Errors (RE). We argue that generating putative 2D-3D correspondences 1) leads to an important loss of information that needs to be compensated as far as possible, within RE, through the choice of a robust loss and the tuning of its hyperparameters and 2) may lead to an RE that conveys erroneous data to the pose estimator. In this paper, we introduce the Neural Reprojection Error (NRE) as a substitute for RE. NRE allows to rethink the camera pose estimation problem by merging it with the feature learning problem, hence leveraging richer information than 2D-3D correspondences and eliminating the need for choosing a robust loss and its hyperparameters. Thus NRE can be used as training loss to learn image descriptors tailored for pose estimation. We also propose a coarse-to-fine optimization method able to very efficiently minimize a sum of NRE terms with respect to the camera pose. We experimentally demonstrate that NRE is a good substitute for RE as it significantly improves both the robustness and the accuracy of the camera pose estimate while being computationally and memory highly efficient. From a broader point of view, we believe this new way of merging deep learning and 3D geometry may be useful in other computer vision applications.

Via

Access Paper or Ask Questions

3D Object Detection and Pose Estimation of Unseen Objects in Color Images with Local Surface Embeddings

Oct 08, 2020

Giorgia Pitteri, Aurélie Bugeau, Slobodan Ilic, Vincent Lepetit

Figure 1 for 3D Object Detection and Pose Estimation of Unseen Objects in Color Images with Local Surface Embeddings

Figure 2 for 3D Object Detection and Pose Estimation of Unseen Objects in Color Images with Local Surface Embeddings

Figure 3 for 3D Object Detection and Pose Estimation of Unseen Objects in Color Images with Local Surface Embeddings

Figure 4 for 3D Object Detection and Pose Estimation of Unseen Objects in Color Images with Local Surface Embeddings

Abstract:We present an approach for detecting and estimating the 3D poses of objects in images that requires only an untextured CAD model and no training phase for new objects. Our approach combines Deep Learning and 3D geometry: It relies on an embedding of local 3D geometry to match the CAD models to the input images. For points at the surface of objects, this embedding can be computed directly from the CAD model; for image locations, we learn to predict it from the image itself. This establishes correspondences between 3D points on the CAD model and 2D locations of the input images. However, many of these correspondences are ambiguous as many points may have similar local geometries. We show that we can use Mask-RCNN in a class-agnostic way to detect the new objects without retraining and thus drastically limit the number of possible correspondences. We can then robustly estimate a 3D pose from these discriminative correspondences using a RANSAC- like algorithm. We demonstrate the performance of this approach on the T-LESS dataset, by using a small number of objects to learn the embedding and testing it on the other objects. Our experiments show that our method is on par or better than previous methods.

Via

Access Paper or Ask Questions

Geometric Correspondence Fields: Learned Differentiable Rendering for 3D Pose Refinement in the Wild

Jul 17, 2020

Alexander Grabner, Yaming Wang, Peizhao Zhang, Peihong Guo, Tong Xiao, Peter Vajda, Peter M. Roth, Vincent Lepetit

Figure 1 for Geometric Correspondence Fields: Learned Differentiable Rendering for 3D Pose Refinement in the Wild

Figure 2 for Geometric Correspondence Fields: Learned Differentiable Rendering for 3D Pose Refinement in the Wild

Figure 3 for Geometric Correspondence Fields: Learned Differentiable Rendering for 3D Pose Refinement in the Wild

Figure 4 for Geometric Correspondence Fields: Learned Differentiable Rendering for 3D Pose Refinement in the Wild

Abstract:We present a novel 3D pose refinement approach based on differentiable rendering for objects of arbitrary categories in the wild. In contrast to previous methods, we make two main contributions: First, instead of comparing real-world images and synthetic renderings in the RGB or mask space, we compare them in a feature space optimized for 3D pose refinement. Second, we introduce a novel differentiable renderer that learns to approximate the rasterization backward pass from data instead of relying on a hand-crafted algorithm. For this purpose, we predict deep cross-domain correspondences between RGB images and 3D model renderings in the form of what we call geometric correspondence fields. These correspondence fields serve as pixel-level gradients which are analytically propagated backward through the rendering pipeline to perform a gradient-based optimization directly on the 3D pose. In this way, we precisely align 3D models to objects in RGB images which results in significantly improved 3D pose estimates. We evaluate our approach on the challenging Pix3D dataset and achieve up to 55% relative improvement compared to state-of-the-art refinement methods in multiple metrics.

* Accepted to European Conference on Computer Vision (ECCV) 2020

Via

Access Paper or Ask Questions

Recent Advances in 3D Object and Hand Pose Estimation

Jun 10, 2020

Vincent Lepetit

Figure 1 for Recent Advances in 3D Object and Hand Pose Estimation

Figure 2 for Recent Advances in 3D Object and Hand Pose Estimation

Figure 3 for Recent Advances in 3D Object and Hand Pose Estimation

Figure 4 for Recent Advances in 3D Object and Hand Pose Estimation

Abstract:3D object and hand pose estimation have huge potentials for Augmented Reality, to enable tangible interfaces, natural interfaces, and blurring the boundaries between the real and virtual worlds. In this chapter, we present the recent developments for 3D object and hand pose estimation using cameras, and discuss their abilities and limitations and the possible future development of the field.

Via

Access Paper or Ask Questions

ALCN: Adaptive Local Contrast Normalization

Apr 15, 2020

Mahdi Rad, Peter M. Roth, Vincent Lepetit

Figure 1 for ALCN: Adaptive Local Contrast Normalization

Figure 2 for ALCN: Adaptive Local Contrast Normalization

Figure 3 for ALCN: Adaptive Local Contrast Normalization

Figure 4 for ALCN: Adaptive Local Contrast Normalization

Abstract:To make Robotics and Augmented Reality applications robust to illumination changes, the current trend is to train a Deep Network with training images captured under many different lighting conditions. Unfortunately, creating such a training set is a very unwieldy and complex task. We therefore propose a novel illumination normalization method that can easily be used for different problems with challenging illumination conditions. Our preliminary experiments show that among current normalization methods, the Difference-of Gaussians method remains a very good baseline, and we introduce a novel illumination normalization model that generalizes it. Our key insight is then that the normalization parameters should depend on the input image, and we aim to train a Convolutional Neural Network to predict these parameters from the input image. This, however, cannot be done in a supervised manner, as the optimal parameters are not known a priori. We thus designed a method to train this network jointly with another network that aims to recognize objects under different illuminations: The latter network performs well when the former network predicts good values for the normalization parameters. We show that our method significantly outperforms standard normalization methods and would also be appear to be universal since it does not have to be re-trained for each new application. Our method improves the robustness to light changes of state-of-the-art 3D object detection and face recognition methods.

* This version corresponds to the pre-print of the paper accepted for Computer Vision and Image Understanding (CVIU). arXiv admin note: substantial text overlap with arXiv:1708.09633

Via

Access Paper or Ask Questions

S2DNet: Learning Accurate Correspondences for Sparse-to-Dense Feature Matching

Apr 03, 2020

Hugo Germain, Guillaume Bourmaud, Vincent Lepetit

Figure 1 for S2DNet: Learning Accurate Correspondences for Sparse-to-Dense Feature Matching

Figure 2 for S2DNet: Learning Accurate Correspondences for Sparse-to-Dense Feature Matching

Figure 3 for S2DNet: Learning Accurate Correspondences for Sparse-to-Dense Feature Matching

Figure 4 for S2DNet: Learning Accurate Correspondences for Sparse-to-Dense Feature Matching

Abstract:Establishing robust and accurate correspondences is a fundamental backbone to many computer vision algorithms. While recent learning-based feature matching methods have shown promising results in providing robust correspondences under challenging conditions, they are often limited in terms of precision. In this paper, we introduce S2DNet, a novel feature matching pipeline, designed and trained to efficiently establish both robust and accurate correspondences. By leveraging a sparse-to-dense matching paradigm, we cast the correspondence learning problem as a supervised classification task to learn to output highly peaked correspondence maps. We show that S2DNet achieves state-of-the-art results on the HPatches benchmark, as well as on several long-term visual localization datasets.

Via

Access Paper or Ask Questions