Mathieu Salzmann

CVLab EPFL Switzerland

Self-supervised Human Detection and Segmentation via Multi-view Consensus

Dec 09, 2020

Isinsu Katircioglu, Helge Rhodin, Jörg Spörri, Mathieu Salzmann, Pascal Fua

Abstract: Self-supervised detection and segmentation of foreground objects in complex scenes is gaining attention as their fully-supervised counterparts require overly large amounts of annotated data to deliver sufficient accuracy in domain-specific applications. However, existing self-supervised approaches predominantly rely on restrictive assumptions on appearance and motion, which precludes their use in scenes that depict highly dynamic activities or involve camera motion. To mitigate this problem, we propose using a multi-camera framework in which geometric constraints are embedded in the form of multi-view consistency during training via coarse 3D localization in a voxel grid and fine-grained offset regression. In this manner, we learn a joint distribution of proposals over multiple views. At inference time, our method operates on single RGB images. We show that our approach outperforms state-of-the-art self-supervised person detection and segmentation techniques on images that visually depart from those of standard benchmarks, as well as on those of the classical Human3.6M dataset.

Long Term Motion Prediction Using Keyposes

Dec 08, 2020

Sena Kiciroglu, Wei Wang, Mathieu Salzmann, Pascal Fua

Abstract: Long-term human motion prediction is an essential component in safety-critical applications, such as human-robot interaction and autonomous driving. We argue that, to achieve long-term forecasting, predicting human pose at every time instant is unnecessary because human motion follows patterns that are well-represented by a few essential poses in the sequence. We call such poses "keyposes", and approximate complex motions by linearly interpolating between subsequent keyposes. We show that learning the sequence of such keyposes allows us to predict very long-term motion, up to 5 seconds in the future. In particular, our predictions are much more realistic and better preserve the motion dynamics than those obtained by state-of-the-art methods. Furthermore, our approach models the future keyposes probabilistically, which, during inference, lets us generate diverse future motions via sampling.

* See supplementary video at: https://youtu.be/GNSrwdl80GI
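
The abstract's idea of approximating a motion by linearly interpolating between subsequent keyposes can be illustrated with a minimal sketch. The function name `interpolate_keyposes`, the flattened pose layout, and the per-segment frame counts in `durations` are assumptions made for this illustration; the abstract does not specify how poses or timings are represented, and this is not the authors' implementation.

```python
import numpy as np


def interpolate_keyposes(keyposes, durations):
    """Linearly interpolate between consecutive keyposes (illustrative only).

    keyposes:  array of shape (K, D), one flattened pose per row (assumed layout).
    durations: K-1 positive integers; durations[i] is the number of frames
               from keypose i up to, but excluding, keypose i+1.
    Returns an array of shape (sum(durations) + 1, D).
    """
    keyposes = np.asarray(keyposes, dtype=float)
    if len(durations) != len(keyposes) - 1:
        raise ValueError("need exactly one duration per pair of keyposes")
    segments = []
    for start, end, n in zip(keyposes[:-1], keyposes[1:], durations):
        # Blend weights 0, 1/n, ..., (n-1)/n as a column for broadcasting.
        t = np.arange(n)[:, None] / n
        segments.append((1.0 - t) * start + t * end)
    # Close the sequence with the final keypose itself.
    segments.append(keyposes[-1:])
    return np.concatenate(segments, axis=0)


# Three toy 2-D "poses", reached after 2 and then 4 frames.
motion = interpolate_keyposes([[0, 0], [2, 4], [2, 0]], durations=[2, 4])
```

In this toy setting, a predictor only has to output the keyposes and their durations; the dense frame sequence follows from the interpolation.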

Unsupervised Learning on Monocular Videos for 3D Human Pose Estimation

Dec 02, 2020

Sina Honari, Victor Constantin, Helge Rhodin, Mathieu Salzmann, Pascal Fua

Abstract: In this paper, we introduce an unsupervised feature extraction method that exploits contrastive self-supervised (CSS) learning to extract rich latent vectors from single-view videos. Instead of simply treating the latent features of nearby frames as positive pairs and those of temporally-distant ones as negative pairs as in other CSS approaches, we explicitly separate each latent vector into a time-variant component and a time-invariant one. We then show that applying CSS only to the time-variant features, while also reconstructing the input and encouraging a gradual transition between nearby and distant features, yields a rich latent space that is well-suited for human pose estimation. Our approach outperforms other unsupervised single-view methods and matches the performance of multi-view techniques.

Counting People by Estimating People Flows

Dec 01, 2020

Weizhe Liu, Mathieu Salzmann, Pascal Fua

Abstract: Modern methods for counting people in crowded scenes rely on deep networks to estimate people densities in individual images. As such, only very few take advantage of temporal consistency in video sequences, and those that do only impose weak smoothness constraints across consecutive frames. In this paper, we advocate estimating people flows across image locations between consecutive images and inferring the people densities from these flows instead of directly regressing them. This enables us to impose much stronger constraints encoding the conservation of the number of people. As a result, it significantly boosts performance without requiring a more complex architecture. Furthermore, it allows us to exploit the correlation between people flow and optical flow to further improve the results. We also show that leveraging people conservation constraints in both a spatial and temporal manner makes it possible to train a deep crowd counting model in an active learning setting with far fewer annotations. This significantly reduces the annotation cost while still leading to performance similar to that of the fully supervised case.

* Extension of Our ECCV 2020 Paper: arXiv:1911.10782
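
To make concrete how densities can be inferred from flows rather than regressed directly, here is a minimal sketch on a coarse grid. The 9-move neighbourhood in `OFFSETS`, the `flows[k, i, j]` tensor layout, and the treatment of flows that point outside the grid are simplifying assumptions for illustration, not details taken from the abstract.

```python
import numpy as np

# Assumed move set for this sketch: between two consecutive frames a person
# either stays in the same grid cell or steps to one of its 8 neighbours.
OFFSETS = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)]


def densities_from_flows(flows):
    """Derive people densities at t-1 and t from people flows.

    flows[k, i, j] is the number of people leaving cell (i, j) at time t-1
    along OFFSETS[k]. Flows pointing outside the grid are treated as people
    leaving the scene.
    """
    flows = np.asarray(flows, dtype=float)
    _, h, w = flows.shape
    # Everyone present at t-1 leaves their cell along exactly one move.
    prev_density = flows.sum(axis=0)
    # Everyone present at t arrived along exactly one move.
    next_density = np.zeros((h, w))
    for k, (di, dj) in enumerate(OFFSETS):
        src_i = slice(max(0, -di), min(h, h - di))
        src_j = slice(max(0, -dj), min(w, w - dj))
        dst_i = slice(max(0, di), min(h, h + di))
        dst_j = slice(max(0, dj), min(w, w + dj))
        next_density[dst_i, dst_j] += flows[k, src_i, src_j]
    return prev_density, next_density


# Toy example: two people step right from the centre cell, one person stays.
flows = np.zeros((len(OFFSETS), 3, 3))
flows[OFFSETS.index((0, 1)), 1, 1] = 2.0
flows[OFFSETS.index((0, 0)), 0, 0] = 1.0
prev_density, next_density = densities_from_flows(flows)
```

Because both density maps are sums of the same flow values, the total count is conserved by construction whenever no flow crosses the grid border, which is the kind of constraint the abstract refers to.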

PCLs: Geometry-aware Neural Reconstruction of 3D Pose with Perspective Crop Layers

Nov 27, 2020

Frank Yu, Mathieu Salzmann, Pascal Fua, Helge Rhodin

Abstract: Local processing is an essential feature of CNNs and other neural network architectures - it is one of the reasons why they work so well on images where relevant information is, to a large extent, local. However, perspective effects stemming from the projection in a conventional camera vary for different global positions in the image. We introduce Perspective Crop Layers (PCLs) - a form of perspective crop of the region of interest based on the camera geometry - and show that accounting for the perspective consistently improves the accuracy of state-of-the-art 3D pose reconstruction methods. PCLs are modular neural network layers, which, when inserted into existing CNN and MLP architectures, deterministically remove the location-dependent perspective effects while leaving end-to-end training and the number of parameters of the underlying neural network unchanged. We demonstrate that PCL leads to improved 3D human pose reconstruction accuracy for CNN architectures that use cropping operations, such as spatial transformer networks (STN), and, somewhat surprisingly, MLPs used for 2D-to-3D keypoint lifting. Our conclusion is that it is important to utilize camera calibration information when available, for classical and deep-learning-based computer vision alike. PCL offers an easy way to improve the accuracy of existing 3D reconstruction networks by making them geometry-aware.

A Closed-Form Solution to Local Non-Rigid Structure-from-Motion

Nov 23, 2020

Shaifali Parashar, Yuxuan Long, Mathieu Salzmann, Pascal Fua

Abstract: A recent trend in Non-Rigid Structure-from-Motion (NRSfM) is to express local, differential constraints between pairs of images, from which the surface normal at any point can be obtained by solving a system of polynomial equations. The systems of equations derived in previous work, however, are of high degree, having up to five real solutions, thus requiring a computationally expensive strategy to select a unique solution. Furthermore, they suffer from degeneracies that make the resulting estimates unreliable, without any mechanism to identify this situation. In this paper, we show that, under widely applicable assumptions, we can derive a new system of equations in terms of the surface normals whose two solutions can be obtained in closed form and can easily be disambiguated locally. Our formalism further allows us to assess how reliable the estimated local normals are and, hence, to discard them if they are not. Our experiments show that our reconstructions, obtained from two or more views, are significantly more accurate than those of state-of-the-art methods, while also being faster.

3D Registration for Self-Occluded Objects in Context

Nov 23, 2020

Zheng Dang, Fei Wang, Mathieu Salzmann

Abstract: While much progress has been made on the task of 3D point cloud registration, there still exists no learning-based method able to estimate the 6D pose of an object observed by a 2.5D sensor in a scene. The challenges of this scenario include the fact that most measurements are outliers depicting the object's surrounding context, and the mismatch between the complete 3D object model and its self-occluded observations. We introduce the first deep learning framework capable of effectively handling this scenario. Our method consists of an instance segmentation module followed by a pose estimation one. It allows us to perform 3D registration in a one-shot manner, without requiring an expensive iterative procedure. We further develop an on-the-fly rendering-based training strategy that is both time- and memory-efficient. Our experiments demonstrate the superiority of our approach over state-of-the-art traditional and learning-based 3D registration methods.

* 8 pages

Self-supervised Segmentation via Background Inpainting

Nov 11, 2020

Isinsu Katircioglu, Helge Rhodin, Victor Constantin, Jörg Spörri, Mathieu Salzmann, Pascal Fua

Abstract: While supervised object detection and segmentation methods achieve impressive accuracy, they generalize poorly to images whose appearance significantly differs from the data they have been trained on. To address this when annotating data is prohibitively expensive, we introduce a self-supervised detection and segmentation approach that can work with single images captured by a potentially moving camera. At the heart of our approach lies the observation that object segmentation and background reconstruction are linked tasks, and that, for structured scenes, background regions can be re-synthesized from their surroundings, whereas regions depicting the moving object cannot. We encode this intuition into a self-supervised loss function that we exploit to train a proposal-based segmentation network. To account for the discrete nature of the proposals, we develop a Monte Carlo-based training strategy that allows the algorithm to explore the large space of object proposals. We apply our method to human detection and segmentation in images that visually depart from those of standard benchmarks and outperform existing self-supervised methods.

* arXiv admin note: text overlap with arXiv:1907.08051

Better Patch Stitching for Parametric Surface Reconstruction

Oct 14, 2020

Zhantao Deng, Jan Bednařík, Mathieu Salzmann, Pascal Fua

Abstract: Recently, parametric mappings have emerged as highly effective surface representations, yielding low reconstruction error. In particular, the latest works represent the target shape as an atlas of multiple mappings, which can closely encode object parts. Atlas representations, however, suffer from one major drawback: The individual mappings are not guaranteed to be consistent, which results in holes in the reconstructed shape or in jagged surface areas. We introduce an approach that explicitly encourages global consistency of the local mappings. To this end, we introduce two novel loss terms. The first term exploits the surface normals and requires that they remain locally consistent when estimated within and across the individual mappings. The second term further encourages a better spatial configuration of the mappings by minimizing a novel stitching error. We show on standard benchmarks that the normal consistency requirement outperforms the baselines quantitatively, while enforcing better stitching leads to much better visual quality of the reconstructed objects than the state of the art.

* Accepted to 3DV 2020

Motion Prediction Using Temporal Inception Module

Oct 06, 2020

Tim Lebailly, Sena Kiciroglu, Mathieu Salzmann, Pascal Fua, Wei Wang

Abstract: Human motion prediction is a necessary component for many applications in robotics and autonomous driving. Recent methods propose using sequence-to-sequence deep learning models to tackle this problem. However, they do not focus on exploiting different temporal scales for inputs of different lengths. We argue that diverse temporal scales are important as they allow us to look at the past frames with different receptive fields, which can lead to better predictions. In this paper, we propose a Temporal Inception Module (TIM) to encode human motion. Making use of TIM, our framework produces input embeddings using convolutional layers, with different kernel sizes for different input lengths. The experimental results on the standard motion prediction benchmarks Human3.6M and the CMU motion capture dataset show that our approach consistently outperforms state-of-the-art methods.

* 16 pages, 4 figures. To appear in the proceedings of the 15th Asian Conference on Computer Vision, ACCV 2020
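
The idea of embedding the recent past at several temporal scales, with a different kernel size for each input length, can be sketched as follows. This is only a toy illustration: the function name `multi_scale_embedding`, the specific input lengths and kernel sizes, and the fixed averaging kernels (standing in for learned convolution weights) are all assumptions, not the configuration described in the paper.

```python
import numpy as np


def multi_scale_embedding(trajectory, branches):
    """Embed one joint coordinate over time at several temporal scales.

    trajectory: array of shape (T,), the observed past of one coordinate.
    branches:   list of (input_length, kernel) pairs; each branch looks at
                the last `input_length` frames with its own 1-D kernel.
    Returns the concatenation of all branch outputs.
    """
    trajectory = np.asarray(trajectory, dtype=float)
    parts = []
    for input_length, kernel in branches:
        recent = trajectory[-input_length:]
        # 'valid' cross-correlation: output length is input_length - k + 1.
        parts.append(np.correlate(recent, np.asarray(kernel, dtype=float), mode="valid"))
    return np.concatenate(parts)


# Hypothetical configuration: short inputs get small kernels, longer inputs
# get larger ones. Averaging kernels replace learned weights in this sketch.
branches = [
    (5, np.ones(2) / 2),
    (5, np.ones(3) / 3),
    (10, np.ones(5) / 5),
    (10, np.ones(7) / 7),
]
embedding = multi_scale_embedding(np.arange(20.0), branches)
```

Short branches with small kernels react to the most recent frames, while longer branches with larger kernels summarize slower trends; concatenating them gives a downstream predictor both views of the past.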
