Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Mathieu Salzmann

CVLab EPFL Switzerland

Deep Attentional Structured Representation Learning for Visual Recognition

May 14, 2018

Krishna Kanth Nakka, Mathieu Salzmann

Figure 1 for Deep Attentional Structured Representation Learning for Visual Recognition

Figure 2 for Deep Attentional Structured Representation Learning for Visual Recognition

Figure 3 for Deep Attentional Structured Representation Learning for Visual Recognition

Figure 4 for Deep Attentional Structured Representation Learning for Visual Recognition

Abstract:Structured representations, such as Bags of Words, VLAD and Fisher Vectors, have proven highly effective to tackle complex visual recognition tasks. As such, they have recently been incorporated into deep architectures. However, while effective, the resulting deep structured representation learning strategies typically aggregate local features from the entire image, ignoring the fact that, in complex recognition tasks, some regions provide much more discriminative information than others. In this paper, we introduce an attentional structured representation learning framework that incorporates an image-specific attention mechanism within the aggregation process. Our framework learns to predict jointly the image class label and an attention map in an end-to-end fashion and without any other supervision than the target label. As evidenced by our experiments, this consistently outperforms attention-less structured representation learning and yields state-of-the-art results on standard scene recognition and fine-grained categorization benchmarks.

Via

Access Paper or Ask Questions

Geometry-aware Deep Network for Single-Image Novel View Synthesis

Apr 17, 2018

Miaomiao Liu, Xuming He, Mathieu Salzmann

Figure 1 for Geometry-aware Deep Network for Single-Image Novel View Synthesis

Figure 2 for Geometry-aware Deep Network for Single-Image Novel View Synthesis

Figure 3 for Geometry-aware Deep Network for Single-Image Novel View Synthesis

Figure 4 for Geometry-aware Deep Network for Single-Image Novel View Synthesis

Abstract:This paper tackles the problem of novel view synthesis from a single image. In particular, we target real-world scenes with rich geometric structure, a challenging task due to the large appearance variations of such scenes and the lack of simple 3D models to represent them. Modern, learning-based approaches mostly focus on appearance to synthesize novel views and thus tend to generate predictions that are inconsistent with the underlying scene structure. By contrast, in this paper, we propose to exploit the 3D geometry of the scene to synthesize a novel view. Specifically, we approximate a real-world scene by a fixed number of planes, and learn to predict a set of homographies and their corresponding region masks to transform the input image into a novel view. To this end, we develop a new region-aware geometric transform network that performs these multiple tasks in a common framework. Our results on the outdoor KITTI and the indoor ScanNet datasets demonstrate the effectiveness of our network in generating high quality synthetic views that respect the scene geometry, thus outperforming the state-of-the-art methods.

* CVPR 2018

Via

Access Paper or Ask Questions

Unsupervised Geometry-Aware Representation for 3D Human Pose Estimation

Apr 03, 2018

Helge Rhodin, Mathieu Salzmann, Pascal Fua

Figure 1 for Unsupervised Geometry-Aware Representation for 3D Human Pose Estimation

Figure 2 for Unsupervised Geometry-Aware Representation for 3D Human Pose Estimation

Figure 3 for Unsupervised Geometry-Aware Representation for 3D Human Pose Estimation

Figure 4 for Unsupervised Geometry-Aware Representation for 3D Human Pose Estimation

Abstract:Modern 3D human pose estimation techniques rely on deep networks, which require large amounts of training data. While weakly-supervised methods require less supervision, by utilizing 2D poses or multi-view imagery without annotations, they still need a sufficiently large set of samples with 3D annotations for learning to succeed. In this paper, we propose to overcome this problem by learning a geometry-aware body representation from multi-view images without annotations. To this end, we use an encoder-decoder that predicts an image from one viewpoint given an image from another viewpoint. Because this representation encodes 3D geometry, using it in a semi-supervised setting makes it easier to learn a mapping from it to 3D human pose. As evidenced by our experiments, our approach significantly outperforms fully-supervised methods given the same amount of labeled data, and improves over other semi-supervised methods while using as little as 1% of the labeled data.

Via

Access Paper or Ask Questions

Eigendecomposition-free Training of Deep Networks with Zero Eigenvalue-based Losses

Mar 26, 2018

Zheng Dang, Kwang Moo Yi, Yinlin Hu, Fei Wang, Pascal Fua, Mathieu Salzmann

Figure 1 for Eigendecomposition-free Training of Deep Networks with Zero Eigenvalue-based Losses

Figure 2 for Eigendecomposition-free Training of Deep Networks with Zero Eigenvalue-based Losses

Figure 3 for Eigendecomposition-free Training of Deep Networks with Zero Eigenvalue-based Losses

Figure 4 for Eigendecomposition-free Training of Deep Networks with Zero Eigenvalue-based Losses

Abstract:Many classical Computer Vision problems, such as essential matrix computation and pose estimation from 3D to 2D correspondences, can be solved by finding the eigenvector corresponding to the smallest, or zero, eigenvalue of a matrix representing a linear system. Incorporating this in deep learning frameworks would allow us to explicitly encode known notions of geometry, instead of having the network implicitly learn them from data. However, performing eigendecomposition within a network requires the ability to differentiate this operation. Unfortunately, while theoretically doable, this introduces numerical instability in the optimization process in practice. In this paper, we introduce an eigendecomposition-free approach to training a deep network whose loss depends on the eigenvector corresponding to a zero eigenvalue of a matrix predicted by the network. We demonstrate on several tasks, including keypoint matching and 3D pose estimation, that our approach is much more robust than explicit differentiation of the eigendecomposition, It has better convergence properties and yields state-of-the-art results on both tasks.

* 25 pages

Via

Access Paper or Ask Questions

Learning Monocular 3D Human Pose Estimation from Multi-view Images

Mar 24, 2018

Helge Rhodin, Jörg Spörri, Isinsu Katircioglu, Victor Constantin, Frédéric Meyer, Erich Müller, Mathieu Salzmann, Pascal Fua

Figure 1 for Learning Monocular 3D Human Pose Estimation from Multi-view Images

Figure 2 for Learning Monocular 3D Human Pose Estimation from Multi-view Images

Figure 3 for Learning Monocular 3D Human Pose Estimation from Multi-view Images

Figure 4 for Learning Monocular 3D Human Pose Estimation from Multi-view Images

Abstract:Accurate 3D human pose estimation from single images is possible with sophisticated deep-net architectures that have been trained on very large datasets. However, this still leaves open the problem of capturing motions for which no such database exists. Manual annotation is tedious, slow, and error-prone. In this paper, we propose to replace most of the annotations by the use of multiple views, at training time only. Specifically, we train the system to predict the same pose in all views. Such a consistency constraint is necessary but not sufficient to predict accurate poses. We therefore complement it with a supervised loss aiming to predict the correct pose in a small set of labeled images, and with a regularization term that penalizes drift from initial predictions. Furthermore, we propose a method to estimate camera pose jointly with human pose, which lets us utilize multi-view footage where calibration is difficult, e.g., for pan-tilt or moving handheld cameras. We demonstrate the effectiveness of our approach on established benchmarks, as well as on a new Ski dataset with rotating cameras and expert ski motion, for which annotations are truly hard to obtain.

* CVPR 2018, Ski-Pose PTZ-Camera Dataset available

Via

Access Paper or Ask Questions

Geometric and Physical Constraints for Head Plane Crowd Density Estimation in Videos

Mar 23, 2018

Weizhe Liu, Krzysztof Lis, Mathieu Salzmann, Pascal Fua

Figure 1 for Geometric and Physical Constraints for Head Plane Crowd Density Estimation in Videos

Figure 2 for Geometric and Physical Constraints for Head Plane Crowd Density Estimation in Videos

Figure 3 for Geometric and Physical Constraints for Head Plane Crowd Density Estimation in Videos

Figure 4 for Geometric and Physical Constraints for Head Plane Crowd Density Estimation in Videos

Abstract:State-of-the-art methods of people counting in crowded scenes rely on deep networks to estimate people density in the image plane. Perspective distortion effects are handled implicitly by either learning scale-invariant features or estimating density in patches of different sizes, neither of which accounts for the fact that scale changes must be consistent over the whole scene. In this paper, we show that feeding an explicit model of the scale changes to the network considerably increases performance. An added benefit is that it lets us reason in terms of number of people per square meter on the ground, allowing us to enforce physically-inspired temporal consistency constraints that do not have to be learned. This yields an algorithm that outperforms state-of-the-art methods on crowded scenes, especially when perspective effects are strong.

Via

Access Paper or Ask Questions

Residual Parameter Transfer for Deep Domain Adaptation

Nov 21, 2017

Artem Rozantsev, Mathieu Salzmann, Pascal Fua

Figure 1 for Residual Parameter Transfer for Deep Domain Adaptation

Figure 2 for Residual Parameter Transfer for Deep Domain Adaptation

Figure 3 for Residual Parameter Transfer for Deep Domain Adaptation

Figure 4 for Residual Parameter Transfer for Deep Domain Adaptation

Abstract:The goal of Deep Domain Adaptation is to make it possible to use Deep Nets trained in one domain where there is enough annotated training data in another where there is little or none. Most current approaches have focused on learning feature representations that are invariant to the changes that occur when going from one domain to the other, which means using the same network parameters in both domains. While some recent algorithms explicitly model the changes by adapting the network parameters, they either severely restrict the possible domain changes, or significantly increase the number of model parameters. By contrast, we introduce a network architecture that includes auxiliary residual networks, which we train to predict the parameters in the domain with little annotated data from those in the other one. This architecture enables us to flexibly preserve the similarities between domains where they exist and model the differences when necessary. We demonstrate that our approach yields higher accuracy than state-of-the-art methods without undue complexity.

Via

Access Paper or Ask Questions

Compression-aware Training of Deep Networks

Nov 13, 2017

Jose M. Alvarez, Mathieu Salzmann

Figure 1 for Compression-aware Training of Deep Networks

Figure 2 for Compression-aware Training of Deep Networks

Figure 3 for Compression-aware Training of Deep Networks

Figure 4 for Compression-aware Training of Deep Networks

Abstract:In recent years, great progress has been made in a variety of application domains thanks to the development of increasingly deeper neural networks. Unfortunately, the huge number of units of these networks makes them expensive both computationally and memory-wise. To overcome this, exploiting the fact that deep networks are over-parametrized, several compression strategies have been proposed. These methods, however, typically start from a network that has been trained in a standard manner, without considering such a future compression. In this paper, we propose to explicitly account for compression in the training process. To this end, we introduce a regularizer that encourages the parameter matrix of each layer to have low rank during training. We show that accounting for compression during training allows us to learn much more compact, yet at least as effective, models than state-of-the-art compression techniques.

* Accepted at NIPS 2017

Via

Access Paper or Ask Questions

Soft Correspondences in Multimodal Scene Parsing

Sep 28, 2017

Sarah Taghavi Namin, Mohammad Najafi, Mathieu Salzmann, Lars Petersson

Figure 1 for Soft Correspondences in Multimodal Scene Parsing

Figure 2 for Soft Correspondences in Multimodal Scene Parsing

Figure 3 for Soft Correspondences in Multimodal Scene Parsing

Figure 4 for Soft Correspondences in Multimodal Scene Parsing

Abstract:Exploiting multiple modalities for semantic scene parsing has been shown to improve accuracy over the singlemodality scenario. However multimodal datasets often suffer from problems such as data misalignment and label inconsistencies, where the existing methods assume that corresponding regions in two modalities must have identical labels. We propose to address this issue, by formulating multimodal semantic labeling as inference in a CRF and introducing latent nodes to explicitly model inconsistencies between two modalities. These latent nodes allow us not only to leverage information from both domains to improve their labeling, but also to cut the edges between inconsistent regions. We propose to learn intradomain and inter-domain potential functions from training data to avoid hand-tuning of the model parameters. We evaluate our approach on two publicly available datasets containing 2D and 3D data. Thanks to our latent nodes and our learning strategy, our method outperforms the state-of-the-art in both cases. Moreover, in order to highlight the benefits of the geometric information and the potential of our method in simultaneous 2D/3D semantic and geometric inference, we performed simultaneous inference of semantic and geometric classes both in 2D and 3D that led to satisfactory improvements of the labeling results in both datasets.

* 16 pages

Via

Access Paper or Ask Questions

Non-Linear Subspace Clustering with Learned Low-Rank Kernels

Sep 20, 2017

Pan Ji, Ian Reid, Ravi Garg, Hongdong Li, Mathieu Salzmann

Figure 1 for Non-Linear Subspace Clustering with Learned Low-Rank Kernels

Figure 2 for Non-Linear Subspace Clustering with Learned Low-Rank Kernels

Figure 3 for Non-Linear Subspace Clustering with Learned Low-Rank Kernels

Figure 4 for Non-Linear Subspace Clustering with Learned Low-Rank Kernels

Abstract:In this paper, we present a kernel subspace clustering method that can handle non-linear models. In contrast to recent kernel subspace clustering methods which use predefined kernels, we propose to learn a low-rank kernel matrix, with which mapped data in feature space are not only low-rank but also self-expressive. In this manner, the low-dimensional subspace structures of the (implicitly) mapped data are retained and manifested in the high-dimensional feature space. We evaluate the proposed method extensively on both motion segmentation and image clustering benchmarks, and obtain superior results, outperforming the kernel subspace clustering method that uses standard kernels[Patel 2014] and other state-of-the-art linear subspace clustering methods.

Via

Access Paper or Ask Questions