Mathieu Salzmann

CVLab, EPFL, Switzerland

Indirect Local Attacks for Context-aware Semantic Segmentation Networks

Dec 02, 2019

Krishna Kanth Nakka, Mathieu Salzmann

Abstract: Recently, deep networks have achieved impressive semantic segmentation performance, in particular thanks to their use of larger contextual information. In this paper, we show that the resulting networks are sensitive not only to global attacks, where perturbations affect the entire input image, but also to indirect local attacks where perturbations are confined to a small image region that does not overlap with the area that we aim to fool. To this end, we introduce several indirect attack strategies, including adaptive local attacks, aiming to find the best image location to perturb, and universal local attacks. Furthermore, we propose attack detection techniques both for the global image level and to obtain a pixel-wise localization of the fooled regions. Our results are unsettling: Because they exploit a larger context, more accurate semantic segmentation networks are more sensitive to indirect local attacks.
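
To illustrate what "confined to a small image region" means, here is a minimal sketch of the masking constraint only. It is not the authors' attack: the function name, the binary mask, and the `epsilon` bound are assumptions made for illustration, and the optimisation that would actually produce the perturbation is left out.

```python
def apply_local_perturbation(image, perturbation, mask, epsilon):
    """Add a bounded perturbation to `image` only where `mask` is 1.

    `image`, `perturbation` and `mask` are 2D lists of the same shape.
    Pixel values are assumed to lie in [0, 1] (an assumption of this sketch).
    """
    attacked = []
    for img_row, pert_row, mask_row in zip(image, perturbation, mask):
        row = []
        for pixel, delta, inside in zip(img_row, pert_row, mask_row):
            # Bound the perturbation magnitude (an L-infinity constraint).
            delta = max(-epsilon, min(epsilon, delta))
            # Pixels outside the mask (inside == 0) receive no perturbation.
            value = pixel + delta * inside
            # Keep the result a valid pixel intensity.
            row.append(max(0.0, min(1.0, value)))
        attacked.append(row)
    return attacked
```

In an indirect attack as described in the abstract, the mask would cover a region that does not overlap the area whose predicted labels the attacker wants to change.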

Using Depth for Pixel-Wise Detection of Adversarial Attacks in Crowd Counting

Nov 26, 2019

Weizhe Liu, Mathieu Salzmann, Pascal Fua

Abstract: State-of-the-art methods for counting people in crowded scenes rely on deep networks to estimate crowd density. While effective, deep learning approaches are vulnerable to adversarial attacks, which, in a crowd-counting context, can lead to serious security issues. However, attack and defense mechanisms have been virtually unexplored in regression tasks, let alone for crowd density estimation. In this paper, we investigate the effectiveness of existing attack strategies on crowd-counting networks, and introduce a simple yet effective pixel-wise detection mechanism. It builds on the intuition that, when attacking a multitask network, in our case estimating crowd density and scene depth, both outputs will be perturbed, and thus the second one can be used for detection purposes. We will demonstrate that this significantly outperforms heuristic-based and uncertainty-based strategies.

Shape Reconstruction by Learning Differentiable Surface Representations

Nov 25, 2019

Jan Bednarik, Shaifali Parashar, Erhan Gundogdu, Mathieu Salzmann, Pascal Fua

Abstract: Generative models that produce point clouds have emerged as a powerful tool to represent 3D surfaces, and the best current ones rely on learning an ensemble of parametric representations. Unfortunately, they offer no control over the deformations of the surface patches that form the ensemble and thus fail to prevent them from either overlapping or collapsing into single points or lines. As a consequence, computing shape properties such as surface normals and curvatures becomes difficult and unreliable. In this paper, we show that we can exploit the inherent differentiability of deep networks to leverage differential surface properties during training so as to prevent patch collapse and strongly reduce patch overlap. Furthermore, this lets us reliably compute quantities such as surface normals and curvatures. We will demonstrate on several tasks that this yields more accurate surface reconstructions than state-of-the-art methods in terms of normal estimation and the number of collapsed and overlapping patches.

* 14 pages

Estimating People Flows to Better Count them in Crowded Scenes

Nov 25, 2019

Weizhe Liu, Mathieu Salzmann, Pascal Fua

Abstract: State-of-the-art methods for counting people in crowded scenes rely on deep networks to estimate people densities in individual images. As such, only very few take advantage of temporal consistency in video sequences, and those that do only impose weak smoothness constraints across consecutive frames. In this paper, we show that estimating people flows across image locations between consecutive images and inferring the people densities from these flows instead of directly regressing them makes it possible to impose much stronger constraints encoding the conservation of the number of people, which significantly boost performance without requiring a more complex architecture. Furthermore, it also enables us to exploit the correlation between people flow and optical flow to further improve the results. We will demonstrate that we consistently outperform state-of-the-art methods on five benchmark datasets.
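
The conservation idea can be sketched with a toy example. This is an illustration under simplifying assumptions, not the authors' model: cells are indexed by integers, flows are stored in a hypothetical dictionary keyed by `(source_cell, destination_cell)`, and people entering or leaving the scene are ignored.

```python
def densities_from_flows(flows, num_cells):
    """Derive per-cell people counts at two consecutive frames from flows.

    `flows[(src, dst)]` is the number of people moving from cell `src` in the
    previous frame to cell `dst` in the next frame (src == dst means staying).
    """
    prev_density = [0.0] * num_cells
    next_density = [0.0] * num_cells
    for (src, dst), count in flows.items():
        # Everyone who was in `src` must go somewhere: outgoing flows sum
        # to the previous-frame count of that cell.
        prev_density[src] += count
        # Everyone now in `dst` must have come from somewhere: incoming
        # flows sum to the next-frame count of that cell.
        next_density[dst] += count
    return prev_density, next_density


flows = {(0, 0): 2.0, (0, 1): 1.0, (1, 2): 3.0}
prev_density, next_density = densities_from_flows(flows, num_cells=3)
# Both frames contain the same total number of people by construction.
assert sum(prev_density) == sum(next_density)
```

Because both densities are read off the same flow values, the total count is preserved automatically, which is the kind of constraint that direct per-frame density regression does not enforce.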

Single-Stage 6D Object Pose Estimation

Nov 19, 2019

Yinlin Hu, Pascal Fua, Wei Wang, Mathieu Salzmann

Abstract: Most recent 6D pose estimation frameworks first rely on a deep network to establish correspondences between 3D object keypoints and 2D image locations and then use a variant of a RANSAC-based Perspective-n-Point (PnP) algorithm. This two-stage process, however, is suboptimal: First, it is not end-to-end trainable. Second, training the deep network relies on a surrogate loss that does not directly reflect the final 6D pose estimation task. In this work, we introduce a deep architecture that directly regresses 6D poses from correspondences. It takes as input a group of candidate correspondences for each 3D keypoint and accounts for the fact that the order of the correspondences within each group is irrelevant, while the order of the groups, that is, of the 3D keypoints, is fixed. Our architecture is generic and can thus be exploited in conjunction with existing correspondence-extraction networks so as to yield single-stage 6D pose estimation frameworks. Our experiments demonstrate that these single-stage frameworks consistently outperform their two-stage counterparts in terms of both accuracy and speed.
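
One common way to obtain the ordering property described above is to pool within each group and concatenate across groups. The sketch below assumes element-wise max pooling, which the abstract does not specify; the function name and the feature layout are hypothetical.

```python
def pool_groups(groups):
    """Pool each group of feature vectors, then concatenate in group order.

    `groups` is a list (one entry per 3D keypoint) of lists of equally sized
    feature vectors (one per candidate correspondence).
    """
    pooled = []
    for group in groups:
        dim = len(group[0])
        # Element-wise max ignores the order of vectors inside a group.
        pooled.extend(max(vec[d] for vec in group) for d in range(dim))
    # Concatenation keeps the (fixed) order of the groups themselves.
    return pooled


groups = [[[1, 5], [3, 2]], [[0, 0], [4, 1]]]
shuffled_within = [[[3, 2], [1, 5]], [[4, 1], [0, 0]]]
assert pool_groups(groups) == pool_groups(shuffled_within)
```

Swapping the two groups, by contrast, changes the output, which reflects the fixed order of the 3D keypoints.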

Field typing for improved recognition on heterogeneous handwritten forms

Sep 23, 2019

Ciprian Tomoiaga, Paul Feng, Mathieu Salzmann, Patrick Jayet

Abstract: Offline handwriting recognition has undergone continuous progress over the past decades. However, existing methods are typically benchmarked on free-form text datasets that are biased towards good-quality images and handwriting styles, and homogeneous content. In this paper, we show that state-of-the-art algorithms, employing long short-term memory (LSTM) layers, do not readily generalize to real-world structured documents, such as forms, due to their highly heterogeneous and out-of-vocabulary content, and to the inherent ambiguities of this content. To address this, we propose to leverage the content type within an LSTM-based architecture. Furthermore, we introduce a procedure to generate synthetic data to train this architecture without requiring expensive manual annotations. We demonstrate the effectiveness of our approach at transcribing text on a challenging, real-world dataset of European Accident Statements.

Learning Trajectory Dependencies for Human Motion Prediction

Aug 16, 2019

Wei Mao, Miaomiao Liu, Mathieu Salzmann, Hongdong Li

Abstract: Human motion prediction, i.e., forecasting future body poses given an observed pose sequence, has typically been tackled with recurrent neural networks (RNNs). However, as evidenced by prior work, the resulting RNN models suffer from the accumulation of prediction errors, leading to undesired discontinuities in motion prediction. In this paper, we propose a simple feed-forward deep network for motion prediction, which takes into account both temporal smoothness and spatial dependencies among human body joints. In this context, we then propose to encode temporal information by working in trajectory space, instead of the traditionally used pose space. This relieves us of manually defining the range of temporal dependencies (or temporal convolutional filter size, as done in previous work). Moreover, the spatial dependency of human pose is encoded by treating a human pose as a generic graph (rather than a human skeletal kinematic tree) formed by links between every pair of body joints. Instead of using a pre-defined graph structure, we design a new graph convolutional network to learn graph connectivity automatically. This allows the network to capture long-range dependencies beyond those of the human kinematic tree. We evaluate our approach on several standard benchmark datasets for motion prediction, including Human3.6M, the CMU motion capture dataset and 3DPW. Our experiments clearly demonstrate that the proposed approach achieves state-of-the-art performance, and is applicable to both angle-based and position-based pose representations. The code is available at https://github.com/wei-mao-2019/LearnTrajDep

* Accepted by ICCV 2019 (Oral)

Learning Variations in Human Motion via Mix-and-Match Perturbation

Aug 02, 2019

Mohammad Sadegh Aliakbarian, Fatemeh Sadat Saleh, Mathieu Salzmann, Lars Petersson, Stephen Gould, Amirhossein Habibian

Abstract: Human motion prediction is a stochastic process: Given an observed sequence of poses, multiple future motions are plausible. Existing approaches to modeling this stochasticity typically combine a random noise vector with information about the previous poses. This combination, however, is done in a deterministic manner, which gives the network the flexibility to learn to ignore the random noise. In this paper, we introduce an approach to stochastically combine the root of variations with previous pose information, which forces the model to take the noise into account. We exploit this idea for motion prediction by incorporating it into a recurrent encoder-decoder network with a conditional variational autoencoder block that learns to exploit the perturbations. Our experiments demonstrate that our model yields high-quality pose sequences that are much more diverse than those from state-of-the-art stochastic motion prediction techniques.

Self-supervised Training of Proposal-based Segmentation via Background Prediction

Jul 18, 2019

Isinsu Katircioglu, Helge Rhodin, Victor Constantin, Jörg Spörri, Mathieu Salzmann, Pascal Fua

Abstract: While supervised object detection methods achieve impressive accuracy, they generalize poorly to images whose appearance significantly differs from the data they have been trained on. To address this in scenarios where annotating data is prohibitively expensive, we introduce a self-supervised approach to object detection and segmentation, able to work with monocular images captured with a moving camera. At the heart of our approach lies the observation that segmentation and background reconstruction are linked tasks, and the idea that, because we observe a structured scene, background regions can be re-synthesized from their surroundings, whereas regions depicting the object cannot. We therefore encode this intuition as a self-supervised loss function that we exploit to train a proposal-based segmentation network. To account for the discrete nature of object proposals, we develop a Monte Carlo-based training strategy that allows us to explore the large space of object proposals. Our experiments demonstrate that our approach yields accurate detections and segmentations in images that visually depart from those of standard benchmarks, outperforming existing self-supervised methods and approaching weakly supervised ones that exploit large annotated datasets.

Backpropagation-Friendly Eigendecomposition

Jun 27, 2019

Wei Wang, Zheng Dang, Yinlin Hu, Pascal Fua, Mathieu Salzmann

Abstract: Eigendecomposition (ED) is widely used in deep networks. However, the backpropagation of its results tends to be numerically unstable, whether using ED directly or approximating it with the Power Iteration (PI) method, particularly when dealing with large matrices. While this can be mitigated by partitioning the data into small and arbitrary groups, doing so has no theoretical basis and makes it impossible to exploit the power of ED to the full. In this paper, we introduce a numerically stable and differentiable approach to leveraging eigenvectors in deep networks. It can handle large matrices without needing to split them. We demonstrate the better robustness of our approach over standard ED and PI for ZCA whitening, an alternative to batch normalization, and for PCA denoising, which we introduce as a new normalization strategy for deep networks, aiming to further denoise the network's features.
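
For readers unfamiliar with the Power Iteration method mentioned above, here is a minimal pure-Python sketch of the textbook algorithm for the dominant eigenpair of a symmetric matrix. It is not the stable approach the paper proposes; the function name, the uniform starting vector, and the fixed iteration count are assumptions of this sketch.

```python
import math


def power_iteration(matrix, num_iters=100):
    """Approximate the dominant eigenvalue and eigenvector of a square matrix.

    `matrix` is a list of rows. The start vector is uniform, so this sketch
    assumes the dominant eigenvector is not orthogonal to it.
    """
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(num_iters):
        # Multiply by the matrix, then renormalise to unit length.
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("matrix maps the current vector to zero")
        v = [x / norm for x in w]
    # Rayleigh quotient v^T A v gives the eigenvalue estimate for unit v.
    mv = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(v[i] * mv[i] for i in range(n))
    return eigenvalue, v


value, vector = power_iteration([[4.0, 1.0], [1.0, 3.0]])
# The exact dominant eigenvalue of this matrix is (7 + sqrt(5)) / 2.
assert abs(value - (7 + math.sqrt(5)) / 2) < 1e-9
```

Convergence slows as the two largest eigenvalues get closer in magnitude, so the fixed iteration count used here would need to be revisited for less well-separated spectra.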
