Sanja Fidler

NVIDIA, University of Toronto, Vector Institute

Video Face Clustering with Unknown Number of Clusters

Aug 20, 2019

Makarand Tapaswi, Marc T. Law, Sanja Fidler

Abstract: Understanding videos such as TV series and movies requires analyzing who the characters are and what they are doing. We address the challenging problem of clustering face tracks based on their identity. Different from previous work in this area, we choose to operate in a realistic and difficult setting where: (i) the number of characters is not known a priori; and (ii) face tracks belonging to minor or background characters are not discarded. To this end, we propose Ball Cluster Learning (BCL), a supervised approach to carve the embedding space into balls of equal size, one for each cluster. The learned ball radius is easily translated to a stopping criterion for iterative merging algorithms. This gives BCL the ability to estimate the number of clusters as well as their assignment, achieving promising results on commonly used datasets. We also present a thorough discussion of how existing metric learning literature can be adapted for this task.

* Accepted to ICCV 2019, code and data at https://github.com/makarandtapaswi/BallClustering_ICCV2019
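
The abstract notes that a learned ball radius can act as the stopping criterion for an iterative merging algorithm. The sketch below illustrates that idea only; it is not the authors' code. It assumes centroid linkage and a stopping threshold of twice the radius in Euclidean distance, both of which are illustrative choices, and the function names and toy data are hypothetical.

```python
import math

def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dim))

def threshold_merge(embeddings, radius):
    """Agglomerative merging that stops once no two cluster centroids
    are closer than 2 * radius (assumed stopping rule, for illustration)."""
    clusters = [[e] for e in embeddings]
    while len(clusters) > 1:
        # Find the closest pair of cluster centroids.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d >= 2 * radius:
            break  # every remaining pair is too far apart to share a ball
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters

# Toy 2D "embeddings": two tight groups and one isolated point.
tracks = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
clusters = threshold_merge(tracks, radius=0.5)
num_clusters = len(clusters)
```

Because merging halts at the threshold rather than at a preset cluster count, the number of clusters is an output of the procedure, and an isolated point (such as a background character) remains its own cluster.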

Learning to Predict 3D Objects with an Interpolation-based Differentiable Renderer

Aug 03, 2019

Wenzheng Chen, Jun Gao, Huan Ling, Edward J. Smith, Jaakko Lehtinen, Alec Jacobson, Sanja Fidler

Abstract: Many machine learning models operate on images, but ignore the fact that images are 2D projections formed by 3D geometry interacting with light, in a process called rendering. Enabling ML models to understand image formation might be key for generalization. However, due to an essential rasterization step involving discrete assignment operations, rendering pipelines are non-differentiable and thus largely inaccessible to gradient-based ML techniques. In this paper, we present DIB-R, a differentiable rendering framework which allows gradients to be analytically computed for all pixels in an image. Key to our approach is to view foreground rasterization as a weighted interpolation of local properties and background rasterization as a distance-based aggregation of global geometry. Our approach allows for accurate optimization over vertex positions, colors, normals, light directions and texture coordinates through a variety of lighting models. We showcase our approach in two ML applications: single-image 3D object prediction, and 3D textured object generation, both trained exclusively using 2D supervision. Our project website is: https://nv-tlabs.github.io/DIB-R/

* https://nv-tlabs.github.io/DIB-R/
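
The abstract describes foreground rasterization as a weighted interpolation of local properties. A minimal sketch of that idea, assuming the weights are the standard barycentric coordinates of a pixel inside a 2D-projected triangle; this is a generic illustration and not DIB-R's implementation, and the function names are hypothetical.

```python
def barycentric_weights(p, a, b, c):
    """Barycentric weights of 2D point p with respect to triangle (a, b, c)."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    denom = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    w_a = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / denom
    w_b = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / denom
    return w_a, w_b, 1.0 - w_a - w_b

def interpolate_attribute(p, triangle, attributes):
    """Blend per-vertex attributes (e.g. RGB colors) at pixel p."""
    weights = barycentric_weights(p, *triangle)
    dim = len(attributes[0])
    return tuple(
        sum(w * attr[k] for w, attr in zip(weights, attributes))
        for k in range(dim)
    )

triangle = ((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
vertex_colors = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
pixel_color = interpolate_attribute((0.25, 0.25), triangle, vertex_colors)
```

The weights are smooth functions of the vertex positions, so a pixel value written this way has well-defined derivatives with respect to both the vertex attributes and the vertex coordinates.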

Gated-SCNN: Gated Shape CNNs for Semantic Segmentation

Jul 12, 2019

Towaki Takikawa, David Acuna, Varun Jampani, Sanja Fidler

Abstract: Current state-of-the-art methods for image segmentation form a dense image representation where the color, shape and texture information are all processed together inside a deep CNN. This, however, may not be ideal, as they contain very different types of information relevant for recognition. Here, we propose a new two-stream CNN architecture for semantic segmentation that explicitly wires shape information as a separate processing branch, i.e. shape stream, that processes information in parallel to the classical stream. Key to this architecture is a new type of gate that connects the intermediate layers of the two streams. Specifically, we use the higher-level activations in the classical stream to gate the lower-level activations in the shape stream, effectively removing noise and helping the shape stream to only focus on processing the relevant boundary-related information. This enables us to use a very shallow architecture for the shape stream that operates on the image-level resolution. Our experiments show that this leads to a highly effective architecture that produces sharper predictions around object boundaries and significantly boosts performance on thinner and smaller objects. Our method achieves state-of-the-art performance on the Cityscapes benchmark, in terms of both mask (mIoU) and boundary (F-score) quality, improving by 2% and 4% over strong baselines.

* Project Website: https://nv-tlabs.github.io/GSCNN/
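
The gating described in the abstract, where higher-level classical-stream activations modulate lower-level shape-stream activations, can be sketched roughly as follows. This is a scalar, per-location toy under stated assumptions: the gate is a sigmoid of a learned linear combination of the two streams, and the names `w_shape`, `w_classical`, and `bias` are hypothetical parameters, not the paper's notation or layer definition.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate_shape_features(shape_feats, classical_feats, w_shape, w_classical, bias):
    """Scale each shape-stream activation by a gate in (0, 1) computed
    from both streams at the same spatial location."""
    gated = []
    for s, r in zip(shape_feats, classical_feats):
        alpha = sigmoid(w_shape * s + w_classical * r + bias)
        gated.append(alpha * s)  # low alpha suppresses the shape activation
    return gated

# Two locations with equal shape response; the classical stream marks only
# the first one as relevant (e.g. near a boundary).
gated = gate_shape_features(
    shape_feats=[1.0, 1.0],
    classical_feats=[4.0, -4.0],
    w_shape=0.0, w_classical=1.0, bias=0.0,
)
```

In this toy case the first activation passes almost unchanged while the second is nearly zeroed, which is the noise-suppression effect the abstract attributes to the gates.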

Neural Graph Evolution: Towards Efficient Automatic Robot Design

Jun 12, 2019

Tingwu Wang, Yuhao Zhou, Sanja Fidler, Jimmy Ba

Abstract: Despite the recent successes in robotic locomotion control, the design of robots relies heavily on human engineering. Automatic robot design has been a long-studied subject, but recent progress has been slowed by the large combinatorial search space and the difficulty in evaluating the found candidates. To address the two challenges, we formulate automatic robot design as a graph search problem and perform evolution search in graph space. We propose Neural Graph Evolution (NGE), which performs selection on current candidates and evolves new ones iteratively. Different from previous approaches, NGE uses graph neural networks to parameterize the control policies, which reduces evaluation cost on new candidates with the help of skill transfer from previously evaluated designs. In addition, NGE applies Graph Mutation with Uncertainty (GM-UC) by incorporating model uncertainty, which reduces the search space by balancing exploration and exploitation. We show that NGE significantly outperforms previous methods by an order of magnitude. As shown in experiments, NGE is the first algorithm that can automatically discover kinematically preferred robotic graph structures, such as a fish with two symmetrical flat side-fins and a tail, or a cheetah with athletic front and back legs. Instead of using thousands of cores for weeks, NGE efficiently solves the search problem within a day on a single 64 CPU-core Amazon EC2 machine.

* ICLR 2019

EigenDamage: Structured Pruning in the Kronecker-Factored Eigenbasis

May 15, 2019

Chaoqi Wang, Roger Grosse, Sanja Fidler, Guodong Zhang

Abstract: Reducing the test-time resource requirements of a neural network while preserving test accuracy is crucial for running inference on resource-constrained devices. To achieve this goal, we introduce a novel network reparameterization based on the Kronecker-factored eigenbasis (KFE), and then apply Hessian-based structured pruning methods in this basis. As opposed to existing Hessian-based pruning algorithms which do pruning in parameter coordinates, our method works in the KFE where different weights are approximately independent, enabling accurate pruning and fast computation. We demonstrate empirically the effectiveness of the proposed method through extensive experiments. In particular, we highlight that the improvements are especially significant for more challenging datasets and networks. With negligible loss of accuracy, an iterative-pruning version gives a 10$\times$ reduction in model size and an 8$\times$ reduction in FLOPs on wide ResNet32.

* ICML 2019
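
For readers unfamiliar with the Kronecker-factored eigenbasis, the following is a sketch of the standard construction for a single fully connected layer, under the usual Kronecker-factored curvature approximation. The symbols $A$, $S$, $Q_A$, $Q_S$ and $W'$ are notation assumed here for illustration (with column-stacking $\mathrm{vec}$); the paper's own notation and its pruning criterion may differ.

```latex
% Curvature approximated as a Kronecker product of two small factors:
%   A = second moment of the layer inputs,
%   S = second moment of the gradients w.r.t. the layer pre-activations.
F \approx A \otimes S,
\qquad
A = Q_A \Lambda_A Q_A^{\top},
\qquad
S = Q_S \Lambda_S Q_S^{\top}

% The Kronecker product of the eigenvector matrices diagonalizes the approximation:
A \otimes S
  = (Q_A \otimes Q_S)\,(\Lambda_A \otimes \Lambda_S)\,(Q_A \otimes Q_S)^{\top}

% Weights expressed in that basis (W has shape outputs x inputs):
W' = Q_S^{\top} W Q_A
```

Since $\Lambda_A \otimes \Lambda_S$ is diagonal, the entries of $W'$ are decoupled under this approximation, which is the sense in which weights in the KFE are "approximately independent".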

DARNet: Deep Active Ray Network for Building Segmentation

May 15, 2019

Dominic Cheng, Renjie Liao, Sanja Fidler, Raquel Urtasun

Abstract: In this paper, we propose a Deep Active Ray Network (DARNet) for automatic building segmentation. Taking an image as input, it first exploits a deep convolutional neural network (CNN) as the backbone to predict energy maps, which are further utilized to construct an energy function. A polygon-based contour is then evolved by minimizing the energy function, the minimum of which defines the final segmentation. Instead of parameterizing the contour using Euclidean coordinates, we adopt polar coordinates, i.e., rays, which not only prevents self-intersection but also simplifies the design of the energy function. Moreover, we propose a loss function that directly encourages the contours to match building boundaries. Our DARNet is trained end-to-end by back-propagating through the energy minimization and the backbone CNN, which makes the CNN adapt to the dynamics of the contour evolution. Experiments on three building instance segmentation datasets demonstrate that our DARNet achieves performance that is either state-of-the-art or comparable to that of competing methods.

* CVPR 2019
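
The ray parameterization mentioned in the abstract can be illustrated with a short sketch. It assumes rays cast from a fixed center at evenly spaced angles, with one length per ray as the only free variable; the function names are hypothetical, and this is not the DARNet energy or its optimization procedure.

```python
import math

def rays_to_polygon(center, ray_lengths):
    """Convert one length per evenly spaced ray into polygon vertices."""
    cx, cy = center
    n = len(ray_lengths)
    return [
        (cx + r * math.cos(2 * math.pi * k / n),
         cy + r * math.sin(2 * math.pi * k / n))
        for k, r in enumerate(ray_lengths)
    ]

def polygon_area(vertices):
    """Shoelace formula for the area of a simple polygon."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

# Four unit-length rays give a square with diagonal 2.
contour = rays_to_polygon(center=(0.0, 0.0), ray_lengths=[1.0, 1.0, 1.0, 1.0])
area = polygon_area(contour)
```

Because the angles are fixed and increasing, any choice of positive ray lengths yields a star-shaped polygon around the center, so the contour cannot cross itself as the lengths are updated.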

Meta-Sim: Learning to Generate Synthetic Datasets

Apr 25, 2019

Amlan Kar, Aayush Prakash, Ming-Yu Liu, Eric Cameracci, Justin Yuan, Matt Rusiniak, David Acuna, Antonio Torralba, Sanja Fidler

Abstract: Training models to high-end performance requires availability of large labeled datasets, which are expensive to get. The goal of our work is to automatically synthesize labeled datasets that are relevant for a downstream task. We propose Meta-Sim, which learns a generative model of synthetic scenes, and obtains images as well as their corresponding ground truth via a graphics engine. We parametrize our dataset generator with a neural network, which learns to modify attributes of scene graphs obtained from probabilistic scene grammars, so as to minimize the distribution gap between its rendered outputs and target data. If the real dataset comes with a small labeled validation set, we additionally aim to optimize a meta-objective, i.e. downstream task performance. Experiments show that the proposed method can greatly improve content generation quality over a human-engineered probabilistic scene grammar, both qualitatively and quantitatively as measured by performance on a downstream task.

* Webpage: https://nv-tlabs.github.io/meta-sim/

Devil is in the Edges: Learning Semantic Boundaries from Noisy Annotations

Apr 16, 2019

David Acuna, Amlan Kar, Sanja Fidler

Abstract: We tackle the problem of semantic boundary prediction, which aims to identify pixels that belong to object (class) boundaries. We notice that relevant datasets contain a significant level of label noise, reflecting the fact that precise annotations are laborious to get and thus annotators trade off quality for efficiency. We aim to learn sharp and precise semantic boundaries by explicitly reasoning about annotation noise during training. We propose a simple new layer and loss that can be used with existing learning-based boundary detectors. Our layer/loss forces the detector to predict a maximum response along the normal direction at an edge, while also regularizing its direction. We further reason about true object boundaries during training using a level set formulation, which allows the network to learn from misaligned labels in an end-to-end fashion. Experiments show that we improve over the CASENet backbone network by more than 4% in terms of MF(ODS) and 18.61% in terms of AP, outperforming all current state-of-the-art methods including those that deal with alignment. Furthermore, we show that our learned network can be used to significantly improve coarse segmentation labels, lending itself as an efficient way to label new data.

* CVPR 2019
* Accepted as a CVPR 2019 oral paper (Project Page: https://nv-tlabs.github.io/STEAL/)

Action Recognition from Single Timestamp Supervision in Untrimmed Videos

Apr 09, 2019

Davide Moltisanti, Sanja Fidler, Dima Damen

Abstract: Recognising actions in videos relies on labelled supervision during training, typically the start and end times of each action instance. This supervision is not only subjective, but also expensive to acquire. Weak video-level supervision has been successfully exploited for recognition in untrimmed videos; however, it is challenged when the number of different actions in training videos increases. We propose a method that is supervised by single timestamps located around each action instance, in untrimmed videos. We replace expensive action bounds with sampling distributions initialised from these timestamps. We then use the classifier's response to iteratively update the sampling distributions. We demonstrate that these distributions converge to the location and extent of discriminative action segments. We evaluate our method on three datasets for fine-grained recognition, with an increasing number of different actions per video, and show that single timestamps offer a reasonable compromise between recognition performance and labelling effort, performing comparably to full temporal supervision. Our update method improves top-1 test accuracy by up to 5.4% across the evaluated datasets.

* CVPR 2019

Mimicking the In-Camera Color Pipeline for Camera-Aware Object Compositing

Mar 27, 2019

Jun Gao, Xiao Li, Liwei Wang, Sanja Fidler, Stephen Lin

Abstract: We present a method for compositing virtual objects into a photograph such that the object colors appear to have been processed by the photo's camera imaging pipeline. Compositing in such a camera-aware manner is essential for high realism, and it requires the color transformation in the photo's pipeline to be inferred, which is challenging due to the inherent one-to-many mapping that exists from a scene to a photo. To address this problem for the case of a single photo taken from an unknown camera, we propose a dual-learning approach in which the reverse color transformation (from the photo to the scene) is jointly estimated. Learning of the reverse transformation is used to facilitate learning of the forward mapping, by enforcing cycle consistency of the two processes. We additionally employ a feature sharing schema to extract evidence from the target photo in the reverse mapping to guide the forward color transformation. Our dual-learning approach achieves object compositing results that surpass those of alternative techniques.
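
The cycle-consistency constraint mentioned in the abstract can be made concrete with a toy example. The sketch assumes, purely for illustration, that the forward (scene to photo) and reverse (photo to scene) color transformations are per-channel gamma curves; the real pipeline and the learned networks are far richer, and all names here are hypothetical.

```python
def forward_map(scene_rgb, gamma):
    """Toy camera pipeline: scene color -> photo color (gamma encoding)."""
    return tuple(c ** (1.0 / gamma) for c in scene_rgb)

def reverse_map(photo_rgb, gamma_hat):
    """Toy inverse pipeline: photo color -> estimated scene color."""
    return tuple(c ** gamma_hat for c in photo_rgb)

def cycle_loss(scene_rgb, gamma, gamma_hat):
    """Mean squared error between a scene color and its reconstruction
    after passing through the forward and then the reverse mapping."""
    recon = reverse_map(forward_map(scene_rgb, gamma), gamma_hat)
    return sum((a - b) ** 2 for a, b in zip(scene_rgb, recon)) / len(scene_rgb)

scene = (0.2, 0.5, 0.8)
consistent = cycle_loss(scene, gamma=2.2, gamma_hat=2.2)    # mappings agree
inconsistent = cycle_loss(scene, gamma=2.2, gamma_hat=1.0)  # mappings disagree
```

The loss is near zero only when the two mappings invert each other, so penalizing it ties the estimate of the forward transformation to the estimate of the reverse one.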
