Sanja Fidler

NVIDIA, University of Toronto, Vector Institute

Learning to Simulate Dynamic Environments with GameGAN

May 25, 2020

Seung Wook Kim, Yuhao Zhou, Jonah Philion, Antonio Torralba, Sanja Fidler

Abstract:Simulation is a crucial component of any robotic system. In order to simulate correctly, we need to write complex rules of the environment: how dynamic agents behave, and how the actions of each of the agents affect the behavior of others. In this paper, we aim to learn a simulator by simply watching an agent interact with an environment. We focus on graphics games as a proxy of the real environment. We introduce GameGAN, a generative model that learns to visually imitate a desired game by ingesting screenplay and keyboard actions during training. Given a key pressed by the agent, GameGAN "renders" the next screen using a carefully designed generative adversarial network. Our approach offers key advantages over existing work: we design a memory module that builds an internal map of the environment, allowing for the agent to return to previously visited locations with high visual consistency. In addition, GameGAN is able to disentangle static and dynamic components within an image making the behavior of the model more interpretable, and relevant for downstream tasks that require explicit reasoning over dynamic elements. This enables many interesting applications such as swapping different components of the game to build new games that do not exist.

* CVPR 2020

The EPIC-KITCHENS Dataset: Collection, Challenges and Baselines

Apr 29, 2020

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price(+1 more)

Abstract:Since its introduction in 2018, EPIC-KITCHENS has attracted attention as the largest egocentric video benchmark, offering a unique viewpoint on people's interaction with objects, their attention, and even intention. In this paper, we detail how this large-scale dataset was captured by 32 participants in their native kitchen environments, and densely annotated with actions and object interactions. Our videos depict nonscripted daily activities, as recording started every time a participant entered their kitchen. Recording took place in 4 countries by participants belonging to 10 different nationalities, resulting in highly diverse kitchen habits and cooking styles. Our dataset features 55 hours of video consisting of 11.5M frames, which we densely labelled for a total of 39.6K action segments and 454.2K object bounding boxes. Our annotation is unique in that we had the participants narrate their own videos after recording, thus reflecting true intention, and we crowd-sourced ground-truths based on these. We describe our object, action and anticipation challenges, and evaluate several baselines over two test splits, seen and unseen kitchens. We introduce new baselines that highlight the multimodal nature of the dataset and the importance of explicit temporal modelling to discriminate fine-grained actions, e.g. 'closing a tap' from 'opening' it up.

* Preprint for paper at IEEE TPAMI. arXiv admin note: substantial text overlap with arXiv:1804.02748

Learning to Evaluate Perception Models Using Planner-Centric Metrics

Apr 19, 2020

Jonah Philion, Amlan Kar, Sanja Fidler

Abstract:Variants of accuracy and precision are the gold-standard by which the computer vision community measures progress of perception algorithms. One reason for the ubiquity of these metrics is that they are largely task-agnostic; we in general seek to detect zero false negatives or positives. The downside of these metrics is that, at worst, they penalize all incorrect detections equally without conditioning on the task or scene, and at best, heuristics need to be chosen to ensure that different mistakes count differently. In this paper, we propose a principled metric for 3D object detection specifically for the task of self-driving. The core idea behind our metric is to isolate the task of object detection and measure the impact the produced detections would induce on the downstream task of driving. Without hand-designing it to, we find that our metric penalizes many of the mistakes that other metrics penalize by design. In addition, our metric downweighs detections based on additional factors such as distance from a detection to the ego car and the speed of the detection in intuitive ways that other detection metrics do not. For human evaluation, we generate scenes in which standard metrics and our metric disagree and find that humans side with our metric 79% of the time. Our project page including an evaluation server can be found at https://nv-tlabs.github.io/detection-relevance.

* CVPR 2020 poster

Neural Data Server: A Large-Scale Search Engine for Transfer Learning Data

Jan 09, 2020

Xi Yan, David Acuna, Sanja Fidler

Abstract:Transfer learning has proven to be a successful technique to train deep learning models in domains where little training data is available. The dominant approach is to pretrain a model on a large generic dataset such as ImageNet and finetune its weights on the target domain. However, in the new era of an ever-increasing number of massive datasets, selecting the relevant data for pretraining is a critical issue. We introduce Neural Data Server (NDS), a large-scale search engine for finding the most useful transfer learning data for the target domain. Our NDS consists of a dataserver which indexes several large popular image datasets, and aims to recommend data to a client, an end-user with a target application with its own small labeled dataset. As in any search engine that serves information to possibly numerous users, we want the online computation performed by the dataserver to be minimal. The dataserver represents large datasets with a much more compact mixture-of-experts model, and employs it to perform data search in a series of dataserver-client transactions at a low computational cost. We show the effectiveness of NDS in various transfer learning scenarios, demonstrating state-of-the-art performance on several target datasets and tasks such as image classification, object detection and instance segmentation. Our Neural Data Server is available as a web-service at http://aidemos.cs.toronto.edu/nds/, recommending data to users with the aim to improve performance of their A.I. application.

The Shmoop Corpus: A Dataset of Stories with Loosely Aligned Summaries

Jan 01, 2020

Atef Chaudhury, Makarand Tapaswi, Seung Wook Kim, Sanja Fidler

Abstract:Understanding stories is a challenging reading comprehension problem for machines as it requires reading a large volume of text and following long-range dependencies. In this paper, we introduce the Shmoop Corpus: a dataset of 231 stories that are paired with detailed multi-paragraph summaries for each individual chapter (7,234 chapters), where the summary is chronologically aligned with respect to the story chapter. From the corpus, we construct a set of common NLP tasks, including Cloze-form question answering and a simplified form of abstractive summarization, as benchmarks for reading comprehension on stories. We then show that the chronological alignment provides a strong supervisory signal that learning-based methods can exploit leading to significant improvements on these tasks. We believe that the unique structure of this corpus provides an important foothold towards making machine story comprehension more approachable.

* Project page: http://www.cs.toronto.edu/~makarand/shmoop/ Dataset at: https://github.com/achaudhury/shmoop-corpus/

Kaolin: A PyTorch Library for Accelerating 3D Deep Learning Research

Nov 13, 2019

Krishna Murthy Jatavallabhula, Edward Smith, Jean-Francois Lafleche, Clement Fuji Tsang, Artem Rozantsev, Wenzheng Chen, Tommy Xiang, Rev Lebaredian, Sanja Fidler

Abstract:We present Kaolin, a PyTorch library aiming to accelerate 3D deep learning research. Kaolin provides efficient implementations of differentiable 3D modules for use in deep learning systems. With functionality to load and preprocess several popular 3D datasets, and native functions to manipulate meshes, pointclouds, signed distance functions, and voxel grids, Kaolin mitigates the need to write wasteful boilerplate code. Kaolin packages together several differentiable graphics modules including rendering, lighting, shading, and view warping. Kaolin also supports an array of loss functions and evaluation metrics for seamless evaluation and provides visualization functionality to render the 3D results. Importantly, we curate a comprehensive model zoo comprising many state-of-the-art 3D deep learning architectures, to serve as a starting point for future research endeavours. Kaolin is available as open-source software at https://github.com/NVIDIAGameWorks/kaolin/.

* Kaolin is available as open-source software at https://github.com/NVIDIAGameWorks/kaolin/

CrevNet: Conditionally Reversible Video Prediction

Oct 25, 2019

Wei Yu, Yichao Lu, Steve Easterbrook, Sanja Fidler

Abstract:Applying resolution-preserving blocks is a common practice to maximize information preservation in video prediction, yet their high memory consumption greatly limits their application scenarios. We propose CrevNet, a Conditionally Reversible Network that uses reversible architectures to build a bijective two-way autoencoder and its complementary recurrent predictor. Our model enjoys the theoretically guaranteed property of no information loss during feature extraction, as well as much lower memory consumption and better computational efficiency.
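
The "no information loss" property of a reversible architecture can be illustrated with a generic additive coupling block. This is only a minimal sketch of the general idea, not CrevNet's actual layers: the functions `f` and `g` are arbitrary hypothetical stand-ins for learned sub-networks, and the two-part input split is an assumption made for illustration.

```python
import math

def f(x):
    # Hypothetical stand-in for a learned sub-network; it does not need
    # to be invertible for the block as a whole to be invertible.
    return [2.0 * math.tanh(v) for v in x]

def g(x):
    # Second hypothetical stand-in sub-network.
    return [v * v + 1.0 for v in x]

def forward(x1, x2):
    # Additive coupling: each half is shifted by a function of the other.
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2

def inverse(y1, y2):
    # Undo the two shifts in reverse order to recover the input.
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2

x1, x2 = [0.5, -1.0], [2.0, 0.25]
y1, y2 = forward(x1, x2)
r1, r2 = inverse(y1, y2)
```

Because the input can be recomputed from the output (up to floating-point rounding), intermediate activations need not be stored, which is the usual source of memory savings in reversible networks.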

Neural Turtle Graphics for Modeling City Road Layouts

Oct 04, 2019

Hang Chu, Daiqing Li, David Acuna, Amlan Kar, Maria Shugrina, Xinkai Wei, Ming-Yu Liu, Antonio Torralba, Sanja Fidler

Abstract:We propose Neural Turtle Graphics (NTG), a novel generative model for spatial graphs, and demonstrate its applications in modeling city road layouts. Specifically, we represent the road layout using a graph where nodes in the graph represent control points and edges in the graph represent road segments. NTG is a sequential generative model parameterized by a neural network. It iteratively generates a new node and an edge connecting to an existing node conditioned on the current graph. We train NTG on Open Street Map data and show that it outperforms existing approaches using a set of diverse performance metrics. Moreover, our method allows users to control styles of generated road layouts mimicking existing cities as well as to sketch parts of the city road layout to be synthesized. In addition to synthesis, the proposed NTG finds uses in an analytical task of aerial road parsing. Experimental results show that it achieves state-of-the-art performance on the SpaceNet dataset.

* ICCV-2019 Oral

DMM-Net: Differentiable Mask-Matching Network for Video Object Segmentation

Sep 27, 2019

Xiaohui Zeng, Renjie Liao, Li Gu, Yuwen Xiong, Sanja Fidler, Raquel Urtasun

Abstract:In this paper, we propose the differentiable mask-matching network (DMM-Net) for solving the video object segmentation problem where the initial object masks are provided. Relying on the Mask R-CNN backbone, we extract mask proposals per frame and formulate the matching between object templates and proposals at one time step as a linear assignment problem where the cost matrix is predicted by a CNN. We propose a differentiable matching layer by unrolling a projected gradient descent algorithm in which the projection exploits Dykstra's algorithm. We prove that under mild conditions, the matching is guaranteed to converge to the optimum. In practice, it performs similarly to the Hungarian algorithm during inference. Meanwhile, we can back-propagate through it to learn the cost matrix. After matching, a refinement head is leveraged to improve the quality of the matched mask. Our DMM-Net achieves competitive results on the largest video object segmentation dataset, YouTube-VOS. On DAVIS 2017, DMM-Net achieves the best performance without online learning on the first frames. Without any fine-tuning, DMM-Net performs comparably to state-of-the-art methods on the SegTrack v2 dataset. Finally, our matching layer is very simple to implement; we attach the PyTorch code ($<50$ lines) in the supplementary material. Our code is released at https://github.com/ZENGXH/DMM_Net.

* ICCV 2019
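
The matching step described in the abstract, projected gradient descent with a Dykstra projection, can be sketched in plain Python. This is a forward-only toy under stated assumptions, not the authors' PyTorch layer: it assumes a square cost matrix, a feasible set of doubly stochastic matrices (rows and columns sum to 1, entries non-negative), and hypothetical values for `steps`, `lr` and `cycles`.

```python
def project_rows(x):
    # Euclidean projection onto the affine set {every row sums to 1}.
    n = len(x[0])
    return [[v - (sum(row) - 1.0) / n for v in row] for row in x]

def project_cols(x):
    # Same projection applied to columns, via a transpose.
    t = project_rows([list(c) for c in zip(*x)])
    return [list(r) for r in zip(*t)]

def project_nonneg(x):
    # Euclidean projection onto the non-negative orthant.
    return [[max(0.0, v) for v in row] for row in x]

def add(a, b):
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]

def sub(a, b):
    return [[u - v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]

def dykstra(x, cycles=50):
    # Dykstra's algorithm: cycle through the sets, keeping one
    # correction term per set, to approach the projection onto
    # their intersection.
    projections = (project_rows, project_cols, project_nonneg)
    corrections = [[[0.0] * len(x[0]) for _ in x] for _ in projections]
    for _ in range(cycles):
        for i, proj in enumerate(projections):
            z = add(x, corrections[i])
            x = proj(z)
            corrections[i] = sub(z, x)
    return x

def match(cost, steps=100, lr=0.2):
    # Projected gradient descent on the linear objective <cost, X>;
    # the gradient with respect to X is the cost matrix itself.
    n = len(cost)
    x = [[1.0 / n] * n for _ in range(n)]
    for _ in range(steps):
        moved = [[x[i][j] - lr * cost[i][j] for j in range(n)] for i in range(n)]
        x = dykstra(moved)
    return x

def assignment(x):
    # Round the soft matching by taking the largest entry in each row.
    return [max(range(len(row)), key=row.__getitem__) for row in x]

cost = [[9.0, 1.0, 9.0], [1.0, 9.0, 9.0], [9.0, 9.0, 1.0]]
rows_to_cols = assignment(match(cost))
```

Every operation here is a sum, difference or clamp, which is why a layer of this kind can be unrolled and differentiated to learn the cost matrix; the sketch itself computes no gradients.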

A Theoretical Analysis of the Number of Shots in Few-Shot Learning

Sep 25, 2019

Tianshi Cao, Marc Law, Sanja Fidler

Abstract:Few-shot classification is the task of predicting the category of an example from a set of few labeled examples. The number of labeled examples per category is called the number of shots (or shot number). Recent works tackle this task through meta-learning, where a meta-learner extracts information from observed tasks during meta-training to quickly adapt to new tasks during meta-testing. In this formulation, the number of shots exploited during meta-training has an impact on the recognition performance at meta-test time. Generally, the shot number used in meta-training should match the one used in meta-testing to obtain the best performance. We introduce a theoretical analysis of the impact of the shot number on Prototypical Networks, a state-of-the-art few-shot classification method. From our analysis, we propose a simple method that is robust to the choice of shot number used during meta-training, which is a crucial hyperparameter. Our model, trained for an arbitrary meta-training shot number, performs well across different meta-testing shot numbers. We experimentally demonstrate our approach on different few-shot classification benchmarks.

* 15 pages incl. appendix, 6 figures
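
For readers unfamiliar with the terms, the classification rule of Prototypical Networks and the meaning of "shot number" can be sketched as follows. This is a minimal illustration, not the method proposed in the paper: the learned embedding network is omitted, and the 2-D vectors and labels below are made-up stand-ins for embedded support examples.

```python
import math

def prototypes(support):
    # One prototype per class: the mean of that class's support embeddings.
    protos = {}
    for label, embeddings in support.items():
        dim = len(embeddings[0])
        protos[label] = [
            sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)
        ]
    return protos

def classify(query, protos):
    # Assign the query to the class whose prototype is nearest
    # in Euclidean distance.
    return min(protos, key=lambda label: math.dist(query, protos[label]))

# A 2-way, 2-shot episode: two labeled examples (shots) per class.
support = {
    "cat": [[0.0, 0.0], [0.0, 2.0]],
    "dog": [[4.0, 4.0], [6.0, 4.0]],
}
protos = prototypes(support)
label_a = classify([1.0, 1.0], protos)
label_b = classify([4.0, 5.0], protos)
```

The shot number is the length of each list in `support`; more shots average out more noise in each prototype, which gives some intuition for why performance can depend on the shot number used during meta-training.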
