Antonio Torralba

Self-Supervised Audio-Visual Co-Segmentation

Apr 18, 2019

Andrew Rouditchenko, Hang Zhao, Chuang Gan, Josh McDermott, Antonio Torralba

Abstract:Segmenting objects in images and separating sound sources in audio are challenging tasks, in part because traditional approaches require large amounts of labeled data. In this paper we develop a neural network model for visual object segmentation and sound source separation that learns from natural videos through self-supervision. The model is an extension of recently proposed work that maps image pixels to sounds. Here, we introduce a learning approach to disentangle concepts in the neural networks, and assign semantic categories to network feature channels to enable independent image segmentation and sound source separation after audio-visual training on videos. Our evaluations show that the disentangled model outperforms several baselines in semantic segmentation and sound source separation.

* Accepted to ICASSP 2019

The Sound of Motions

Apr 11, 2019

Hang Zhao, Chuang Gan, Wei-Chiu Ma, Antonio Torralba

Abstract:Sounds originate from object motions and vibrations of surrounding air. Inspired by the fact that humans are capable of interpreting sound sources from how objects move visually, we propose a novel system that explicitly captures such motion cues for the task of sound localization and separation. Our system is composed of an end-to-end learnable model called Deep Dense Trajectory (DDT), and a curriculum learning scheme. It exploits the inherent coherence of audio-visual signals from large quantities of unlabeled videos. Quantitative and qualitative evaluations show that, compared to previous models that rely on visual appearance cues, our motion-based system improves performance in separating musical instrument sounds. Furthermore, it separates sound components from duets of the same category of instruments, a challenging problem that has not been addressed before.

Visualizing and Understanding Generative Adversarial Networks (Extended Abstract)

Jan 29, 2019

David Bau, Jun-Yan Zhu, Hendrik Strobelt, Bolei Zhou, Joshua B. Tenenbaum, William T. Freeman, Antonio Torralba

Abstract:Generative Adversarial Networks (GANs) have achieved impressive results for many real-world applications. As an active research topic, many GAN variants have emerged with improvements in sample quality and training stability. However, visualization and understanding of GANs are largely missing. How does a GAN represent our visual world internally? What causes the artifacts in GAN results? How do architectural choices affect GAN learning? Answering such questions could enable us to develop new insights and better models. In this work, we present an analytic framework to visualize and understand GANs at the unit-, object-, and scene-level. We first identify a group of interpretable units that are closely related to concepts with a segmentation-based network dissection method. We quantify the causal effect of interpretable units by measuring the ability of interventions to control objects in the output. Finally, we examine the contextual relationship between these units and their surroundings by inserting the discovered concepts into new images. We show several practical applications enabled by our framework, from comparing internal representations across different layers, models, and datasets, to improving GANs by locating and removing artifact-causing units, to interactively manipulating objects in the scene. We will open source our interactive tools to help researchers and practitioners better understand their models.

* In AAAI-19 workshop on Network Interpretability for Deep Learning. arXiv admin note: substantial text overlap with arXiv:1811.10597

GAN Dissection: Visualizing and Understanding Generative Adversarial Networks

Dec 08, 2018

David Bau, Jun-Yan Zhu, Hendrik Strobelt, Bolei Zhou, Joshua B. Tenenbaum, William T. Freeman, Antonio Torralba

Abstract:Generative Adversarial Networks (GANs) have recently achieved impressive results for many real-world applications, and many GAN variants have emerged with improvements in sample quality and training stability. However, they have not been well visualized or understood. How does a GAN represent our visual world internally? What causes the artifacts in GAN results? How do architectural choices affect GAN learning? Answering such questions could enable us to develop new insights and better models. In this work, we present an analytic framework to visualize and understand GANs at the unit-, object-, and scene-level. We first identify a group of interpretable units that are closely related to object concepts using a segmentation-based network dissection method. Then, we quantify the causal effect of interpretable units by measuring the ability of interventions to control objects in the output. We examine the contextual relationship between these units and their surroundings by inserting the discovered object concepts into new images. We show several practical applications enabled by our framework, from comparing internal representations across different layers, models, and datasets, to improving GANs by locating and removing artifact-causing units, to interactively manipulating objects in a scene. We provide open source interpretation tools to help researchers and practitioners better understand their GAN models.

* 18 pages, 19 figures
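
The GAN Dissection abstract mentions matching units to object concepts with a segmentation-based method. One common way to score such a match is the intersection-over-union (IoU) between a thresholded unit activation map and a concept's segmentation mask. The sketch below is only an illustration under that assumption: the function name `unit_concept_iou`, the fixed threshold, and the toy 2x2 maps are hypothetical and are not taken from the paper or its released tools.

```python
def unit_concept_iou(activation, concept_mask, threshold):
    """IoU between pixels where a unit fires and pixels labeled with a concept.

    activation:   2D list of floats (one unit's activation map, hypothetical input)
    concept_mask: 2D list of 0/1 values (segmentation mask for one concept)
    threshold:    activation level above which the unit counts as "on" (assumed)
    """
    intersection = 0
    union = 0
    for act_row, mask_row in zip(activation, concept_mask):
        for act, label in zip(act_row, mask_row):
            unit_on = act > threshold
            concept_on = bool(label)
            intersection += int(unit_on and concept_on)
            union += int(unit_on or concept_on)
    # An empty union means neither the unit nor the concept appears anywhere.
    return intersection / union if union else 0.0


# Toy example: the unit fires on the left column, the concept is on the diagonal.
activation = [[0.9, 0.1],
              [0.8, 0.2]]
concept_mask = [[1, 0],
                [0, 1]]
score = unit_concept_iou(activation, concept_mask, threshold=0.5)
```

In this toy case one pixel is shared and three pixels are covered by either map, so the score is 1/3. A unit would be called interpretable for a concept only if its score clears some cutoff, which is again a choice left to the analyst.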

Visual Object Networks: Image Generation with Disentangled 3D Representation

Dec 06, 2018

Jun-Yan Zhu, Zhoutong Zhang, Chengkai Zhang, Jiajun Wu, Antonio Torralba, Joshua B. Tenenbaum, William T. Freeman

Abstract:Recent progress in deep generative models has led to tremendous breakthroughs in image generation. However, while existing models can synthesize photorealistic images, they lack an understanding of our underlying 3D world. We present a new generative model, Visual Object Networks (VON), synthesizing natural images of objects with a disentangled 3D representation. Inspired by classic graphics rendering pipelines, we unravel our image formation process into three conditionally independent factors (shape, viewpoint, and texture) and present an end-to-end adversarial learning framework that jointly models 3D shapes and 2D images. Our model first learns to synthesize 3D shapes that are indistinguishable from real shapes. It then renders the object's 2.5D sketches (i.e., silhouette and depth map) from its shape under a sampled viewpoint. Finally, it learns to add realistic texture to these 2.5D sketches to generate natural images. The VON not only generates images that are more realistic than state-of-the-art 2D image synthesis methods, but also enables many 3D operations such as changing the viewpoint of a generated image, editing of shape and texture, linear interpolation in texture and shape space, and transferring appearance across different objects and viewpoints.

* NeurIPS 2018. Code: https://github.com/junyanz/VON Website: http://von.csail.mit.edu/

Dataset Distillation

Nov 27, 2018

Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, Alexei A. Efros

Abstract:Model distillation aims to distill the knowledge of a complex model into a simpler one. In this paper, we consider an alternative formulation called dataset distillation: we keep the model fixed and instead attempt to distill the knowledge from a large training dataset into a small one. The idea is to synthesize a small number of data points that do not need to come from the correct data distribution, but will, when given to the learning algorithm as training data, approximate the model trained on the original data. For example, we show that it is possible to compress 60,000 MNIST training images into just 10 synthetic distilled images (one per class) and achieve close to original performance with only a few steps of gradient descent, given a particular fixed network initialization. We evaluate our method in a wide range of initialization settings and with different learning objectives. Experiments on multiple datasets show the advantage of our approach compared to alternative methods in most settings.
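
To make the dataset distillation idea concrete, here is a minimal toy sketch, not the authors' implementation. It assumes a one-parameter linear model y = w * x, a fixed initialization w0, and a single synthetic point (xs, ys) whose label ys is learned so that one gradient step on that point yields a model that fits the real data. All names (`inner_step`, `distill`) and hyperparameters are hypothetical, and the gradient through the training step is derived by hand for this tiny case.

```python
def inner_step(w0, xs, ys, lr):
    """One gradient-descent step on the synthetic point with squared error."""
    grad = 2.0 * (w0 * xs - ys) * xs
    return w0 - lr * grad


def distill(real_x, real_y, w0=0.0, xs=1.0, inner_lr=0.1, outer_lr=1.0, steps=200):
    """Learn the synthetic label ys by descending the real-data loss of the
    model obtained after one inner step from the fixed initialization w0."""
    ys = 0.0
    n = len(real_x)
    for _ in range(steps):
        w1 = inner_step(w0, xs, ys, inner_lr)
        # Gradient of the mean squared error on real data with respect to w1.
        dloss_dw1 = sum(2.0 * (w1 * x - y) * x for x, y in zip(real_x, real_y)) / n
        # Derivative of the inner step with respect to ys (chain rule).
        dw1_dys = 2.0 * inner_lr * xs
        ys -= outer_lr * dloss_dw1 * dw1_dys
    return ys


# Real data follows y = 3x.
real_x = [1.0, 2.0, 3.0]
real_y = [3.0, 6.0, 9.0]
ys = distill(real_x, real_y)
w1 = inner_step(0.0, 1.0, ys, 0.1)
```

With these settings the learned label converges to ys = 15 and the single training step lands on w1 = 3. The synthetic point (1, 15) does not lie on the real line y = 3x, which illustrates the abstract's remark that distilled data need not come from the correct data distribution. The result also depends on the fixed w0 and step size, in line with the abstract's caveat about a particular fixed initialization.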

Semantic Understanding of Scenes through the ADE20K Dataset

Oct 16, 2018

Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, Antonio Torralba

Abstract:Scene parsing, or recognizing and segmenting objects and stuff in an image, is one of the key problems in computer vision. Despite the community's efforts in data collection, there are still few image datasets covering a wide range of scenes and object categories with dense and detailed annotations for scene parsing. In this paper, we introduce and analyze the ADE20K dataset, spanning diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts. A generic network design called Cascade Segmentation Module is then proposed to enable the segmentation networks to parse a scene into stuff, objects, and object parts in a cascade. We evaluate the proposed module integrated within two existing semantic segmentation networks, yielding significant improvements for scene parsing. We further show that the scene parsing networks trained on ADE20K can be applied to a wide variety of scenes and objects.

* IJCV extension

Recipe1M: A Dataset for Learning Cross-Modal Embeddings for Cooking Recipes and Food Images

Oct 14, 2018

Javier Marin, Aritro Biswas, Ferda Ofli, Nicholas Hynes, Amaia Salvador, Yusuf Aytar, Ingmar Weber, Antonio Torralba

Abstract:In this paper, we introduce Recipe1M, a new large-scale, structured corpus of over one million cooking recipes and 13 million food images. As the largest publicly available collection of recipe data, Recipe1M affords the ability to train high-capacity models on aligned, multi-modal data. Using these data, we train a neural network to learn a joint embedding of recipes and images that yields impressive results on an image-recipe retrieval task. Moreover, we demonstrate that regularization via the addition of a high-level classification objective both improves retrieval performance to rival that of humans and enables semantic vector arithmetic. We postulate that these embeddings will provide a basis for further exploration of the Recipe1M dataset and food and cooking in general. Code, data and models are publicly available.

* Submitted to Transactions on Pattern Analysis and Machine Intelligence

The Sound of Pixels

Oct 14, 2018

Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, Antonio Torralba

Abstract:We introduce PixelPlayer, a system that, by leveraging large amounts of unlabeled videos, learns to locate image regions which produce sounds and separate the input sounds into a set of components that represents the sound from each pixel. Our approach capitalizes on the natural synchronization of the visual and audio modalities to learn models that jointly parse sounds and images, without requiring additional manual supervision. Experimental results on a newly collected MUSIC dataset show that our proposed Mix-and-Separate framework outperforms several baselines on source separation. Qualitative results suggest our model learns to ground sounds in vision, enabling applications such as independently adjusting the volume of sound sources.
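
The abstract names a Mix-and-Separate framework without describing it. The name suggests that sounds from different clips are mixed so that the original sounds can serve as free separation targets. The sketch below illustrates only that general idea under stated assumptions: magnitude spectrograms are summed to approximate the mixture, and binary masks mark which source dominates each time-frequency bin. The function name `mix_and_masks`, the additive approximation, and the binary mask targets are all assumptions for illustration, not details confirmed by the abstract.

```python
def mix_and_masks(spec_a, spec_b):
    """Build a synthetic mixture and per-source binary mask targets.

    spec_a, spec_b: 2D lists of non-negative magnitudes with the same shape
    (hypothetical spectrograms taken from two different unlabeled videos).
    """
    # Assumption: the mixture magnitude is approximated by the sum of magnitudes.
    mixture = [[a + b for a, b in zip(row_a, row_b)]
               for row_a, row_b in zip(spec_a, spec_b)]
    # Source A owns a bin when it is at least as loud as source B there.
    mask_a = [[1.0 if a >= b else 0.0 for a, b in zip(row_a, row_b)]
              for row_a, row_b in zip(spec_a, spec_b)]
    mask_b = [[1.0 - m for m in row] for row in mask_a]
    return mixture, mask_a, mask_b


def apply_mask(mixture, mask):
    """Element-wise product of a mixture spectrogram and a mask."""
    return [[x * m for x, m in zip(row_x, row_m)]
            for row_x, row_m in zip(mixture, mask)]


spec_a = [[3.0, 0.0], [1.0, 2.0]]
spec_b = [[1.0, 4.0], [1.0, 0.5]]
mixture, mask_a, mask_b = mix_and_masks(spec_a, spec_b)
estimate_a = apply_mask(mixture, mask_a)
```

In a training setup built on this idea, a network would see the mixture together with visual features from one of the clips and be asked to predict that clip's mask. No manual labels are needed because the targets come from the mixing step itself.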

3D-Aware Scene Manipulation via Inverse Graphics

Oct 11, 2018

Shunyu Yao, Tzu Ming Harry Hsu, Jun-Yan Zhu, Jiajun Wu, Antonio Torralba, William T. Freeman, Joshua B. Tenenbaum

Abstract:We aim to obtain an interpretable, expressive, and disentangled scene representation that contains comprehensive structural and textural information for each object. Previous scene representations learned by neural networks are often uninterpretable, limited to a single object, or lacking 3D knowledge. In this work, we propose 3D scene de-rendering networks (3D-SDN) to address the above issues by integrating disentangled representations for semantics, geometry, and appearance into a deep generative model. Our scene encoder performs inverse graphics, translating a scene into a structured object-wise representation. Our decoder has two components: a differentiable shape renderer and a neural texture generator. The disentanglement of semantics, geometry, and appearance supports 3D-aware scene manipulation, e.g., rotating and moving objects freely while keeping the consistent shape and texture, and changing the object appearance without affecting its shape. Experiments demonstrate that our editing scheme based on 3D-SDN is superior to its 2D counterpart.

* NIPS 2018
