Antonio Torralba

Virtual Correspondence: Humans as a Cue for Extreme-View Geometry

Jun 16, 2022

Wei-Chiu Ma, Anqi Joyce Yang, Shenlong Wang, Raquel Urtasun, Antonio Torralba

Abstract: Recovering the spatial layout of the cameras and the geometry of the scene from extreme-view images is a longstanding challenge in computer vision. Prevailing 3D reconstruction algorithms often adopt the image matching paradigm and presume that a portion of the scene is co-visible across images, yielding poor performance when there is little overlap among inputs. In contrast, humans can associate visible parts in one image with the corresponding invisible components in another image via prior knowledge of the shapes. Inspired by this fact, we present a novel concept called virtual correspondences (VCs). A VC is a pair of pixels from two images whose camera rays intersect in 3D. Similar to classic correspondences, VCs conform with epipolar geometry; unlike classic correspondences, VCs do not need to be co-visible across views. Therefore, VCs can be established and exploited even if images do not overlap. We introduce a method to find virtual correspondences based on humans in the scene. We showcase how VCs can be seamlessly integrated with classic bundle adjustment to recover camera poses across extreme views. Experiments show that our method significantly outperforms state-of-the-art camera pose estimation methods in challenging scenarios and is comparable in the traditional densely captured setup. Our approach also unleashes the potential of multiple downstream tasks such as scene reconstruction from multi-view stereo and novel view synthesis in extreme-view scenarios.
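
The statement that VCs conform with epipolar geometry can be checked numerically. The sketch below is only an illustration, not the authors' code: it assumes the convention that a point X1 in camera-1 coordinates maps to camera 2 as X2 = R X1 + t, and the camera layout, the 3D point, and all function names are hypothetical. Two rays that pass through the same 3D point satisfy ray2^T E ray1 = 0 with E = [t]_x R, whether or not that point is visible in both images.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def dot(a, b):
    """Dot product of two vectors."""
    return sum(x * y for x, y in zip(a, b))


def matvec(M, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [dot(row, v) for row in M]


def epipolar_residual(ray1, ray2, R, t):
    """Return ray2 . (t x (R ray1)), i.e. ray2^T E ray1 with E = [t]_x R.

    Assumed convention: X2 = R X1 + t maps camera-1 to camera-2 coordinates.
    """
    return dot(ray2, cross(t, matvec(R, ray1)))


# Hypothetical extreme-view setup: camera 2 sits 6 units in front of camera 1
# and is rotated 180 degrees about the y-axis, so the two cameras face each other.
R = [[-1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, -1.0]]
t = [0.0, 0.0, 6.0]

# A 3D point between the cameras, expressed in camera-1 coordinates.
P1 = [0.3, -0.2, 3.0]
P2 = [a + b for a, b in zip(matvec(R, P1), t)]

# Rays from each camera centre through the point (normalised so that z = 1).
ray1 = [c / P1[2] for c in P1]
ray2 = [c / P2[2] for c in P2]

# Intersecting rays give a residual of (numerically) zero.
residual_vc = epipolar_residual(ray1, ray2, R, t)
# An unrelated ray in camera 2 does not intersect ray1 and violates the constraint.
residual_other = epipolar_residual(ray1, [0.5, -0.2, 1.0], R, t)
```

Because the constraint depends only on the two rays and the relative pose, a pose estimator could in principle consume such pairs in the same way as classic matches.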

* CVPR 2022. Project page: https://people.csail.mit.edu/weichium/virtual-correspondence/

Disentangling visual and written concepts in CLIP

Jun 15, 2022

Joanna Materzynska, Antonio Torralba, David Bau

Abstract: The CLIP network measures the similarity between natural text and images; in this work, we investigate the entanglement of the representation of word images and natural images in its image encoder. First, we find that the image encoder is able to match word images with natural images of scenes described by those words. This is consistent with previous research that suggests that the meaning and the spelling of a word might be entangled deep within the network. On the other hand, we also find that CLIP has a strong ability to match nonsense words, suggesting that processing of letters is separated from processing of their meaning. To explicitly determine whether the spelling capability of CLIP is separable, we devise a procedure for identifying representation subspaces that selectively isolate or eliminate spelling capabilities. We benchmark our methods against a range of retrieval tasks, and we also test them by measuring the appearance of text in CLIP-guided generated images. We find that our methods are able to cleanly separate spelling capabilities of CLIP from the visual processing of natural images.
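
To make the idea of a subspace that isolates or eliminates a capability concrete, here is a minimal linear-algebra sketch. It is not the procedure from the paper: it assumes the subspace is already known and given by an orthonormal basis, and the basis, the embedding, and the function names are all made up for illustration. Projecting onto the subspace keeps only the component that lies in it; subtracting that projection removes it.

```python
def dot(a, b):
    """Dot product of two vectors."""
    return sum(x * y for x, y in zip(a, b))


def isolate(embedding, basis):
    """Project `embedding` onto span(basis).

    `basis` is assumed to be a list of orthonormal vectors (hypothetical here).
    """
    out = [0.0] * len(embedding)
    for b in basis:
        coeff = dot(embedding, b)
        for k in range(len(out)):
            out[k] += coeff * b[k]
    return out


def eliminate(embedding, basis):
    """Remove the component of `embedding` that lies in span(basis)."""
    projected = isolate(embedding, basis)
    return [x - p for x, p in zip(embedding, projected)]


# Toy 3-D "embedding" and a one-dimensional stand-in for a spelling subspace.
embedding = [3.0, 4.0, 5.0]
spelling_basis = [[1.0, 0.0, 0.0]]

spelling_part = isolate(embedding, spelling_basis)    # [3.0, 0.0, 0.0]
visual_part = eliminate(embedding, spelling_basis)    # [0.0, 4.0, 5.0]
```

The two parts sum back to the original embedding, which is what makes the split a clean separation in this toy setting.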

Compositional Visual Generation with Composable Diffusion Models

Jun 08, 2022

Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, Joshua B. Tenenbaum

Abstract: Large text-guided diffusion models, such as DALLE-2, are able to generate stunning photorealistic images given natural language descriptions. While such models are highly flexible, they struggle to understand the composition of certain concepts, for example by confusing the attributes of different objects or the relations between objects. In this paper, we propose an alternative structured approach for compositional generation using diffusion models. An image is generated by composing a set of diffusion models, with each of them modeling a certain component of the image. To do this, we interpret diffusion models as energy-based models in which the data distributions defined by the energy functions may be explicitly combined. The proposed method can generate scenes at test time that are substantially more complex than those seen in training, composing sentence descriptions, object relations, human facial attributes, and even generalizing to new combinations that are rarely seen in the real world. We further illustrate how our approach may be used to compose pre-trained text-guided diffusion models and generate photorealistic images containing all the details described in the input descriptions, including the binding of certain object attributes that have been shown difficult for DALLE-2. These results point to the effectiveness of the proposed method in promoting structured generalization for visual generation.
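
The remark that distributions defined by energy functions may be explicitly combined rests on a simple identity: multiplying densities p_i(x) ∝ exp(-E_i(x)) adds their energies, so the gradients add as well. The toy below is a sketch under that assumption only, with two hypothetical 1-D quadratic energies and plain gradient descent standing in for a sampler; it is not the authors' sampling procedure.

```python
def quadratic_energy_grad(center):
    """Gradient of the toy energy E(x) = 0.5 * (x - center)**2."""
    return lambda x: x - center


def descend_on_composed_energy(grads, x0, step=0.1, n_steps=500):
    """Gradient descent on the sum of energies (product of densities).

    `grads` is a list of gradient functions, one per composed concept.
    """
    x = x0
    for _ in range(n_steps):
        x -= step * sum(g(x) for g in grads)
    return x


# Two hypothetical "concepts", each preferring a different location.
concept_a = quadratic_energy_grad(-1.0)
concept_b = quadratic_energy_grad(3.0)

# The composed energy is minimised where both concepts are jointly most likely.
x_star = descend_on_composed_energy([concept_a, concept_b], x0=10.0)
```

For these two energies the composed minimum lies at the midpoint of the two centres, x = 1. In a diffusion model the analogous sum would presumably be applied to the per-concept score or noise predictions at each denoising step.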

* Project website: https://energy-based-model.github.io/Compositional-Visual-Generation-with-Composable-Diffusion-Models/ (first three authors contributed equally)

Polymorphic-GAN: Generating Aligned Samples across Multiple Domains with Learned Morph Maps

Jun 06, 2022

Seung Wook Kim, Karsten Kreis, Daiqing Li, Antonio Torralba, Sanja Fidler

Abstract: Modern image generative models show remarkable sample quality when trained on a single domain or class of objects. In this work, we introduce a generative adversarial network that can simultaneously generate aligned image samples from multiple related domains. We leverage the fact that a variety of object classes share common attributes, with certain geometric differences. We propose Polymorphic-GAN, which learns shared features across all domains and a per-domain morph layer to morph shared features according to each domain. In contrast to previous works, our framework allows simultaneous modelling of images with highly varying geometries, such as images of human faces, painted and artistic faces, as well as multiple different animal faces. We demonstrate that our model produces aligned samples for all domains and show how it can be used for applications such as segmentation transfer and cross-domain image editing, as well as training in low-data regimes. Additionally, we apply our Polymorphic-GAN on image-to-image translation tasks and show that we can greatly surpass previous approaches in cases where the geometric differences between domains are large.

* CVPR 2022 Oral

Fixing Malfunctional Objects With Learned Physical Simulation and Functional Prediction

May 05, 2022

Yining Hong, Kaichun Mo, Li Yi, Leonidas J. Guibas, Antonio Torralba, Joshua B. Tenenbaum, Chuang Gan

Abstract: This paper studies the problem of fixing malfunctional 3D objects. While previous works focus on building passive perception models to learn the functionality from static 3D objects, we argue that functionality is reckoned with respect to the physical interactions between the object and the user. Given a malfunctional object, humans can perform mental simulations to reason about its functionality and figure out how to fix it. Inspired by this, we propose FixIt, a dataset that contains about 5k poorly designed 3D physical objects paired with choices to fix them. To mimic humans' mental simulation process, we present FixNet, a novel framework that seamlessly incorporates perception and physical dynamics. Specifically, FixNet consists of a perception module to extract the structured representation from the 3D point cloud, a physical dynamics prediction module to simulate the results of interactions on 3D objects, and a functionality prediction module to evaluate the functionality and choose the correct fix. Experimental results show that our framework outperforms baseline models by a large margin, and can generalize well to objects with similar interaction types.

* CVPR 2022. Project page: http://fixing-malfunctional.csail.mit.edu

ComPhy: Compositional Physical Reasoning of Objects and Events from Videos

May 02, 2022

Zhenfang Chen, Kexin Yi, Yunzhu Li, Mingyu Ding, Antonio Torralba, Joshua B. Tenenbaum, Chuang Gan

Abstract: Objects' motions in nature are governed by complex interactions and their properties. While some properties, such as shape and material, can be identified via the object's visual appearances, others like mass and electric charge are not directly visible. The compositionality between the visible and hidden properties poses unique challenges for AI models to reason about the physical world, whereas humans can effortlessly infer them with limited observations. Existing studies on video reasoning mainly focus on visually observable elements such as object appearance, movement, and contact interaction. In this paper, we take an initial step to highlight the importance of inferring the hidden physical properties not directly observable from visual appearances, by introducing the Compositional Physical Reasoning (ComPhy) dataset. For a given set of objects, ComPhy includes a few videos of them moving and interacting under different initial conditions. The model is evaluated based on its capability to unravel the compositional hidden properties, such as mass and charge, and use this knowledge to answer a set of questions posed about one of the videos. Evaluation results of several state-of-the-art video reasoning models on ComPhy show unsatisfactory performance as they fail to capture these hidden properties. We further propose an oracle neural-symbolic framework named Compositional Physics Learner (CPL), combining visual perception, physical property learning, dynamic prediction, and symbolic execution into a unified framework. CPL can effectively identify objects' physical properties from their interactions and predict their dynamics to answer questions.

* ICLR 2022. Project page: https://comphyreasoning.github.io/

Correcting Robot Plans with Natural Language Feedback

Apr 11, 2022

Pratyusha Sharma, Balakumar Sundaralingam, Valts Blukis, Chris Paxton, Tucker Hermans, Antonio Torralba, Jacob Andreas, Dieter Fox

Abstract: When humans design cost or goal specifications for robots, they often produce specifications that are ambiguous, underspecified, or beyond planners' ability to solve. In these cases, corrections provide a valuable tool for human-in-the-loop robot control. Corrections might take the form of new goal specifications, new constraints (e.g. to avoid specific objects), or hints for planning algorithms (e.g. to visit specific waypoints). Existing correction methods (e.g. using a joystick or direct manipulation of an end effector) require full teleoperation or real-time interaction. In this paper, we explore natural language as an expressive and flexible tool for robot correction. We describe how to map from natural language sentences to transformations of cost functions. We show that these transformations enable users to correct goals, update robot motions to accommodate additional user preferences, and recover from planning errors. These corrections can be leveraged to get 81% and 93% success rates on tasks where the original planner failed, with either one or two language corrections. Our method makes it possible to compose multiple constraints and generalizes to unseen scenes, objects, and sentences in simulated environments and real-world environments.
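
What a transformation of a cost function might look like can be sketched with plain closures. Everything below is hypothetical and hand-written: the paper maps sentences to such transformations with a learned model, whereas here the correction "stay away from the obstacle" is hard-coded as a wrapper that adds a penalty term to an existing planar goal cost.

```python
import math


def goal_cost(goal):
    """Base cost: Euclidean distance from a 2-D position to the goal."""
    return lambda position: math.dist(position, goal)


def add_avoidance(cost, obstacle, radius, weight):
    """Transform `cost` by adding a penalty inside `radius` of `obstacle`.

    Stands in for the effect of a correction such as "avoid the obstacle";
    the penalty shape and weight are illustrative assumptions.
    """
    def corrected(position):
        intrusion = max(0.0, radius - math.dist(position, obstacle))
        return cost(position) + weight * intrusion
    return corrected


base = goal_cost((4.0, 0.0))
corrected = add_avoidance(base, obstacle=(2.0, 0.0), radius=1.0, weight=10.0)

on_obstacle = (2.0, 0.0)      # inside the penalty region
around_obstacle = (2.0, 1.5)  # outside the penalty region
```

Because each correction returns a new cost function, several of them can be stacked by wrapping repeatedly, which mirrors the composition of multiple constraints mentioned in the abstract.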

* 10 pages, 13 figures

Learning Neural Acoustic Fields

Apr 04, 2022

Andrew Luo, Yilun Du, Michael J. Tarr, Joshua B. Tenenbaum, Antonio Torralba, Chuang Gan

Abstract: Our environment is filled with rich and dynamic acoustic information. When we walk into a cathedral, the reverberations, as much as the appearance, inform us of the sanctuary's wide open space. Similarly, as an object moves around us, we expect the sound emitted to also exhibit this movement. While recent advances in learned implicit functions have led to increasingly high-quality representations of the visual world, there have not been commensurate advances in learning spatial auditory representations. To address this gap, we introduce Neural Acoustic Fields (NAFs), an implicit representation that captures how sounds propagate in a physical scene. By modeling acoustic propagation in a scene as a linear time-invariant system, NAFs learn to continuously map all emitter and listener location pairs to a neural impulse response function that can then be applied to arbitrary sounds. We demonstrate that the continuous nature of NAFs enables us to render spatial acoustics for a listener at an arbitrary location, and can predict sound propagation at novel locations. We further show that the representation learned by NAFs can help improve visual learning with sparse views. Finally, we show that a representation informative of scene structure emerges during the learning of NAFs.
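
Treating propagation as a linear time-invariant system means that the sound heard at the listener is the emitted sound convolved with the impulse response for that emitter and listener pair. The sketch below shows only this last step, with a hand-written impulse response standing in for the output of a trained network; the sample values and function name are illustrative assumptions.

```python
def apply_impulse_response(signal, impulse_response):
    """Discrete convolution of a dry signal with an impulse response."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


# Hypothetical impulse response: the direct path plus one echo that arrives
# two samples later at a quarter of the amplitude.
impulse_response = [1.0, 0.0, 0.25]

# A short "dry" sound emitted at the source.
dry = [1.0, 0.0, 0.5]

wet = apply_impulse_response(dry, impulse_response)
# wet == [1.0, 0.0, 0.75, 0.0, 0.125]
```

Since the operation is linear, the same impulse response can be reused for any emitted sound, which is why a single prediction per location pair suffices.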

* Project page: https://www.andrew.cmu.edu/user/afluo/Neural_Acoustic_Fields/

Learning Program Representations for Food Images and Cooking Recipes

Mar 30, 2022

Dim P. Papadopoulos, Enrique Mora, Nadiia Chepurko, Kuan Wei Huang, Ferda Ofli, Antonio Torralba

Abstract: In this paper, we are interested in modeling a how-to instructional procedure, such as a cooking recipe, with a meaningful and rich high-level representation. Specifically, we propose to represent cooking recipes and food images as cooking programs. Programs provide a structured representation of the task, capturing cooking semantics and sequential relationships of actions in the form of a graph. This allows them to be easily manipulated by users and executed by agents. To this end, we build a model that is trained to learn a joint embedding between recipes and food images via self-supervision and jointly generate a program from this embedding as a sequence. To validate our idea, we crowdsource programs for cooking recipes and show that: (a) projecting the image-recipe embeddings into programs leads to better cross-modal retrieval results; (b) generating programs from images leads to better recognition results compared to predicting raw cooking instructions; and (c) we can generate food images by manipulating programs via optimizing the latent code of a GAN. Code, data, and models are available online.

* CVPR 2022 Oral

Dataset Distillation by Matching Training Trajectories

Mar 22, 2022

George Cazenavette, Tongzhou Wang, Antonio Torralba, Alexei A. Efros, Jun-Yan Zhu

Abstract: Dataset distillation is the task of synthesizing a small dataset such that a model trained on the synthetic set will match the test accuracy of the model trained on the full dataset. In this paper, we propose a new formulation that optimizes our distilled data to guide networks to a similar state as those trained on real data across many training steps. Given a network, we train it for several iterations on our distilled data and optimize the distilled data with respect to the distance between the synthetically trained parameters and the parameters trained on real data. To efficiently obtain the initial and target network parameters for large-scale datasets, we pre-compute and store training trajectories of expert networks trained on the real dataset. Our method handily outperforms existing methods and also allows us to distill higher-resolution visual data.
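
The quantity being optimized, the distance between the synthetically trained parameters and the parameters trained on real data, can be written down in a few lines. This sketch covers the distance only, not the optimization of the distilled data. It uses flat parameter lists, and it normalizes by how far the expert itself moved between its start and target checkpoints; that normalization and all names are assumptions of the sketch rather than something stated in the abstract.

```python
def trajectory_matching_loss(student_params, expert_start, expert_target):
    """Squared distance from student to expert target, normalised by the
    squared distance the expert travelled from its start to its target."""
    numerator = sum((s - e) ** 2 for s, e in zip(student_params, expert_target))
    denominator = sum((a - e) ** 2 for a, e in zip(expert_start, expert_target))
    return numerator / denominator


# Hypothetical flattened parameters from a stored expert trajectory.
expert_start = [0.0, 0.0]
expert_target = [1.0, 2.0]

# A student initialised at expert_start and trained on distilled data
# that has covered half of the way to the target.
student_params = [0.5, 1.0]

loss = trajectory_matching_loss(student_params, expert_start, expert_target)
# loss == 0.25
```

With this normalization a student that has not moved from the start checkpoint scores 1.0 and a student that lands exactly on the target scores 0.0, so values are comparable across checkpoints.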

* CVPR 2022. Website: https://georgecazenavette.github.io/mtt-distillation/ Code: https://github.com/GeorgeCazenavette/mtt-distillation
