Antonio Torralba

Ego4D: Around the World in 3,000 Hours of Egocentric Video

Oct 13, 2021

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu(+74 more)

Abstract: We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,025 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 855 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume of diverse egocentric video footage publicly available to the research community. Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event. Furthermore, we present a host of new benchmark challenges centered around understanding the first-person visual experience in the past (querying an episodic memory), present (analyzing hand-object manipulation, audio-visual conversation, and social interactions), and future (forecasting activities). By publicly sharing this massive annotated dataset and benchmark suite, we aim to push the frontier of first-person perception. Project page: https://ego4d-data.org/

OPEn: An Open-ended Physics Environment for Learning Without a Task

Oct 13, 2021

Chuang Gan, Abhishek Bhandwaldar, Antonio Torralba, Joshua B. Tenenbaum, Phillip Isola

Abstract: Humans have mental models that allow them to plan, experiment, and reason in the physical world. How should an intelligent agent go about learning such models? In this paper, we study whether models of the world learned in an open-ended physics environment, without any specific tasks, can be reused for downstream physics reasoning tasks. To this end, we build a benchmark, the Open-ended Physics ENvironment (OPEn), and also design several tasks to explicitly test representation learning in this environment. This setting reflects the conditions in which real agents (i.e., rolling robots) find themselves, where they may be placed in a new kind of environment and must adapt without any teacher to tell them how this environment works. This setting is challenging because it requires solving an exploration problem in addition to a model building and representation learning problem. We test several existing RL-based exploration methods on this benchmark and find that an agent using unsupervised contrastive learning for representation learning, and impact-driven learning for exploration, achieved the best results. However, all models still fall short in sample efficiency when transferring to the downstream tasks. We expect that OPEn will encourage the development of novel rolling robot agents that can build reusable mental models of the world that facilitate many tasks.

* IROS 2021. Project page: http://open.csail.mit.edu/

Toward a Visual Concept Vocabulary for GAN Latent Space

Oct 08, 2021

Sarah Schwettmann, Evan Hernandez, David Bau, Samuel Klein, Jacob Andreas, Antonio Torralba

Abstract: A large body of recent work has identified transformations in the latent spaces of generative adversarial networks (GANs) that consistently and interpretably transform generated images. But existing techniques for identifying these transformations rely on either a fixed vocabulary of pre-specified visual concepts, or on unsupervised disentanglement techniques whose alignment with human judgments about perceptual salience is unknown. This paper introduces a new method for building open-ended vocabularies of primitive visual concepts represented in a GAN's latent space. Our approach is built from three components: (1) automatic identification of perceptually salient directions based on their layer selectivity; (2) human annotation of these directions with free-form, compositional natural language descriptions; and (3) decomposition of these annotations into a visual concept vocabulary, consisting of distilled directions labeled with single words. Experiments show that concepts learned with our approach are reliable and composable, generalizing across classes, contexts, and observers, and enabling fine-grained manipulation of image style and content.

* 15 pages, 13 figures. Accepted to ICCV 2021. Project page: https://visualvocab.csail.mit.edu

Weakly Supervised Human-Object Interaction Detection in Video via Contrastive Spatiotemporal Regions

Oct 07, 2021

Shuang Li, Yilun Du, Antonio Torralba, Josef Sivic, Bryan Russell

Abstract: We introduce the task of weakly supervised learning for detecting human and object interactions in videos. Our task poses unique challenges as a system does not know what types of human-object interactions are present in a video or the actual spatiotemporal location of the human and the object. To address these challenges, we introduce a contrastive weakly supervised training loss that aims to jointly associate spatiotemporal regions in a video with an action and object vocabulary and encourage temporal continuity of the visual appearance of moving objects as a form of self-supervision. To train our model, we introduce a dataset comprising over 6.5k videos with human-object interaction annotations that have been semi-automatically curated from sentence captions associated with the videos. We demonstrate improved performance over weakly supervised baselines adapted to our task on our video dataset.

Scaling up instance annotation via label propagation

Oct 05, 2021

Dim P. Papadopoulos, Ethan Weber, Antonio Torralba

Abstract: Manually annotating object segmentation masks is very time-consuming. While interactive segmentation methods offer a more efficient alternative, they become unaffordable at a large scale because the cost grows linearly with the number of annotated masks. In this paper, we propose a highly efficient annotation scheme for building large datasets with object segmentation masks. At a large scale, images contain many object instances with similar appearance. We exploit these similarities by using hierarchical clustering on mask predictions made by a segmentation model. We propose a scheme that efficiently searches through the hierarchy of clusters and selects which clusters to annotate. Humans manually verify only a few masks per cluster, and the labels are propagated to the whole cluster. Through a large-scale experiment to populate 1M unlabeled images with object segmentation masks for 80 object classes, we show that (1) we obtain 1M object segmentation masks with a total annotation time of only 290 hours; (2) we reduce annotation time by 76x compared to manual annotation; (3) the segmentation quality of our masks is on par with those from manually annotated datasets. Code, data, and models are available online.

* ICCV 2021
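
As a quick sanity check on the figures quoted in the abstract above, the sketch below works out the per-mask annotation time implied by 290 hours for 1M masks, and the manual effort implied by the 76x reduction. It assumes exactly 1,000,000 masks and treats 76x as an exact ratio; the function and variable names are illustrative and do not come from the paper.

```python
# Figures quoted in the abstract; treated as exact for this rough check (assumption).
TOTAL_HOURS = 290        # total annotation time with label propagation
NUM_MASKS = 1_000_000    # "1M object segmentation masks"
SPEEDUP = 76             # "76x compared to manual annotation"


def annotation_stats(total_hours, num_masks, speedup):
    """Return (seconds per mask, implied manual hours, implied manual seconds per mask)."""
    seconds_per_mask = total_hours * 3600 / num_masks
    manual_hours = total_hours * speedup
    manual_seconds_per_mask = seconds_per_mask * speedup
    return seconds_per_mask, manual_hours, manual_seconds_per_mask


sec, manual_h, manual_sec = annotation_stats(TOTAL_HOURS, NUM_MASKS, SPEEDUP)
print(f"{sec:.3f} s per mask with propagation")
print(f"{manual_h} hours implied for fully manual annotation")
print(f"{manual_sec:.1f} s per mask implied for manual annotation")
```

Under these assumptions the scheme costs roughly one second of human time per mask, against a bit over a minute per mask for the manual baseline implied by the stated ratio.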

Skill Induction and Planning with Latent Language

Oct 04, 2021

Pratyusha Sharma, Antonio Torralba, Jacob Andreas

Abstract: We present a framework for learning hierarchical policies from demonstrations, using sparse natural language annotations to guide the discovery of reusable skills for autonomous decision-making. We formulate a generative model of action sequences in which goals generate sequences of high-level subtask descriptions, and these descriptions generate sequences of low-level actions. We describe how to train this model using primarily unannotated demonstrations by parsing demonstrations into sequences of named high-level subtasks, using only a small number of seed annotations to ground language in action. In trained models, the space of natural language commands indexes a combinatorial library of skills; agents can use these skills to plan by generating high-level instruction sequences tailored to novel goals. We evaluate this approach in the ALFRED household simulation environment, providing natural language annotations for only 10% of demonstrations. It completes more than twice as many tasks as a standard approach to learning from demonstrations, matching the performance of instruction following models with access to ground-truth plans during both training and evaluation.

* 13 pages, 6 figures

Dynamic Modeling of Hand-Object Interactions via Tactile Sensing

Sep 09, 2021

Qiang Zhang, Yunzhu Li, Yiyue Luo, Wan Shou, Michael Foshey, Junchi Yan, Joshua B. Tenenbaum, Wojciech Matusik, Antonio Torralba

Abstract: Tactile sensing is critical for humans to perform everyday tasks. While significant progress has been made in analyzing object grasping from vision, it remains unclear how we can utilize tactile sensing to reason about and model the dynamics of hand-object interactions. In this work, we employ a high-resolution tactile glove to perform four different interactive activities on a diversified set of objects. We build our model on a cross-modal learning framework and generate the labels using a visual processing pipeline to supervise the tactile model, which can then be used on its own at test time. The tactile model aims to predict the 3D locations of both the hand and the object purely from the touch data by combining a predictive model and a contrastive learning module. This framework can reason about the interaction patterns from the tactile data, hallucinate the changes in the environment, estimate the uncertainty of the prediction, and generalize to unseen objects. We also provide detailed ablation studies regarding different system designs as well as visualizations of the predicted trajectories. This work takes a step toward dynamics modeling in hand-object interactions from dense tactile sensing, which opens the door for future applications in activity learning, human-computer interactions, and imitation learning for robotics.

* IROS 2021. First two authors contributed equally. Project page: http://phystouch.csail.mit.edu/

What You Can Learn by Staring at a Blank Wall

Aug 30, 2021

Prafull Sharma, Miika Aittala, Yoav Y. Schechner, Antonio Torralba, Gregory W. Wornell, William T. Freeman, Fredo Durand

Abstract: We present a passive non-line-of-sight method that infers the number of people or activity of a person from the observation of a blank wall in an unknown room. Our technique analyzes complex imperceptible changes in indirect illumination in a video of the wall to reveal a signal that is correlated with motion in the hidden part of a scene. We use this signal to classify between zero, one, or two moving people, or the activity of a person in the hidden scene. We train two convolutional neural networks using data collected from 20 different scenes, and achieve an accuracy of approximately 94% for both tasks in unseen test environments and real-time online settings. Unlike other passive non-line-of-sight methods, the technique does not rely on known occluders or controllable light sources, and generalizes to unknown rooms with no re-calibration. We analyze the generalization and robustness of our method with both real and synthetic data, and study the effect of the scene parameters on the signal quality.

3D Neural Scene Representations for Visuomotor Control

Jul 08, 2021

Yunzhu Li, Shuang Li, Vincent Sitzmann, Pulkit Agrawal, Antonio Torralba

Abstract: Humans have a strong intuitive understanding of the 3D environment around us. The mental model of the physics in our brain applies to objects of different materials and enables us to perform a wide range of manipulation tasks that are far beyond the reach of current robots. In this work, we aim to learn models for dynamic 3D scenes purely from 2D visual observations. Our model combines Neural Radiance Fields (NeRF) and time contrastive learning with an autoencoding framework, which learns viewpoint-invariant 3D-aware scene representations. We show that a dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks involving both rigid bodies and fluids, where the target is specified in a viewpoint different from what the robot operates on. When coupled with an auto-decoding framework, it can even support goal specification from camera viewpoints that are outside the training distribution. We further demonstrate the richness of the learned 3D dynamics model by performing future prediction and novel view synthesis. Finally, we provide detailed ablation studies regarding different system designs and qualitative analysis of the learned representations.

* First two authors contributed equally. Project Page: https://3d-representation-learning.github.io/nerf-dy/

Learning to See by Looking at Noise

Jun 10, 2021

Manel Baradad, Jonas Wulff, Tongzhou Wang, Phillip Isola, Antonio Torralba

Abstract: Current vision systems are trained on huge datasets, and these datasets come with costs: curation is expensive, they inherit human biases, and there are concerns over privacy and usage rights. To counter these costs, interest has surged in learning from cheaper data sources, such as unlabeled images. In this paper we go a step further and ask if we can do away with real image datasets entirely, instead learning from noise processes. We investigate a suite of image generation models that produce images from simple random processes. These are then used as training data for a visual representation learner with a contrastive loss. We study two types of noise processes: statistical image models and deep generative models under different random initializations. Our findings show that it is important for the noise to capture certain structural properties of real data but that good performance can be achieved even with processes that are far from realistic. We also find that diversity is a key property to learn good representations. Datasets, models, and code are available at https://mbaradad.github.io/learning_with_noise.
