Jitendra Malik

Better Knowledge Retention through Metric Learning

Nov 26, 2020
Ke Li, Shichong Peng, Kailas Vodrahalli, Jitendra Malik

In continual learning, new categories may be introduced over time, and an ideal learning system should perform well on both the original categories and the new categories. While deep neural nets have achieved resounding success in the classical supervised setting, they are known to forget about knowledge acquired in prior episodes of learning if the examples encountered in the current episode of learning are drastically different from those encountered in prior episodes. In this paper, we propose a new method that can both leverage the expressive power of deep neural nets and is resilient to forgetting when new categories are introduced. We found the proposed method can reduce forgetting by 2.3x to 6.9x on CIFAR-10 compared to existing methods and by 1.8x to 2.7x on ImageNet compared to an oracle baseline.

Robust Policies via Mid-Level Visual Representations: An Experimental Study in Manipulation and Navigation

Nov 13, 2020
Bryan Chen, Alexander Sax, Gene Lewis, Iro Armeni, Silvio Savarese, Amir Zamir, Jitendra Malik, Lerrel Pinto

Vision-based robotics often separates the control loop into one module for perception and a separate module for control. It is possible to train the whole system end-to-end (e.g. with deep RL), but doing it "from scratch" comes with a high sample complexity cost and the final result is often brittle, failing unexpectedly if the test environment differs from that of training. We study the effects of using mid-level visual representations (features learned asynchronously for traditional computer vision objectives), as a generic and easy-to-decode perceptual state in an end-to-end RL framework. Mid-level representations encode invariances about the world, and we show that they aid generalization, improve sample complexity, and lead to a higher final performance. Compared to other approaches for incorporating invariances, such as domain randomization, asynchronously trained mid-level representations scale better: both to harder problems and to larger domain shifts. In practice, this means that mid-level representations could be used to successfully train policies for tasks where domain randomization and learning-from-scratch failed. We report results on both manipulation and navigation tasks, and for navigation include zero-shot sim-to-real experiments on real robots.

* Extended version of CoRL 2020 camera ready. Supplementary released separately

Rearrangement: A Challenge for Embodied AI

Nov 03, 2020
Dhruv Batra, Angel X. Chang, Sonia Chernova, Andrew J. Davison, Jia Deng, Vladlen Koltun, Sergey Levine, Jitendra Malik, Igor Mordatch, Roozbeh Mottaghi, Manolis Savva, Hao Su

We describe a framework for research and evaluation in Embodied AI. Our proposal is based on a canonical task: Rearrangement. A standard task can focus the development of new techniques and serve as a source of trained models that can be transferred to other settings. In the rearrangement task, the goal is to bring a given physical environment into a specified state. The goal state can be specified by object poses, by images, by a description in language, or by letting the agent experience the environment in the goal state. We characterize rearrangement scenarios along different axes and describe metrics for benchmarking rearrangement performance. To facilitate research and exploration, we present experimental testbeds of rearrangement scenarios in four different simulation environments. We anticipate that other datasets will be released and new simulation platforms will be built to support training of rearrangement agents and their deployment on physical systems.

* Authors are listed in alphabetical order

Shape, Illumination, and Reflectance from Shading

Oct 07, 2020
Jonathan T. Barron, Jitendra Malik

A fundamental problem in computer vision is that of inferring the intrinsic, 3D structure of the world from flat, 2D images of that world. Traditional methods for recovering scene properties such as shape, reflectance, or illumination rely on multiple observations of the same scene to overconstrain the problem. Recovering these same properties from a single image seems almost impossible in comparison -- there are an infinite number of shapes, paints, and lights that exactly reproduce a single image. However, certain explanations are more likely than others: surfaces tend to be smooth, paint tends to be uniform, and illumination tends to be natural. We therefore pose this problem as one of statistical inference, and define an optimization problem that searches for the *most likely* explanation of a single image. Our technique can be viewed as a superset of several classic computer vision problems (shape-from-shading, intrinsic images, color constancy, illumination estimation, etc.) and outperforms all previous solutions to those constituent problems.

* TPAMI 2015

Uncertainty Sets for Image Classifiers using Conformal Prediction

Sep 29, 2020
Anastasios Angelopoulos, Stephen Bates, Jitendra Malik, Michael I. Jordan

Convolutional image classifiers can achieve high predictive accuracy, but quantifying their uncertainty remains an unresolved challenge, hindering their deployment in consequential settings. Existing uncertainty quantification techniques, such as Platt scaling, attempt to calibrate the network's probability estimates, but they do not have formal guarantees. We present an algorithm that modifies any classifier to output a predictive set containing the true label with a user-specified probability, such as 90%. The algorithm is simple and fast like Platt scaling, but provides a formal finite-sample coverage guarantee for every model and dataset. Furthermore, our method generates much smaller predictive sets than alternative methods, since we introduce a regularizer to stabilize the small scores of unlikely classes after Platt scaling. In experiments on both ImageNet and ImageNet-V2 with a ResNet-152 and other classifiers, our scheme outperforms existing approaches, achieving exact coverage with sets that are often factors of 5 to 10 smaller.

* Codebase available at https://github.com/aangelopoulos/conformal_classification
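
To make the idea of a predictive set with a user-specified coverage level concrete, here is a minimal sketch of plain split conformal prediction. It is not the regularized method from the paper or the API of the linked codebase; the function names `conformal_threshold` and `prediction_sets`, the score `1 - p(true class)`, and the toy data are all assumptions chosen for illustration.

```python
import math

import numpy as np


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Return a score threshold from a held-out calibration set.

    cal_probs:  (n, num_classes) array of predicted class probabilities.
    cal_labels: (n,) array of true integer labels.
    alpha:      target miscoverage (alpha=0.1 targets about 90% coverage).
    """
    n = len(cal_labels)
    # Nonconformity score (an illustrative choice): 1 - probability of the true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample correction: take the ceil((n + 1)(1 - alpha))-th smallest score.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: fall back to including every class.
        return 1.0
    return float(np.sort(scores)[k - 1])


def prediction_sets(test_probs, qhat):
    """For each test row, return the indices of classes whose score is within qhat."""
    return [np.flatnonzero(1.0 - p <= qhat) for p in test_probs]


# Toy usage with hand-written probabilities (hypothetical data).
cal_probs = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.0],
    [0.4, 0.0, 0.6],
    [0.3, 0.7, 0.0],
])
cal_labels = np.array([0, 1, 2, 0])
qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.5)
sets = prediction_sets(np.array([[0.7, 0.2, 0.1], [0.5, 0.45, 0.05]]), qhat)
print(qhat, [s.tolist() for s in sets])
```

With this simple score, a confident prediction yields a small set and an uncertain one can yield an empty or large set; the abstract's regularizer targets exactly the set-size behavior that this basic variant does not control.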

Perceiving 3D Human-Object Spatial Arrangements from a Single Image in the Wild

Aug 19, 2020
Jason Y. Zhang, Sam Pepose, Hanbyul Joo, Deva Ramanan, Jitendra Malik, Angjoo Kanazawa

We present a method that infers spatial arrangements and shapes of humans and objects in a globally consistent 3D scene, all from a single image in-the-wild captured in an uncontrolled environment. Notably, our method runs on datasets without any scene- or object-level 3D supervision. Our key insight is that considering humans and objects jointly gives rise to "3D common sense" constraints that can be used to resolve ambiguity. In particular, we introduce a scale loss that learns the distribution of object size from data; an occlusion-aware silhouette re-projection loss to optimize object pose; and a human-object interaction loss to capture the spatial layout of objects with which humans interact. We empirically validate that our constraints dramatically reduce the space of likely 3D spatial configurations. We demonstrate our approach on challenging, in-the-wild images of humans interacting with large objects (such as bicycles, motorcycles, and surfboards) and handheld objects (such as laptops, tennis rackets, and skateboards). We quantify the ability of our approach to recover human-object arrangements and outline remaining challenges in this domain. The project webpage can be found at https://jasonyzhang.com/phosa.

* In ECCV 2020. v2: Updated Related Work

Learning Long-term Visual Dynamics with Region Proposal Interaction Networks

Aug 05, 2020
Haozhi Qi, Xiaolong Wang, Deepak Pathak, Yi Ma, Jitendra Malik

Learning long-term dynamics models is the key to understanding physical common sense. Most existing approaches to learning dynamics from visual input sidestep long-term predictions by resorting to rapid re-planning with short-term models. This not only requires such models to be highly accurate but also limits them to tasks where an agent can continuously obtain feedback and take action at each step until completion. In this paper, we aim to leverage ideas from success stories in visual recognition tasks to build object representations that can capture inter-object and object-environment interactions over a long range. To this end, we propose Region Proposal Interaction Networks (RPIN), which reason about each object's trajectory in a latent region-proposal feature space. Thanks to the simple yet effective object representation, our approach outperforms prior methods by a significant margin both in terms of prediction quality and its ability to plan for downstream tasks, and also generalizes well to novel environments. Our code is available at https://github.com/HaozhiQi/RPIN.

* Code: https://github.com/HaozhiQi/RPIN; Website: https://haozhiqi.github.io/RPIN/

Shape and Viewpoint without Keypoints

Jul 21, 2020
Shubham Goel, Angjoo Kanazawa, Jitendra Malik

We present a learning framework that recovers the 3D shape, pose and texture from a single image, trained on an image collection without any ground truth 3D shape, multi-view, camera viewpoints or keypoint supervision. We approach this highly under-constrained problem in an "analysis by synthesis" framework where the goal is to predict the likely shape, texture and camera viewpoint that could produce the image with various learned category-specific priors. Our particular contribution in this paper is a representation of the distribution over cameras, which we call "camera-multiplex". Instead of picking a point estimate, we maintain a set of camera hypotheses that are optimized during training to best explain the image given the current shape and texture. We call our approach Unsupervised Category-Specific Mesh Reconstruction (U-CMR), and present qualitative and quantitative results on CUB, Pascal 3D and new web-scraped datasets. We obtain state-of-the-art camera prediction results and show that we can learn to predict diverse shapes and textures across objects using an image collection without any keypoint annotations or 3D ground truth. Project page: https://shubham-goel.github.io/ucmr

* Accepted at ECCV 2020

3D Shape Reconstruction from Vision and Touch

Jul 07, 2020
Edward J. Smith, Roberto Calandra, Adriana Romero, Georgia Gkioxari, David Meger, Jitendra Malik, Michal Drozdzal

When a toddler is presented with a new toy, their instinctual behaviour is to pick it up and inspect it with their hand and eyes in tandem, clearly searching over its surface to properly understand what they are playing with. Here, touch provides high fidelity localized information while vision provides complementary global context. However, in 3D shape reconstruction, the complementary fusion of visual and haptic modalities remains largely unexplored. In this paper, we study this problem and present an effective chart-based approach to fusing vision and touch, which leverages advances in graph convolutional networks. To do so, we introduce a dataset of simulated touch and vision signals from the interaction between a robotic hand and a large array of 3D objects. Our results show that (1) leveraging both vision and touch signals consistently improves over single-modality baselines; (2) our approach outperforms alternative modality fusion methods and strongly benefits from the proposed chart-based structure; (3) the reconstruction quality increases with the number of grasps provided; and (4) the touch information not only enhances the reconstruction at the touch site but also extrapolates to its local neighborhood.

* Submitted for review
