Alexander Sax

Omnidata: A Scalable Pipeline for Making Multi-Task Mid-Level Vision Datasets from 3D Scans

Oct 11, 2021
Ainaz Eftekhar, Alexander Sax, Roman Bachmann, Jitendra Malik, Amir Zamir

This paper introduces a pipeline to parametrically sample and render multi-task vision datasets from comprehensive 3D scans of the real world. Changing the sampling parameters allows one to "steer" the generated datasets to emphasize specific information. In addition to enabling interesting lines of research, we show the tooling and generated data suffice to train robust vision models. Common architectures trained on a generated starter dataset reached state-of-the-art performance on multiple common vision tasks and benchmarks, despite having seen no benchmark or non-pipeline data. The depth estimation network outperforms MiDaS and the surface normal estimation network is the first to achieve human-level performance for in-the-wild surface normal estimation -- at least according to one metric on the OASIS benchmark. The Dockerized pipeline with CLI, the (mostly Python) code, PyTorch dataloaders for the generated data, the generated starter dataset, download scripts, and other utilities are available through our project website, https://omnidata.vision.
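
The released tooling reportedly ships PyTorch dataloaders for the generated data; as a rough illustration only, the sketch below shows a generic PyTorch Dataset that pairs RGB frames with per-task labels laid out in one folder per task. The directory layout, the task names ("rgb", "depth_zbuffer", "normal"), and the 8-bit PNG assumption are hypothetical and may differ from the actual release.

```python
# Minimal sketch (not the released dataloader): pair RGB frames with
# per-task labels stored as identically named files in one folder per task.
from pathlib import Path

from torch.utils.data import Dataset
from torchvision.io import read_image


class MultiTaskSceneDataset(Dataset):
    def __init__(self, root, tasks=("rgb", "depth_zbuffer", "normal")):
        self.root = Path(root)
        self.tasks = tasks
        # Assume every task folder contains the same frame names.
        self.frames = sorted(p.name for p in (self.root / tasks[0]).glob("*.png"))

    def __len__(self):
        return len(self.frames)

    def __getitem__(self, idx):
        name = self.frames[idx]
        # Returns a dict: task name -> float tensor in [0, 1].
        # (Real depth labels are often 16-bit; /255 is a simplification.)
        return {t: read_image(str(self.root / t / name)).float() / 255.0
                for t in self.tasks}
```

A standard torch.utils.data.DataLoader can batch these dictionaries directly, since every value is a tensor.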

* ICCV 2021: See project website https://omnidata.vision 

Robust Policies via Mid-Level Visual Representations: An Experimental Study in Manipulation and Navigation

Nov 13, 2020
Bryan Chen, Alexander Sax, Gene Lewis, Iro Armeni, Silvio Savarese, Amir Zamir, Jitendra Malik, Lerrel Pinto

Vision-based robotics often separates the control loop into one module for perception and a separate module for control. It is possible to train the whole system end-to-end (e.g. with deep RL), but doing so "from scratch" comes with a high sample-complexity cost, and the final result is often brittle, failing unexpectedly if the test environment differs from that of training. We study the effects of using mid-level visual representations (features learned asynchronously for traditional computer vision objectives) as a generic and easy-to-decode perceptual state in an end-to-end RL framework. Mid-level representations encode invariances about the world, and we show that they aid generalization, improve sample complexity, and lead to higher final performance. Compared to other approaches for incorporating invariances, such as domain randomization, asynchronously trained mid-level representations scale better, both to harder problems and to larger domain shifts. In practice, this means that mid-level representations could be used to successfully train policies for tasks where domain randomization and learning from scratch failed. We report results on both manipulation and navigation tasks, and for navigation we include zero-shot sim-to-real experiments on real robots.
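
To make the setup concrete, here is a minimal sketch of the core mechanism: the policy never sees raw pixels, only the output of a network trained offline for a mid-level vision objective and kept frozen during RL. The wrapper below uses gym's ObservationWrapper; the encoder, its feature dimension, and the flattening into a vector are illustrative assumptions rather than the paper's exact architecture.

```python
# Sketch: replace raw-image observations with features from a frozen
# mid-level vision network before the RL policy ever sees them.
import gym
import numpy as np
import torch


class MidLevelFeatureWrapper(gym.ObservationWrapper):
    def __init__(self, env, encoder, feat_dim):
        super().__init__(env)
        self.encoder = encoder.eval()              # trained offline, never updated here
        for p in self.encoder.parameters():
            p.requires_grad_(False)
        self.observation_space = gym.spaces.Box(
            low=-np.inf, high=np.inf, shape=(feat_dim,), dtype=np.float32)

    def observation(self, obs):
        # obs: H x W x C uint8 image from the simulator.
        x = torch.from_numpy(obs).permute(2, 0, 1).float().unsqueeze(0) / 255.0
        with torch.no_grad():
            feats = self.encoder(x)                # e.g. a depth- or normals-trained encoder
        return feats.flatten().cpu().numpy().astype(np.float32)
```

Any standard RL algorithm can then run on the wrapped environment unchanged, which keeps the comparison against from-scratch and domain-randomized training straightforward.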

* Extended version of CoRL 2020 camera ready. Supplementary released separately 

Robust Learning Through Cross-Task Consistency

Jun 07, 2020
Amir Zamir, Alexander Sax, Teresa Yeo, Oğuzhan Kar, Nikhil Cheerla, Rohan Suri, Zhangjie Cao, Jitendra Malik, Leonidas Guibas

Visual perception entails solving a wide set of tasks, e.g., object detection, depth estimation, etc. The predictions made for multiple tasks from the same image are not independent, and therefore, are expected to be consistent. We propose a broadly applicable and fully computational method for augmenting learning with Cross-Task Consistency. The proposed formulation is based on inference-path invariance over a graph of arbitrary tasks. We observe that learning with cross-task consistency leads to more accurate predictions and better generalization to out-of-distribution inputs. This framework also leads to an informative unsupervised quantity, called Consistency Energy, based on measuring the intrinsic consistency of the system. Consistency Energy correlates well with the supervised error (r=0.67), thus it can be employed as an unsupervised confidence metric as well as for detection of out-of-distribution inputs (ROC-AUC=0.95). The evaluations are performed on multiple datasets, including Taskonomy, Replica, CocoDoom, and ApolloScape, and they benchmark cross-task consistency versus various baselines including conventional multi-task learning, cycle consistency, and analytical consistency.
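
In its simplest "triangle" form, inference-path invariance amounts to requiring that a prediction for one task, mapped into a second task's domain by a cross-task network, agrees with that second task's label. The sketch below is a toy version of such a consistency term for an image -> depth -> normals triangle; the network names, the L1 losses, and the single-triangle setup are illustrative simplifications of the paper's general graph formulation.

```python
# Toy cross-task consistency term for one triangle: image -> depth -> normals.
import torch.nn.functional as F


def consistency_loss(x, y_depth, y_normals, f_x2d, g_d2n, lam=1.0):
    """f_x2d: trainable image->depth network.
    g_d2n: pretrained depth->normals network whose weights are kept frozen,
    although gradients still flow through it back to f_x2d."""
    d_pred = f_x2d(x)                              # direct path: image -> depth
    direct = F.l1_loss(d_pred, y_depth)
    n_from_pred = g_d2n(d_pred)                    # cross path: image -> depth -> normals
    consistency = F.l1_loss(n_from_pred, y_normals)
    return direct + lam * consistency
```

The Consistency Energy can be thought of as measuring, at test time, how much the different inference paths into a target domain disagree with one another, which is why it requires no labels.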

* CVPR 2020 (Oral). Project website, models, live demo at http://consistency.epfl.ch/ 

Side-Tuning: Network Adaptation via Additive Side Networks

Dec 31, 2019
Jeffrey O Zhang, Alexander Sax, Amir Zamir, Leonidas Guibas, Jitendra Malik

When training a neural network for a desired task, one may prefer to adapt a pre-trained network rather than start from a randomly initialized one -- for example when training data is scarce, when performing lifelong learning (the system must learn a new task after having been trained for others), or when one wishes to encode priors in the network via preset weights. The most commonly employed approaches for network adaptation are fine-tuning and using the pre-trained network as a fixed feature extractor. In this paper, we propose a straightforward alternative: Side-Tuning. Side-tuning adapts a pre-trained network by training a lightweight "side" network that is fused with the (unchanged) pre-trained network using summation. This simple method works as well as or better than existing solutions while resolving some of the basic issues with fine-tuning, fixed features, and several other common baselines. In particular, side-tuning is less prone to overfitting when little training data is available, yields better results than using a fixed feature extractor, and does not suffer from catastrophic forgetting in lifelong learning. We demonstrate the performance of side-tuning under a diverse set of scenarios, including lifelong learning (iCIFAR, Taskonomy), reinforcement learning, imitation learning (visual navigation in Habitat), NLP question answering (SQuAD v2), and single-task transfer learning (Taskonomy), with consistently promising results.
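
The additive fusion is simple enough to write down directly. Below is a minimal sketch, assuming a frozen base network, a small trainable side network with the same output shape, and a sigmoid-squashed learnable blend; the actual blending schedule (fixed, learned, or annealed) is a detail this sketch glosses over.

```python
# Sketch of additive side-tuning: frozen base + lightweight trainable side
# network, fused by an alpha blend.
import torch
import torch.nn as nn


class SideTune(nn.Module):
    def __init__(self, base: nn.Module, side: nn.Module, alpha: float = 0.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():           # the pre-trained base stays frozen
            p.requires_grad_(False)
        self.side = side                           # small network, trained from scratch
        self.alpha = nn.Parameter(torch.tensor(alpha))

    def forward(self, x):
        a = torch.sigmoid(self.alpha)              # keep the blend in (0, 1)
        return a * self.base(x) + (1.0 - a) * self.side(x)
```

Because only the side network and the scalar blend receive gradients, the pre-trained weights cannot be overwritten, which is what prevents catastrophic forgetting by construction.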

* See project website at http://sidetuning.berkeley.edu 

Learning to Navigate Using Mid-Level Visual Priors

Dec 23, 2019
Alexander Sax, Jeffrey O. Zhang, Bradley Emi, Amir Zamir, Silvio Savarese, Leonidas Guibas, Jitendra Malik

How much do visual priors about the world (e.g. the fact that the world is 3D) assist in learning to perform downstream motor tasks (e.g. navigating a complex environment)? What are the consequences of not utilizing such visual priors in learning? We study these questions by integrating a generic perceptual skill set (a distance estimator, an edge detector, etc.) within a reinforcement learning framework (see Fig. 1 of the paper). This skill set ("mid-level vision") provides the policy with a more processed state of the world than raw images. Our large-scale study demonstrates that using mid-level vision results in policies that learn faster, generalize better, and achieve higher final performance compared to learning from scratch and/or using state-of-the-art visual and non-visual representation learning methods. We show that conventional computer vision objectives are particularly effective in this regard and can be conveniently integrated into reinforcement learning frameworks. Finally, we find that no single visual representation is universally useful for all downstream tasks; hence, we computationally derive a task-agnostic set of representations optimized to support arbitrary downstream tasks.
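
The final point, selecting a small set of representations that supports arbitrary downstream tasks, is posed in the project as a coverage-style selection problem. As a toy illustration only (made-up feature and task names, and a greedy heuristic rather than the actual solver), the idea looks roughly like this:

```python
# Toy greedy approximation of picking a small "covering" set of mid-level
# features: every downstream task should have at least one strong feature
# in the chosen set. Names and scores are invented for illustration.
def greedy_feature_set(wins, k):
    """wins: dict mapping feature name -> set of downstream tasks it wins on."""
    chosen, covered = [], set()
    for _ in range(k):
        best = max(wins, key=lambda f: len(wins[f] - covered))
        if not wins[best] - covered:
            break                                  # nothing new left to cover
        chosen.append(best)
        covered |= wins[best]
    return chosen


wins = {"depth": {"navigation", "exploration"},
        "curvature": {"exploration"},
        "semantics": {"local_planning"}}
print(greedy_feature_set(wins, k=2))               # -> ['depth', 'semantics']
```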

* In Conference on Robot Learning, 2019. See project website and demos at http://perceptual.actor/ 

Mid-Level Visual Representations Improve Generalization and Sample Efficiency for Learning Active Tasks

Dec 31, 2018
Alexander Sax, Bradley Emi, Amir R. Zamir, Leonidas Guibas, Silvio Savarese, Jitendra Malik

One of the ultimate promises of computer vision is to help robotic agents perform active tasks, like delivering packages or doing household chores. However, the conventional approach to solving "vision" is to define a set of offline recognition problems (e.g. object detection) and solve those first. This approach faces a challenge from the recent rise of Deep Reinforcement Learning frameworks that learn active tasks from scratch using images as input. This poses a set of fundamental questions: what is the role of computer vision if everything can be learned from scratch? Could intermediate vision tasks actually be useful for performing arbitrary downstream active tasks? We show that proper use of mid-level perception confers significant advantages over training from scratch. We implement a perception module as a set of mid-level visual representations and demonstrate that learning active tasks with mid-level features is significantly more sample-efficient than learning from scratch and generalizes in situations where the from-scratch approach fails. However, we show that realizing these gains requires careful selection of the particular mid-level features for each downstream task. Finally, based on the results of our study, we put forth a simple and efficient perception module that can be adopted as a fairly generic component in active frameworks.

* See project website and demos at http://perceptual.actor 

Gibson Env: Real-World Perception for Embodied Agents

Aug 31, 2018
Fei Xia, Amir Zamir, Zhi-Yang He, Alexander Sax, Jitendra Malik, Silvio Savarese

Developing visual perception models for active agents and sensorimotor control is cumbersome to do in the physical world, as existing algorithms are too slow to learn efficiently in real time and robots are fragile and costly. This has given rise to learning in simulation, which in turn raises the question of whether the results transfer to the real world. In this paper, we are concerned with the problem of developing real-world perception for active agents, propose the Gibson Virtual Environment for this purpose, and showcase sample perceptual tasks learned therein. Gibson is based on virtualizing real spaces, rather than using artificially designed ones, and currently includes over 1400 floor spaces from 572 full buildings. The main characteristics of Gibson are: I. being from the real world and reflecting its semantic complexity, II. having an internal synthesis mechanism, "Goggles", that enables deploying the trained models in the real world without further domain adaptation, and III. embodiment of agents, making them subject to the constraints of physics and space.

* CVPR 2018  
* Access the code, dataset, and project website at http://gibsonenv.vision/

Taskonomy: Disentangling Task Transfer Learning

Apr 23, 2018
Amir Zamir, Alexander Sax, William Shen, Leonidas Guibas, Jitendra Malik, Silvio Savarese

Do visual tasks have a relationship, or are they unrelated? For instance, could having surface normals simplify estimating the depth of an image? Intuition answers these questions positively, implying the existence of a structure among visual tasks. Knowing this structure has notable value: it is the concept underlying transfer learning and provides a principled way to identify redundancies across tasks, e.g., to seamlessly reuse supervision among related tasks or solve many tasks in one system without piling up complexity. We propose a fully computational approach for modeling the structure of the space of visual tasks. This is done by finding (first- and higher-order) transfer learning dependencies across a dictionary of twenty-six 2D, 2.5D, 3D, and semantic tasks in a latent space. The product is a computational taxonomic map for task transfer learning. We study the consequences of this structure, e.g., nontrivial emergent relationships, and exploit them to reduce the demand for labeled data. For example, we show that the total number of labeled datapoints needed to solve a set of 10 tasks can be reduced by roughly 2/3 (compared to training independently) while keeping performance nearly the same. We provide a set of tools for computing and probing this taxonomical structure, including a solver that users can employ to devise efficient supervision policies for their use cases.
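
The basic object the method produces is a matrix of transfer affinities between tasks, from which good sources for each target can be read off. The sketch below uses made-up numbers and a per-target argmax purely for illustration; the paper normalizes pairwise transfer results and solves a global subgraph-selection problem rather than picking sources independently.

```python
# Toy transfer-affinity matrix: affinity[i, j] = benefit of transferring
# FROM tasks[i] TO tasks[j]. Values are invented for illustration.
import numpy as np

tasks = ["normals", "depth", "reshading", "segmentation"]
affinity = np.array([
    [1.00, 0.62, 0.55, 0.20],
    [0.60, 1.00, 0.58, 0.18],
    [0.57, 0.59, 1.00, 0.15],
    [0.22, 0.20, 0.16, 1.00],
])

for j, target in enumerate(tasks):
    scores = affinity[:, j].copy()
    scores[j] = -np.inf                            # exclude trivial self-transfer
    print(f"best source for {target}: {tasks[int(np.argmax(scores))]}")
```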

* CVPR 2018 (Oral). See project website and live demos at http://taskonomy.vision/ 