Amir Zamir

Modality-invariant Visual Odometry for Embodied Vision

Apr 29, 2023
Marius Memmel, Roman Bachmann, Amir Zamir

Effectively localizing an agent in a realistic, noisy setting is crucial for many embodied vision tasks. Visual Odometry (VO) is a practical substitute for unreliable GPS and compass sensors, especially in indoor environments. While SLAM-based methods show solid performance without large data requirements, they are less flexible and robust with respect to noise and changes in the sensor suite compared to learning-based approaches. Recent deep VO models, however, limit themselves to a fixed set of input modalities, e.g., RGB and depth, while training on millions of samples. When sensors fail, sensor suites change, or modalities are intentionally looped out due to available resources, e.g., power consumption, the models fail catastrophically. Furthermore, training these models from scratch is even more expensive without simulator access or suitable existing models that can be fine-tuned. While such scenarios are mostly ignored in simulation, they commonly hinder a model's reusability in real-world applications. We propose a Transformer-based modality-invariant VO approach that can deal with diverse or changing sensor suites of navigation agents. Our model outperforms previous methods while training on only a fraction of the data. We hope this method opens the door to a broader range of real-world applications that can benefit from flexible and learned VO models.
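
To make the idea concrete, below is a minimal, hypothetical PyTorch sketch (not the authors' released model) of a Transformer encoder that tokenizes whichever modalities are present and regresses egomotion from a CLS token; all layer sizes, channel counts, and the pose parameterization are assumptions.

```python
# Hypothetical sketch, not the authors' code: a Transformer-based VO head that
# accepts any subset of input modalities. Consecutive frames are stacked along
# the channel axis, each available modality is patch-embedded into tokens, and
# missing modalities are simply omitted, so one set of weights serves any sensor suite.
import torch
import torch.nn as nn

class ModalityInvariantVO(nn.Module):
    def __init__(self, dim=256, patch=16, img_size=128, depth=4, heads=8):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # one patch embedding per supported modality (channel counts are assumptions:
        # 2 stacked RGB frames = 6 channels, 2 stacked depth frames = 2 channels)
        self.embed = nn.ModuleDict({
            "rgb": nn.Conv2d(6, dim, patch, stride=patch),
            "depth": nn.Conv2d(2, dim, patch, stride=patch),
        })
        self.mod_token = nn.ParameterDict(
            {k: nn.Parameter(torch.zeros(1, 1, dim)) for k in self.embed})
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, 4)  # e.g. (dx, dz, sin dtheta, cos dtheta)

    def forward(self, obs: dict) -> torch.Tensor:
        """obs maps modality name -> stacked frame pair; any subset of modalities works."""
        batch = next(iter(obs.values())).shape[0]
        tokens = [self.cls.expand(batch, -1, -1)]
        for name, x in obs.items():
            t = self.embed[name](x).flatten(2).transpose(1, 2)  # B x N x dim
            tokens.append(t + self.pos + self.mod_token[name])
        z = self.encoder(torch.cat(tokens, dim=1))
        return self.head(z[:, 0])  # egomotion estimate read off the CLS token

# The same weights can be queried with RGB only, depth only, or both:
model = ModalityInvariantVO()
pose = model({"rgb": torch.randn(2, 6, 128, 128)})
```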

An Information-Theoretic Approach to Transferability in Task Transfer Learning

Dec 20, 2022
Yajie Bao, Yang Li, Shao-Lun Huang, Lin Zhang, Lizhong Zheng, Amir Zamir, Leonidas Guibas

Task transfer learning is a popular technique in image processing applications that uses pre-trained models to reduce the supervision cost of related tasks. An important question is how to determine task transferability, i.e., given a common input domain, estimating to what extent representations learned from a source task can help in learning a target task. Typically, transferability is either measured experimentally or inferred through task relatedness, which is often defined without a clear operational meaning. In this paper, we present a novel metric, H-score, an easily computable evaluation function that estimates the performance of transferred representations from one task to another in classification problems, using statistical and information-theoretic principles. Experiments on real image data show that our metric is not only consistent with the empirical transferability measurement, but also useful to practitioners in applications such as source model selection and task transfer curriculum learning.

* 2019 IEEE International Conference on Image Processing (ICIP) (pp. 2309-2313). IEEE  
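
As a concrete illustration of why the metric is "easily computable", here is a short numpy sketch of the H-score under its commonly stated form H(f) = tr(cov(f(X))^{-1} cov(E[f(X)|Y])); treat the exact normalization details as assumptions rather than the paper's reference implementation.

```python
# Minimal H-score sketch: features that vary strongly across classes relative to
# their overall variance score higher, suggesting better transferability to the
# target classification task.
import numpy as np

def h_score(features: np.ndarray, labels: np.ndarray) -> float:
    """features: (n_samples, d) source-model representations; labels: (n_samples,) target labels."""
    f = features - features.mean(axis=0, keepdims=True)
    cov_f = np.cov(f, rowvar=False)                      # d x d overall covariance
    # class-conditional means, one row per sample -> covariance of E[f(X)|Y]
    g = np.zeros_like(f)
    for y in np.unique(labels):
        idx = labels == y
        g[idx] = f[idx].mean(axis=0)
    cov_g = np.cov(g, rowvar=False)
    # pseudo-inverse for numerical stability when cov_f is ill-conditioned
    return float(np.trace(np.linalg.pinv(cov_f) @ cov_g))

# toy usage: random features, 3 target classes
feats = np.random.randn(300, 16)
ys = np.random.randint(0, 3, size=300)
print(h_score(feats, ys))
```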

PALMER: Perception-Action Loop with Memory for Long-Horizon Planning

Dec 08, 2022
Onur Beker, Mohammad Mohammadi, Amir Zamir

To achieve autonomy in a priori unknown real-world scenarios, agents should be able to: i) act from high-dimensional sensory observations (e.g., images), ii) learn from past experience to adapt and improve, and iii) be capable of long-horizon planning. Classical planning algorithms (e.g., PRM, RRT) are proficient at handling long-horizon planning. Deep learning-based methods, in turn, can provide the necessary representations to address the other two, by modeling statistical contingencies between observations. In this direction, we introduce a general-purpose planning algorithm called PALMER that combines classical sampling-based planning algorithms with learning-based perceptual representations. For training these perceptual representations, we combine Q-learning with contrastive representation learning to create a latent space in which the distance between the embeddings of two states captures how easily an optimal policy can traverse between them. For planning with these perceptual representations, we re-purpose classical sampling-based planning algorithms to retrieve previously observed trajectory segments from a replay buffer and restitch them into approximately optimal paths that connect any given pair of start and goal states. This creates a tight feedback loop between representation learning, memory, reinforcement learning, and sampling-based planning. The end result is an experiential framework for long-horizon planning that is significantly more robust and sample-efficient than existing methods.

* Website: https://palmer.epfl.ch 
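
The retrieve-and-restitch step can be pictured with a small, hypothetical sketch (not the authors' implementation): endpoints of stored trajectory segments become graph nodes, two segments are connectable whenever their endpoint embeddings are within a radius in the learned latent space, and a shortest-path search over that graph yields the stitched plan.

```python
# Hypothetical, simplified retrieve-and-restitch sketch. Stored segments are
# connectable when the latent distance between one segment's exit and another's
# entry is small; Dijkstra over that graph returns a sequence of segment indices
# approximately linking start to goal.
import heapq
import numpy as np

def plan_by_stitching(segments, embed, start_obs, goal_obs, radius=1.0):
    """segments: list of (entry_obs, exit_obs) pairs from a replay buffer.
    embed: maps an observation to its learned latent embedding (np.ndarray)."""
    z_start, z_goal = embed(start_obs), embed(goal_obs)
    z_in = [embed(s) for s, _ in segments]    # where each segment begins
    z_out = [embed(e) for _, e in segments]   # where each segment ends

    def dist(a, b):
        # latent distance is assumed to approximate policy traversal cost
        return float(np.linalg.norm(a - b))

    n = len(segments)
    best = [float("inf")] * n
    frontier = []
    for j in range(n):                         # segments reachable from the start
        d = dist(z_start, z_in[j])
        if d < radius:
            best[j] = d
            heapq.heappush(frontier, (d, j, [j]))
    while frontier:
        cost, i, path = heapq.heappop(frontier)
        if cost > best[i]:
            continue
        if dist(z_out[i], z_goal) < radius:    # a segment ending near the goal
            return path
        for j in range(n):                     # try stitching segment j after i
            d = dist(z_out[i], z_in[j])
            if d < radius and cost + d < best[j]:
                best[j] = cost + d
                heapq.heappush(frontier, (cost + d, j, path + [j]))
    return None                                # nothing connects within the radius
```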

Task Discovery: Finding the Tasks that Neural Networks Generalize on

Dec 01, 2022
Andrei Atanov, Andrei Filatov, Teresa Yeo, Ajay Sohmshetty, Amir Zamir

When developing deep learning models, we usually decide what task we want to solve and then search for a model that generalizes well on it. An intriguing question would be: what if, instead of fixing the task and searching in the model space, we fix the model and search in the task space? Can we find tasks that the model generalizes on? What do they look like, and do they indicate anything? These are the questions we address in this paper. We propose a task discovery framework that automatically finds examples of such tasks by optimizing a generalization-based quantity called the agreement score. We demonstrate that one set of images can give rise to many tasks on which neural networks generalize well. These tasks are a reflection of the inductive biases of the learning framework and the statistical patterns present in the data, and can thus serve as a useful tool for analyzing neural networks and their biases. As an example, we show that the discovered tasks can be used to automatically create adversarial train-test splits which make a model fail at test time without changing the pixels or labels, but only by selecting how the datapoints should be split between the train and test sets. We end with a discussion on the human-interpretability of the discovered tasks.

* NeurIPS 2022, Project page at https://taskdiscovery.epfl.ch 
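
A hedged sketch of the agreement score at the core of the framework is shown below; the architectures, optimizer, and two-seed setup are simplifications, but they convey the quantity being optimized: how often two independently trained networks agree on held-out data.

```python
# Hedged agreement-score sketch: train two networks on the same candidate task
# from different random initializations and measure how often their predictions
# agree on held-out inputs.
import torch
import torch.nn as nn

def agreement_score(task_labels, x_train, x_test, make_model, epochs=50):
    """task_labels: callable mapping a batch of inputs to integer task labels."""
    preds = []
    for seed in (0, 1):
        torch.manual_seed(seed)                     # two independent trainings
        model = make_model()
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        y_train = task_labels(x_train)
        for _ in range(epochs):
            loss = nn.functional.cross_entropy(model(x_train), y_train)
            opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():
            preds.append(model(x_test).argmax(dim=1))
    # high agreement on unseen data means the task is one the model class generalizes on
    return (preds[0] == preds[1]).float().mean().item()

# toy usage: a 2-class task on random 32-dim inputs with a small MLP
make_mlp = lambda: nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))
x_tr, x_te = torch.randn(256, 32), torch.randn(128, 32)
score = agreement_score(lambda x: (x[:, 0] > 0).long(), x_tr, x_te, make_mlp)
```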

MultiMAE: Multi-modal Multi-task Masked Autoencoders

Apr 04, 2022
Roman Bachmann, David Mizrahi, Andrei Atanov, Amir Zamir

We propose a pre-training strategy called Multi-modal Multi-task Masked Autoencoders (MultiMAE). It differs from standard Masked Autoencoding in two key aspects: I) it can optionally accept additional modalities of information in the input besides the RGB image (hence "multi-modal"), and II) its training objective accordingly includes predicting multiple outputs besides the RGB image (hence "multi-task"). We make use of masking (across image patches and input modalities) to make training MultiMAE tractable as well as to ensure that cross-modality predictive coding is indeed learned by the network. We show this pre-training strategy leads to a flexible, simple, and efficient framework with improved transfer results on downstream tasks. In particular, the exact same pre-trained network can be flexibly used whether or not additional information besides RGB images is available, in all configurations yielding results competitive with or significantly better than the baselines. To avoid needing training datasets with multiple modalities and tasks, we train MultiMAE entirely using pseudo-labeling, which makes the framework widely applicable to any RGB dataset. The experiments are performed on multiple transfer tasks (image classification, semantic segmentation, depth estimation) and datasets (ImageNet, ADE20K, Taskonomy, Hypersim, NYUv2). The results show an intriguing capability of the model for cross-modal/task predictive coding and transfer.

* Project page at https://multimae.epfl.ch 
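
The masking step can be illustrated with a simplified sketch (not the released MultiMAE code): patch tokens from all provided modalities are pooled, a random subset is kept visible, and only that subset would be fed to the encoder. The token shapes and the uniform sampling across the pooled tokens are assumptions made for brevity.

```python
# Simplified cross-modal masking sketch: tokens from all modalities are pooled,
# a small random subset is kept visible, and the encoder would only see that
# subset; the decoder later reconstructs the masked tokens of every modality.
import torch

def sample_visible_tokens(tokens_per_modality: dict, num_visible: int):
    """tokens_per_modality: name -> tensor of shape (B, N_m, D)."""
    names = list(tokens_per_modality)
    all_tokens = torch.cat([tokens_per_modality[n] for n in names], dim=1)  # B x N x D
    B, N, D = all_tokens.shape
    # independent random permutation per sample; keep the first num_visible indices
    noise = torch.rand(B, N)
    keep = noise.argsort(dim=1)[:, :num_visible]                 # B x num_visible
    visible = torch.gather(all_tokens, 1, keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, keep

# e.g. RGB and depth patch tokens, keeping roughly 1/6 of them visible
vis, idx = sample_visible_tokens(
    {"rgb": torch.randn(2, 196, 768), "depth": torch.randn(2, 196, 768)}, num_visible=66)
```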

3D Common Corruptions and Data Augmentation

Apr 04, 2022
Oğuzhan Fatih Kar, Teresa Yeo, Andrei Atanov, Amir Zamir

We introduce a set of image transformations that can be used as corruptions to evaluate the robustness of models, as well as data augmentation mechanisms for training neural networks. The primary distinction of the proposed transformations is that, unlike existing approaches such as Common Corruptions, the geometry of the scene is incorporated in the transformations, leading to corruptions that are more likely to occur in the real world. We also introduce a set of semantic corruptions (e.g., natural object occlusions). We show these transformations are "efficient" (they can be computed on-the-fly), "extendable" (they can be applied to most image datasets), expose vulnerabilities of existing models, and can effectively make models more robust when employed as "3D data augmentation" mechanisms. Evaluations on several tasks and datasets suggest that incorporating 3D information into benchmarking and training opens up a promising direction for robustness research.

* CVPR 2022 (Oral). Project website at https://3dcommoncorruptions.epfl.ch/ 
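
To illustrate what incorporating scene geometry can mean in practice, here is a hypothetical depth-aware corruption, loosely in the spirit of fog-style corruptions but not taken from the released benchmark code: attenuation follows per-pixel depth, so distant surfaces fade more than near ones, as they would in the real world.

```python
# Hypothetical geometry-aware corruption: fog whose density depends on metric depth.
import numpy as np

def depth_aware_fog(image: np.ndarray, depth: np.ndarray,
                    density: float = 0.15, fog_color: float = 0.8) -> np.ndarray:
    """image: HxWx3 float array in [0,1]; depth: HxW metric depth in meters."""
    transmittance = np.exp(-density * depth)[..., None]   # Beer-Lambert attenuation
    return transmittance * image + (1.0 - transmittance) * fog_color

# usage: a flat gray image over a depth ramp from 1 m to 20 m
img = np.full((64, 64, 3), 0.5)
dep = np.linspace(1.0, 20.0, 64)[None, :].repeat(64, axis=0)
foggy = depth_aware_fog(img, dep)
```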

CLIPasso: Semantically-Aware Object Sketching

Feb 11, 2022
Yael Vinker, Ehsan Pajouheshgar, Jessica Y. Bo, Roman Christian Bachmann, Amit Haim Bermano, Daniel Cohen-Or, Amir Zamir, Ariel Shamir

Abstraction is at the heart of sketching due to the simple and minimal nature of line drawings. Abstraction entails identifying the essential visual properties of an object or scene, which requires semantic understanding and prior knowledge of high-level concepts. Abstract depictions are therefore challenging for artists, and even more so for machines. We present an object sketching method that can achieve different levels of abstraction, guided by geometric and semantic simplifications. While sketch generation methods often rely on explicit sketch datasets for training, we utilize the remarkable ability of CLIP (Contrastive Language-Image Pre-training) to distill semantic concepts from sketches and images alike. We define a sketch as a set of Bézier curves and use a differentiable rasterizer to optimize the parameters of the curves directly with respect to a CLIP-based perceptual loss. The degree of abstraction is controlled by varying the number of strokes. The generated sketches demonstrate multiple levels of abstraction while maintaining recognizability, underlying structure, and the essential visual components of the subject drawn.

* https://clipasso.github.io/clipasso/ 
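
A high-level sketch of the optimization loop described in the abstract is given below; the differentiable rasterizer and the CLIP image encoder are passed in as callables because both come from external components, and the single cosine-similarity loss is a simplification of the full CLIP-based perceptual objective.

```python
# Simplified sketch of optimizing Bezier control points against a CLIP loss.
# `rasterize` and `clip_encode` are assumed callables, not library APIs.
import torch

def optimize_sketch(target_image, control_points, rasterize, clip_encode,
                    steps=2000, lr=1.0):
    """control_points: (num_strokes, 4, 2) Bezier control points with requires_grad=True.
    rasterize: control_points -> rendered sketch image tensor (differentiable).
    clip_encode: image tensor -> CLIP embedding."""
    with torch.no_grad():
        target_feat = clip_encode(target_image)
    opt = torch.optim.Adam([control_points], lr=lr)
    for _ in range(steps):
        sketch = rasterize(control_points)
        # semantic loss: keep the sketch's CLIP embedding close to the photo's
        loss = 1.0 - torch.nn.functional.cosine_similarity(
            clip_encode(sketch), target_feat, dim=-1).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return control_points
```

In this picture, the abstraction level is set simply by how many strokes (rows of control_points) are optimized.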

Simple Control Baselines for Evaluating Transfer Learning

Feb 07, 2022
Andrei Atanov, Shijian Xu, Onur Beker, Andrei Filatov, Amir Zamir

Transfer learning has witnessed remarkable progress in recent years, for example, with the introduction of augmentation-based contrastive self-supervised learning methods. While a number of large-scale empirical studies on the transfer performance of such models have been conducted, there is not yet an agreed-upon set of control baselines, evaluation practices, and metrics to report, which often hinders a nuanced and calibrated understanding of the real efficacy of these methods. We share an evaluation standard that aims to quantify and communicate transfer learning performance in an informative and accessible setup. This is done by baking a number of simple yet critical control baselines into the evaluation method, particularly the blind-guess (quantifying the dataset bias), scratch-model (quantifying the architectural contribution), and maximal-supervision (quantifying the upper bound) baselines. To demonstrate how the evaluation standard can be employed, we provide an example empirical study investigating a few basic questions about self-supervised learning. For example, using this standard, the study shows that the effectiveness of existing self-supervised pre-training methods is skewed towards image classification tasks versus dense pixel-wise predictions. In general, we encourage using and reporting the suggested control baselines when evaluating transfer learning in order to gain a more meaningful and informative understanding.

* Project website: https://transfer-controls.epfl.ch 
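
As one hypothetical way to operationalize the proposal, the helper below reports a transfer result next to the three control baselines and a normalized gain; the specific normalization is an assumption for illustration, not a metric prescribed by the paper.

```python
# Minimal reporting sketch, assuming accuracy-style metrics (higher = better):
# blind-guess and maximal-supervision bound the achievable range, while the
# scratch model isolates how much the pre-trained representation contributes.
def report_transfer(transfer, blind_guess, scratch, max_supervision):
    """All arguments are scores on the same downstream metric."""
    span = max_supervision - blind_guess
    return {
        "transfer": transfer,
        "blind_guess": blind_guess,              # dataset bias only
        "scratch_model": scratch,                # architecture + downstream data only
        "maximal_supervision": max_supervision,  # practical upper bound
        # fraction of the blind-guess-to-upper-bound gap closed by the method
        "normalized_gain": (transfer - blind_guess) / span if span else float("nan"),
        "gain_over_scratch": transfer - scratch,
    }

print(report_transfer(transfer=0.71, blind_guess=0.25, scratch=0.58, max_supervision=0.86))
```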

Omnidata: A Scalable Pipeline for Making Multi-Task Mid-Level Vision Datasets from 3D Scans

Oct 11, 2021
Ainaz Eftekhar, Alexander Sax, Roman Bachmann, Jitendra Malik, Amir Zamir

This paper introduces a pipeline to parametrically sample and render multi-task vision datasets from comprehensive 3D scans of the real world. Changing the sampling parameters allows one to "steer" the generated datasets to emphasize specific information. In addition to enabling interesting lines of research, we show the tooling and generated data suffice to train robust vision models. Common architectures trained on a generated starter dataset reached state-of-the-art performance on multiple common vision tasks and benchmarks, despite having seen no benchmark or non-pipeline data. The depth estimation network outperforms MiDaS, and the surface normal estimation network is the first to achieve human-level performance for in-the-wild surface normal estimation, at least according to one metric on the OASIS benchmark. The Dockerized pipeline with CLI, the (mostly Python) code, PyTorch dataloaders for the generated data, the generated starter dataset, download scripts, and other utilities are available through our project website, https://omnidata.vision.

* ICCV 2021: See project website https://omnidata.vision 
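
For a feel of how such generated data could be consumed, here is a hypothetical, minimal PyTorch dataset for per-task image folders of the kind the pipeline produces; the directory layout and task names are assumptions, and the actual release ships its own dataloaders.

```python
# Hypothetical loader for paired multi-task annotations (layout is an assumption,
# not the package's own dataloader): one folder per task, one file per sample,
# with matching basenames across tasks.
import os
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class MidLevelVisionDataset(Dataset):
    def __init__(self, root, tasks=("rgb", "depth_zbuffer", "normal")):
        self.root, self.tasks = root, tasks
        self.names = sorted(os.listdir(os.path.join(root, tasks[0])))
        self.to_tensor = transforms.ToTensor()

    def __len__(self):
        return len(self.names)

    def __getitem__(self, i):
        # one file per task with the same basename, e.g. root/normal/point_12.png
        return {t: self.to_tensor(Image.open(os.path.join(self.root, t, self.names[i])))
                for t in self.tasks}

# usage (path is a placeholder):
# loader = torch.utils.data.DataLoader(MidLevelVisionDataset("/path/to/data"), batch_size=8)
```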