Sagar Vaze

GeneCIS: A Benchmark for General Conditional Image Similarity

Jun 13, 2023
Sagar Vaze, Nicolas Carion, Ishan Misra

We argue that there are many notions of 'similarity' and that models, like humans, should be able to adapt to these dynamically. This contrasts with most representation learning methods, supervised or self-supervised, which learn a fixed embedding function and hence implicitly assume a single notion of similarity. For instance, models trained on ImageNet are biased towards object categories, while a user might prefer the model to focus on colors, textures or specific elements in the scene. In this paper, we propose the GeneCIS ('genesis') benchmark, which measures models' ability to adapt to a range of similarity conditions. Extending prior work, our benchmark is designed for zero-shot evaluation only, and hence considers an open set of similarity conditions. We find that baselines from powerful CLIP models struggle on GeneCIS and that performance on the benchmark is only weakly correlated with ImageNet accuracy, suggesting that simply scaling existing methods is not fruitful. We further propose a simple, scalable solution based on automatically mining information from existing image-caption datasets. We find our method offers a substantial boost over the baselines on GeneCIS, and further improves zero-shot performance on related image retrieval benchmarks. In fact, though evaluated zero-shot, our model surpasses state-of-the-art supervised models on MIT-States. Project page at https://sgvaze.github.io/genecis/.
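
For concreteness, a late-fusion CLIP baseline of the kind evaluated on GeneCIS could be sketched as below, assuming the OpenAI clip package and a precomputed, L2-normalised gallery of image embeddings. The paper's proposed model instead learns to combine the two modalities using triplets mined from image-caption data, so this is only an illustrative sketch of the retrieval protocol.

```python
# Minimal sketch of a CLIP late-fusion baseline for conditional image retrieval.
# Illustrative only: the proposed GeneCIS model trains a combiner on mined triplets.
import torch
import clip  # OpenAI CLIP package, assumed installed
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def conditional_query(ref_image_path, condition_text):
    """Fuse reference-image and condition-text embeddings into one query vector."""
    image = preprocess(Image.open(ref_image_path)).unsqueeze(0).to(device)
    tokens = clip.tokenize([condition_text]).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(image).float()
        txt_feat = model.encode_text(tokens).float()
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    query = img_feat + txt_feat                          # naive late fusion
    return query / query.norm(dim=-1, keepdim=True)

def rank_gallery(query, gallery_feats):
    """gallery_feats: (N, D) L2-normalised CLIP image embeddings of the gallery."""
    return (gallery_feats @ query.T).squeeze(1).argsort(descending=True)
```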

* CVPR 2023 (Highlighted Paper). Project page at https://sgvaze.github.io/genecis/ 

What's in a Name? Beyond Class Indices for Image Recognition

Apr 05, 2023
Kai Han, Yandong Li, Sagar Vaze, Jie Li, Xuhui Jia

Existing machine learning models demonstrate excellent performance in image object recognition after training on a large-scale dataset under full supervision. However, these models only learn to map an image to a predefined class index, without revealing the actual semantic meaning of the object in the image. In contrast, vision-language models like CLIP are able to assign semantic class names to unseen objects in a 'zero-shot' manner, although they still rely on a predefined set of candidate names at test time. In this paper, we reconsider the recognition problem and task a vision-language model to assign class names to images given only a large and essentially unconstrained vocabulary of categories as prior information. We use non-parametric methods to establish relationships between images which allow the model to automatically narrow down the set of possible candidate names. Specifically, we propose iteratively clustering the data and voting on class names within the clusters, showing that this enables a roughly 50% improvement over the baseline on ImageNet. Furthermore, we tackle this problem in both unsupervised and partially supervised settings, and with both coarse-grained and fine-grained search spaces as the unconstrained dictionary.
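
As a rough illustration of the cluster-then-vote idea, the sketch below groups image embeddings with k-means and lets each cluster vote for a name from a candidate vocabulary via image-text similarity. The single clustering pass and the variable names are simplifications for exposition, not the paper's exact pipeline.

```python
# Hedged sketch: cluster images, then let each cluster vote for a class name
# drawn from a large vocabulary, scored by image-text embedding similarity.
import numpy as np
from sklearn.cluster import KMeans

def name_clusters(image_embs, text_embs, vocabulary, n_clusters=50):
    """image_embs: (N, D); text_embs: (V, D) for V candidate names; both L2-normalised."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(image_embs)
    cluster_names = {}
    for c in range(n_clusters):
        members = image_embs[labels == c]      # images assigned to this cluster
        sims = members @ text_embs.T           # (n_c, V) image-name similarities
        votes = sims.argmax(axis=1)            # each image votes for its best name
        winner = np.bincount(votes, minlength=len(vocabulary)).argmax()
        cluster_names[c] = vocabulary[winner]
    return labels, cluster_names
```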


Zero-Shot Category-Level Object Pose Estimation

Apr 07, 2022
Walter Goodwin, Sagar Vaze, Ioannis Havoutis, Ingmar Posner

Object pose estimation is an important component of most vision pipelines for embodied agents, as well as in 3D vision more generally. In this paper we tackle the problem of estimating the pose of novel object categories in a zero-shot manner. This extends much of the existing literature by removing the need for pose-labelled datasets or category-specific CAD models for training or inference. Specifically, we make the following contributions. First, we formalise the zero-shot, category-level pose estimation problem and frame it in a way that is most applicable to real-world embodied agents. Second, we propose a novel method based on semantic correspondences from a self-supervised vision transformer to solve the pose estimation problem. We further re-purpose the recent CO3D dataset to present a controlled and realistic test setting. Finally, we demonstrate that all baselines for our proposed task perform poorly, and show that our method provides a six-fold improvement in average rotation accuracy at 30 degrees. Our code is available at https://github.com/applied-ai-lab/zero-shot-pose.
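
The sketch below illustrates only a generic final geometric step, assuming semantic correspondences have already been lifted to 3D (for example via depth maps): recovering a rotation with the Kabsch algorithm and measuring geodesic rotation error, the quantity behind 'accuracy at 30 degrees' style metrics. It is not the authors' full correspondence pipeline.

```python
# Generic sketch: rotation from matched 3D points (Kabsch) and rotation error in degrees.
import numpy as np

def kabsch_rotation(P, Q):
    """P, Q: (N, 3) matched 3D points; returns R such that R @ p is aligned with q."""
    P_c = P - P.mean(axis=0)
    Q_c = Q - Q.mean(axis=0)
    H = P_c.T @ Q_c                              # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # correct for reflections
    return Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

def rotation_error_deg(R_pred, R_gt):
    """Geodesic distance between two rotations, in degrees."""
    cos = (np.trace(R_pred @ R_gt.T) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
```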

* 29 pages, 5 figures 

Generalized Category Discovery

Jan 07, 2022
Sagar Vaze, Kai Han, Andrea Vedaldi, Andrew Zisserman

In this paper, we consider a highly general image recognition setting wherein, given a labelled and unlabelled set of images, the task is to categorize all images in the unlabelled set. Here, the unlabelled images may come from labelled classes or from novel ones. Existing recognition methods are not able to deal with this setting, because they make several restrictive assumptions, such as the unlabelled instances only coming from known - or unknown - classes and the number of unknown classes being known a priori. We address the more unconstrained setting, naming it 'Generalized Category Discovery', and challenge all these assumptions. We first establish strong baselines by taking state-of-the-art algorithms from novel category discovery and adapting them for this task. Next, we propose the use of vision transformers with contrastive representation learning for this open-world setting. We then introduce a simple yet effective semi-supervised k-means method to cluster the unlabelled data into seen and unseen classes automatically, substantially outperforming the baselines. Finally, we also propose a new approach to estimate the number of classes in the unlabelled data. We thoroughly evaluate our approach on public datasets for generic object classification, including CIFAR10, CIFAR100 and ImageNet-100, and for fine-grained visual recognition, including CUB, Stanford Cars and Herbarium19, benchmarking on this new setting to foster future research.
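
A compact sketch of semi-supervised k-means in this spirit is given below: labelled points keep their ground-truth assignments throughout, while unlabelled points are re-assigned to the nearest centroid at each iteration. Initialisation and the estimation of the number of classes are deliberately simplified here.

```python
# Sketch of semi-supervised k-means: labelled data anchor their class centroids,
# unlabelled data are clustered into seen and unseen classes.
import numpy as np

def ss_kmeans(X_lab, y_lab, X_unlab, n_classes, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    X = np.concatenate([X_lab, X_unlab])
    # Known-class centroids from the labels; remaining centroids from random points.
    centroids = np.stack([
        X_lab[y_lab == c].mean(axis=0) if np.any(y_lab == c)
        else X[rng.integers(len(X))]
        for c in range(n_classes)
    ])
    y_unlab = np.zeros(len(X_unlab), dtype=int)
    for _ in range(n_iters):
        # Assign each unlabelled point to its nearest centroid (seen or unseen class).
        dists = ((X_unlab[:, None, :] - centroids[None]) ** 2).sum(-1)
        y_unlab = dists.argmin(axis=1)
        # Update centroids from labelled plus currently-assigned unlabelled points.
        y_all = np.concatenate([y_lab, y_unlab])
        for c in range(n_classes):
            members = X[y_all == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return y_unlab, centroids
```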

* 13 pages, 6 figures 

Semantically Grounded Object Matching for Robust Robotic Scene Rearrangement

Nov 15, 2021
Walter Goodwin, Sagar Vaze, Ioannis Havoutis, Ingmar Posner

Object rearrangement has recently emerged as a key competency in robot manipulation, with practical solutions generally involving object detection, recognition, grasping and high-level planning. Goal-images describing a desired scene configuration are a promising and increasingly used mode of instruction. A key outstanding challenge is the accurate inference of matches between objects in front of a robot, and those seen in a provided goal image, where recent works have struggled in the absence of object-specific training data. In this work, we explore the deterioration of existing methods' ability to infer matches between objects as the visual shift between observed and goal scenes increases. We find that a fundamental limitation of the current setting is that source and target images must contain the same instance of every object, which restricts practical deployment. We present a novel approach to object matching that uses a large pre-trained vision-language model to match objects in a cross-instance setting by leveraging semantics together with visual features as a more robust, and much more general, measure of similarity. We demonstrate that this provides considerably improved matching performance in cross-instance settings, and can be used to guide multi-object rearrangement with a robot manipulator from an image that shares no object instances with the robot's scene.
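
As a sketch of the core matching step, one could embed each object crop with a pre-trained vision-language model (CLIP here) and solve a one-to-one assignment over cosine similarities. The snippet assumes the OpenAI clip package and SciPy's Hungarian solver, and omits detection, grasping and planning entirely.

```python
# Hedged sketch: cross-instance object matching via CLIP embeddings + Hungarian assignment.
import torch
import clip  # OpenAI CLIP package, assumed installed
from PIL import Image
from scipy.optimize import linear_sum_assignment

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def embed_crops(crops):
    """crops: list of PIL.Image object crops -> (N, D) L2-normalised embeddings."""
    batch = torch.stack([preprocess(c) for c in crops]).to(device)
    with torch.no_grad():
        feats = model.encode_image(batch).float()
    return (feats / feats.norm(dim=-1, keepdim=True)).cpu().numpy()

def match_objects(scene_crops, goal_crops):
    """Return (scene_idx, goal_idx) pairs maximising total semantic similarity."""
    sims = embed_crops(scene_crops) @ embed_crops(goal_crops).T   # cosine similarity matrix
    rows, cols = linear_sum_assignment(-sims)                     # Hungarian, maximising
    return list(zip(rows.tolist(), cols.tolist()))
```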

* 8 pages, 5 figures 

Open-Set Recognition: A Good Closed-Set Classifier is All You Need

Oct 12, 2021
Sagar Vaze, Kai Han, Andrea Vedaldi, Andrew Zisserman

The ability to identify whether or not a test sample belongs to one of the semantic classes in a classifier's training set is critical to practical deployment of the model. This task is termed open-set recognition (OSR) and has received significant attention in recent years. In this paper, we first demonstrate that the ability of a classifier to make the 'none-of-above' decision is highly correlated with its accuracy on the closed-set classes. We find that this relationship holds across loss objectives and architectures, and further demonstrate the trend both on the standard OSR benchmarks as well as on a large-scale ImageNet evaluation. Second, we use this correlation to boost the performance of the cross-entropy OSR 'baseline' by improving its closed-set accuracy, and with this strong baseline achieve a new state-of-the-art on the most challenging OSR benchmark. Similarly, we boost the performance of the existing state-of-the-art method by improving its closed-set accuracy, but this does not surpass the strong baseline on the most challenging dataset. Our third contribution is to reappraise the datasets used for OSR evaluation, and construct new benchmarks which better respect the task of detecting semantic novelty, as opposed to low-level distributional shifts as tackled by neighbouring machine learning fields. In this new setting, we again demonstrate that there is negligible difference between the strong baseline and the existing state-of-the-art.
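
A minimal sketch of the kind of baseline discussed here: score each test sample by its maximum logit (or maximum softmax probability) from the closed-set classifier, and reject low-scoring samples as 'none of the above'. The model and threshold are placeholders; choosing the threshold is application-dependent.

```python
# Sketch of an open-set scoring rule built on a standard closed-set classifier.
import torch
import torch.nn.functional as F

@torch.no_grad()
def open_set_scores(model, images, use_softmax=False):
    """Higher score = more likely to belong to a known (closed-set) class."""
    logits = model(images)                                        # (B, num_known_classes)
    if use_softmax:
        return F.softmax(logits, dim=-1).max(dim=-1).values       # max softmax probability
    return logits.max(dim=-1).values                              # maximum logit score

@torch.no_grad()
def predict_with_reject(model, images, threshold):
    """Closed-set prediction, with -1 marking the 'none of the above' decision."""
    logits = model(images)
    preds = logits.argmax(dim=-1)
    preds[logits.max(dim=-1).values < threshold] = -1
    return preds
```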

* 23 pages, 8 figures 

Optimal Use of Multi-spectral Satellite Data with Convolutional Neural Networks

Sep 15, 2020
Sagar Vaze, James Foley, Mohamed Seddiq, Alexey Unagaev, Natalia Efremova

The analysis of satellite imagery will prove a crucial tool in the pursuit of sustainable development. While Convolutional Neural Networks (CNNs) have made large gains in natural image analysis, their application to multi-spectral satellite images (wherein input images have a large number of channels) remains relatively unexplored. In this paper, we compare different methods of leveraging multi-band information with CNNs, demonstrating the performance of all compared methods on the task of semantic segmentation of agricultural vegetation (vineyards). We show that the standard industry practice of using bands selected by a domain expert leads to significantly worse test accuracy than the other methods compared. Specifically, we compare: using bands specified by an expert; using all available bands; learning attention maps over the input bands; and leveraging Bayesian optimisation to dictate band choice. We show that simply using all available band information already increases test-time performance, and that Bayesian optimisation, applied here to band selection for the first time, can be used to further boost accuracy.
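
One of the compared strategies, learning attention over the input bands, can be sketched in simplified form with a learnable scalar gate per band; the paper's attention maps and segmentation backbone are more elaborate than this stand-in.

```python
# Sketch: learnable per-band attention in front of a placeholder segmentation CNN.
import torch
import torch.nn as nn

class BandAttention(nn.Module):
    """Per-band gates for a multi-spectral input of shape (B, n_bands, H, W)."""
    def __init__(self, n_bands):
        super().__init__()
        self.band_logits = nn.Parameter(torch.zeros(n_bands))

    def forward(self, x):
        weights = torch.softmax(self.band_logits, dim=0)   # attention over bands
        return x * weights.view(1, -1, 1, 1)

class SegmentationNet(nn.Module):
    def __init__(self, n_bands, n_classes=2):
        super().__init__()
        self.attend = BandAttention(n_bands)
        self.backbone = nn.Sequential(                     # stand-in encoder, kept tiny
            nn.Conv2d(n_bands, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, n_classes, 1),
        )

    def forward(self, x):
        return self.backbone(self.attend(x))
```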

* AI for Social Good workshop - Harvard CRCS 

SMArtCast: Predicting soil moisture interpolations into the future using Earth observation data in a deep learning framework

Mar 16, 2020
Conrad James Foley, Sagar Vaze, Mohamed El Amine Seddiq, Alexey Unagaev, Natalia Efremova

Soil moisture is a critical component of crop health, and monitoring it can enable further actions to increase yield or prevent catastrophic die-off. As climate change increases the likelihood of extreme weather events and reduces the predictability of weather, non-optimal soil moisture for crops may become more likely. In this work, we use a series of LSTM architectures to analyze measurements of soil moisture and vegetation indices derived from satellite imagery. The system learns to predict the future values of these measurements. These spatially sparse values and indices are then used as input features to an interpolation method that infers a spatially dense moisture map for a future time point. This has the potential to provide advance warning of soil moisture levels that may be inhospitable to crops across an area with limited monitoring capacity.
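
A minimal sketch, under assumed input shapes, of the forecasting component described above: an LSTM maps a history of per-location measurements (soil moisture plus vegetation indices) to their values at a future time step. The interpolation stage that produces the dense moisture map is omitted.

```python
# Sketch of the per-location forecasting step with an LSTM (shapes are assumptions).
import torch
import torch.nn as nn

class MoistureForecaster(nn.Module):
    def __init__(self, n_features, hidden_size=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, n_features)

    def forward(self, history):
        """history: (batch, time, n_features) -> features predicted at the next time step."""
        out, _ = self.lstm(history)
        return self.head(out[:, -1])       # forecast from the final hidden state

# Example: 8 locations, 12 past time steps, 3 features (moisture + two vegetation indices).
model = MoistureForecaster(n_features=3)
prediction = model(torch.randn(8, 12, 3))  # -> shape (8, 3)
```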

* ICLR 2020  