
Zsolt Kira

LatentDR: Improving Model Generalization Through Sample-Aware Latent Degradation and Restoration

Aug 28, 2023
Ran Liu, Sahil Khose, Jingyun Xiao, Lakshmi Sathidevi, Keerthan Ramnath, Zsolt Kira, Eva L. Dyer

Despite significant advances in deep learning, models often struggle to generalize well to new, unseen domains, especially when training data is limited. To address this challenge, we propose a novel approach for distribution-aware latent augmentation that leverages the relationships across samples to guide the augmentation procedure. Our approach first degrades the samples stochastically in the latent space, mapping them to augmented labels, and then restores the samples from their corrupted versions during training. This process confuses the classifier in the degradation step and restores the overall class distribution of the original samples, promoting diverse intra-class/cross-domain variability. We extensively evaluate our approach on a diverse set of datasets and tasks, including domain generalization benchmarks and medical imaging datasets with strong domain shift, where we show our approach achieves significant improvements over existing methods for latent space augmentation. We further show that our method can be flexibly adapted to long-tail recognition tasks, demonstrating its versatility in building more generalizable models. Code is available at https://github.com/nerdslab/LatentDR.
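
A minimal sketch of the degrade-then-restore idea described in the abstract, assuming vector latents from a PyTorch encoder; the mixing-based degradation, the `restorer` module, and the loss weighting are illustrative assumptions, not the authors' exact implementation (see the released code for that):

```python
import torch
import torch.nn.functional as F

def degrade(z, alpha_max=0.5):
    """Stochastically degrade latents by mixing each sample with another
    randomly chosen sample in the batch (illustrative, sample-aware choice)."""
    idx = torch.randperm(z.size(0), device=z.device)
    alpha = torch.rand(z.size(0), 1, device=z.device) * alpha_max  # per-sample strength
    z_deg = (1 - alpha) * z + alpha * z[idx]
    return z_deg, idx, alpha

def latentdr_step(encoder, restorer, classifier, x, y, lam=1.0):
    """One hypothetical training step: classify degraded latents against
    mixed labels, then restore the degraded latents toward the originals."""
    z = encoder(x)                                  # clean latents, shape (N, D)
    z_deg, idx, alpha = degrade(z)

    # Degradation step: labels are mixed in proportion to the corruption.
    logits = classifier(z_deg)
    a = alpha.squeeze(1)
    loss_cls = ((1 - a) * F.cross_entropy(logits, y, reduction="none")
                + a * F.cross_entropy(logits, y[idx], reduction="none")).mean()

    # Restoration step: map corrupted latents back toward the clean ones.
    loss_rec = F.mse_loss(restorer(z_deg), z.detach())
    return loss_cls + lam * loss_rec
```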

NeO 360: Neural Fields for Sparse View Synthesis of Outdoor Scenes

Aug 24, 2023
Muhammad Zubair Irshad, Sergey Zakharov, Katherine Liu, Vitor Guizilini, Thomas Kollar, Adrien Gaidon, Zsolt Kira, Rares Ambrus

Recent implicit neural representations have shown great results for novel view synthesis. However, existing methods require expensive per-scene optimization from many views, hence limiting their application to real-world unbounded urban settings where the objects of interest or backgrounds are observed from very few views. To mitigate this challenge, we introduce a new approach called NeO 360, Neural fields for sparse view synthesis of outdoor scenes. NeO 360 is a generalizable method that reconstructs 360° scenes from a single or a few posed RGB images. The essence of our approach is in capturing the distribution of complex real-world outdoor 3D scenes and using a hybrid image-conditional triplanar representation that can be queried from any world point. Our representation combines the best of both voxel-based and bird's-eye-view (BEV) representations and is more effective and expressive than each. NeO 360's representation allows us to learn from a large collection of unbounded 3D scenes while offering generalizability to new views and novel scenes from as few as a single image during inference. We demonstrate our approach on the proposed challenging 360° unbounded dataset, called NeRDS 360, and show that NeO 360 outperforms state-of-the-art generalizable methods for novel view synthesis while also offering editing and composition capabilities. Project page: https://zubair-irshad.github.io/projects/neo360.html

* Accepted to International Conference on Computer Vision (ICCV), 2023. Project page: https://zubair-irshad.github.io/projects/neo360.html 
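
As a rough illustration of the image-conditional triplanar representation mentioned above, the sketch below samples features for 3D query points from three axis-aligned feature planes; the tensor shapes, plane names, and the downstream NeRF-style MLP are assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn.functional as F

def query_triplane(planes, xyz):
    """Sample features for 3D points from three axis-aligned feature planes.

    planes: dict with 'xy', 'xz', 'yz' tensors of shape (1, C, H, W)
    xyz:    (N, 3) query points, assumed already normalized to [-1, 1]
    Returns (N, 3*C) features; a hypothetical MLP would map these to
    density/color for volume rendering.
    """
    feats = []
    for axes, plane in (([0, 1], planes["xy"]),
                        ([0, 2], planes["xz"]),
                        ([1, 2], planes["yz"])):
        coords = xyz[:, axes].view(1, -1, 1, 2)                # (1, N, 1, 2) sampling grid
        f = F.grid_sample(plane, coords, align_corners=True)   # (1, C, N, 1)
        feats.append(f.squeeze(0).squeeze(-1).t())             # (N, C)
    return torch.cat(feats, dim=-1)
```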

Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion

Aug 23, 2023
Junjiao Tian, Lavisha Aggarwal, Andrea Colaco, Zsolt Kira, Mar Gonzalez-Franco

Producing quality segmentation masks for images is a fundamental problem in computer vision. Recent research has explored large-scale supervised training to enable zero-shot segmentation on virtually any image style and unsupervised training to enable segmentation without dense annotations. However, constructing a model capable of segmenting anything in a zero-shot manner without any annotations is still challenging. In this paper, we propose to utilize the self-attention layers in stable diffusion models to achieve this goal because the pre-trained stable diffusion model has learned inherent concepts of objects within its attention layers. Specifically, we introduce a simple yet effective iterative merging process based on measuring KL divergence among attention maps to merge them into valid segmentation masks. The proposed method does not require any training or language dependency to extract quality segmentation masks for any image. On COCO-Stuff-27, our method surpasses the prior unsupervised zero-shot SOTA method by an absolute 26% in pixel accuracy and 17% in mean IoU.
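
A toy sketch of the iterative merging step described above, assuming the attention maps have already been extracted from the diffusion model's self-attention layers; the greedy pairwise merging rule and the stopping threshold are illustrative stand-ins for the paper's procedure:

```python
import torch

def kl(p, q, eps=1e-8):
    """KL(p || q) between two attention maps flattened into distributions."""
    p = p.flatten() + eps
    q = q.flatten() + eps
    p, q = p / p.sum(), q / q.sum()
    return (p * (p / q).log()).sum()

def merge_attention_maps(maps, threshold=0.5):
    """Greedily merge the pair of maps with the smallest symmetric KL
    divergence until no pair is closer than `threshold`."""
    maps = [m.clone() for m in maps]
    while len(maps) > 1:
        best, pair = None, None
        for i in range(len(maps)):
            for j in range(i + 1, len(maps)):
                d = 0.5 * (kl(maps[i], maps[j]) + kl(maps[j], maps[i]))
                if best is None or d < best:
                    best, pair = d, (i, j)
        if best > threshold:
            break
        i, j = pair
        maps[i] = 0.5 * (maps[i] + maps[j])   # merged attention map
        maps.pop(j)
    return maps  # each surviving map becomes one segmentation mask proposal
```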

HomeRobot: Open-Vocabulary Mobile Manipulation

Jun 20, 2023
Sriram Yenamandra, Arun Ramachandran, Karmesh Yadav, Austin Wang, Mukul Khanna, Theophile Gervet, Tsung-Yen Yang, Vidhi Jain, Alexander William Clegg, John Turner, Zsolt Kira, Manolis Savva, Angel Chang, Devendra Singh Chaplot, Dhruv Batra, Roozbeh Mottaghi, Yonatan Bisk, Chris Paxton

HomeRobot (noun): An affordable compliant robot that navigates homes and manipulates a wide range of objects in order to complete everyday tasks. Open-Vocabulary Mobile Manipulation (OVMM) is the problem of picking any object in any unseen environment, and placing it in a commanded location. This is a foundational challenge for robots to be useful assistants in human environments, because it involves tackling sub-problems from across robotics: perception, language understanding, navigation, and manipulation are all essential to OVMM. In addition, integration of the solutions to these sub-problems poses its own substantial challenges. To drive research in this area, we introduce the HomeRobot OVMM benchmark, where an agent navigates household environments to grasp novel objects and place them on target receptacles. HomeRobot has two components: a simulation component, which uses a large and diverse curated object set in new, high-quality multi-room home environments; and a real-world component, providing a software stack for the low-cost Hello Robot Stretch to encourage replication of real-world experiments across labs. We implement both reinforcement learning and heuristic (model-based) baselines and show evidence of sim-to-real transfer. Our baselines achieve a 20% success rate in the real world; our experiments identify ways in which future research can improve performance. See videos on our website: https://ovmm.github.io/.

* 35 pages, 20 figures, 8 tables 

HePCo: Data-Free Heterogeneous Prompt Consolidation for Continual Federated Learning

Jun 16, 2023
Shaunak Halbe, James Seale Smith, Junjiao Tian, Zsolt Kira

In this paper, we focus on the important yet understudied problem of Continual Federated Learning (CFL), where a server communicates with a set of clients to incrementally learn new concepts over time without sharing or storing any data. The complexity of this problem is compounded by challenges from both the Continual and Federated Learning perspectives. Specifically, models trained in a CFL setup suffer from catastrophic forgetting, which is exacerbated by data heterogeneity across clients. Existing attempts at this problem tend to impose large overheads on clients and communication channels or require access to stored data, which renders them unsuitable for real-world use due to privacy concerns. In this paper, we attempt to tackle forgetting and heterogeneity while minimizing overhead costs and without requiring access to any stored data. We achieve this by leveraging a prompting-based approach (such that only prompts and classifier heads have to be communicated) and by proposing a novel and lightweight generation and distillation scheme to consolidate client models at the server. We formulate this problem for image classification, establish strong baselines for comparison, and conduct experiments on CIFAR-100 as well as challenging, large-scale datasets such as ImageNet-R and DomainNet. Our approach outperforms both existing methods and our own baselines by as much as 7% while significantly reducing communication and client-level computation costs.
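
The communication pattern implied by the abstract, sketched below under the assumption that `prompts` and `head` are small trainable PyTorch modules attached to a frozen, prompt-conditioned backbone; the server-side consolidation shown is plain parameter averaging, with the paper's data-free generation-and-distillation step only indicated in a comment:

```python
import copy
import torch
import torch.nn.functional as F

def client_update(backbone, prompts, head, loader, epochs=1, lr=1e-3):
    """Hypothetical client step: only `prompts` and `head` are trained and
    later communicated; the frozen backbone never leaves the client."""
    opt = torch.optim.Adam(list(prompts.parameters()) + list(head.parameters()), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            loss = F.cross_entropy(head(backbone(x, prompts)), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return prompts.state_dict(), head.state_dict()

def server_consolidate(prompt_states, head_states):
    """Average the lightweight parameters received from clients; the paper's
    data-free generation + distillation would refine this consolidation."""
    def average(states):
        avg = copy.deepcopy(states[0])
        for k in avg:
            avg[k] = torch.stack([s[k].float() for s in states]).mean(0)
        return avg
    return average(prompt_states), average(head_states)
```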

Adaptive Coordination in Social Embodied Rearrangement

May 31, 2023
Andrew Szot, Unnat Jain, Dhruv Batra, Zsolt Kira, Ruta Desai, Akshara Rai

We present the task of "Social Rearrangement", consisting of cooperative everyday tasks like setting up the dinner table, tidying a house or unpacking groceries in a simulated multi-agent environment. In Social Rearrangement, two robots coordinate to complete a long-horizon task, using onboard sensing and egocentric observations, and no privileged information about the environment. We study zero-shot coordination (ZSC) in this task, where an agent collaborates with a new partner, emulating a scenario where a robot collaborates with a new human partner. Prior ZSC approaches struggle to generalize in our complex and visually rich setting, and on further analysis, we find that they fail to generate diverse coordination behaviors at training time. To counter this, we propose Behavior Diversity Play (BDP), a novel ZSC approach that encourages diversity through a discriminability objective. Our results demonstrate that BDP learns adaptive agents that can tackle visual coordination, and zero-shot generalize to new partners in unseen environments, achieving 35% higher success and 32% higher efficiency compared to baselines.
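
One plausible reading of the discriminability objective, sketched as a partner-identification bonus: a classifier tries to tell which training partner produced a trajectory, and each partner is rewarded for being identifiable, which pushes partners toward distinct behaviors. The discriminator architecture, trajectory features, and how the bonus mixes with the task reward are assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def diversity_bonus(discriminator, trajectory_feats, partner_id):
    """Illustrative discriminability reward: the log-probability the
    classifier assigns to the partner that actually generated the trajectory.
    Higher values mean the partner's behavior is more distinguishable."""
    logits = discriminator(trajectory_feats)                # (N, num_partners)
    logp = F.log_softmax(logits, dim=-1)
    return logp[torch.arange(logits.size(0)), partner_id]   # (N,) bonus added to the task reward

def discriminator_loss(discriminator, trajectory_feats, partner_id):
    """The discriminator itself is trained to identify which partner acted."""
    return F.cross_entropy(discriminator(trajectory_feats), partner_id)
```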

HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning

May 25, 2023
Chia-Wen Kuo, Zsolt Kira

A great deal of progress has been made in image captioning, driven by research into how to encode the image using pre-trained models. This includes visual encodings (e.g. image grid features or detected objects) and more recently textual encodings (e.g. image tags or text descriptions of image regions). As more advanced encodings are available and incorporated, it is natural to ask: how to efficiently and effectively leverage the heterogeneous set of encodings? In this paper, we propose to regard the encodings as augmented views of the input image. The image captioning model efficiently encodes each view independently with a shared encoder, and a contrastive loss is incorporated across the encoded views in a novel way to improve their representation quality and the model's data efficiency. Our proposed hierarchical decoder then adaptively weighs the encoded views according to their effectiveness for caption generation by first aggregating within each view at the token level, and then across views at the view level. We demonstrate significant performance improvements of +5.6% CIDEr on MS-COCO and +12.9% CIDEr on Flickr30k compared to the state of the art, and conduct rigorous analyses to demonstrate the importance of each part of our design.

* Paper accepted in CVPR-23; Project page and code available here: https://sites.google.com/view/chiawen-kuo/home/haav 
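
A compact sketch of the two-level (token-level, then view-level) aggregation the abstract describes, using attention-style pooling as an illustrative choice; the dimensions and module names are assumed, and the shared view encoder and contrastive loss are omitted:

```python
import torch
import torch.nn as nn

class HierarchicalAggregator(nn.Module):
    """Toy two-level aggregation: attention-pool tokens within each view,
    then softmax-weight the pooled views (dimensions are illustrative)."""
    def __init__(self, dim):
        super().__init__()
        self.token_score = nn.Linear(dim, 1)   # within-view token weights
        self.view_score = nn.Linear(dim, 1)    # across-view weights

    def forward(self, views):
        # views: list of (N, T_i, D) token sequences, one per augmented view
        pooled = []
        for v in views:
            w = self.token_score(v).softmax(dim=1)   # (N, T_i, 1)
            pooled.append((w * v).sum(dim=1))        # (N, D) per-view summary
        pooled = torch.stack(pooled, dim=1)          # (N, V, D)
        w = self.view_score(pooled).softmax(dim=1)   # (N, V, 1)
        return (w * pooled).sum(dim=1)               # (N, D) context for the caption decoder
```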

Training Energy-Based Normalizing Flow with Score-Matching Objectives

May 24, 2023
Chen-Hao Chao, Wei-Fang Sun, Yen-Chang Hsu, Zsolt Kira, Chun-Yi Lee

In this paper, we establish a connection between the parameterization of flow-based and energy-based generative models, and present a new flow-based modeling approach called energy-based normalizing flow (EBFlow). We demonstrate that by optimizing EBFlow with score-matching objectives, the computation of Jacobian determinants for linear transformations can be entirely bypassed. This feature enables the use of arbitrary linear layers in the construction of flow-based models without increasing the computational time complexity of each training iteration from $\mathcal{O}(D^2L)$ to $\mathcal{O}(D^3L)$ for an $L$-layered model that accepts $D$-dimensional inputs. This makes the training of EBFlow more efficient than the commonly adopted maximum likelihood training method. In addition to the reduction in runtime, we enhance the training stability and empirical performance of EBFlow through a number of techniques developed based on our analysis of score-matching methods. The experimental results demonstrate that our approach achieves a significant speedup compared to maximum likelihood estimation, while outperforming prior efficient training techniques by a noticeable margin in terms of negative log-likelihood (NLL).
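
The efficiency argument hinges on training with a score-matching objective instead of maximum likelihood, so no Jacobian determinant is ever evaluated. Below is a generic denoising score-matching loss for an energy function, as a hedged illustration of that principle rather than the paper's specific objective:

```python
import torch

def dsm_loss(energy_fn, x, sigma=0.1):
    """Denoising score matching: match the model score -grad_x E(x) to the
    score of the Gaussian-smoothed data distribution. Only gradients of the
    energy are needed, so no Jacobian determinant is computed (sketch only;
    assumes x has shape (N, D))."""
    noise = torch.randn_like(x) * sigma
    x_noisy = (x + noise).requires_grad_(True)
    energy = energy_fn(x_noisy).sum()
    score = -torch.autograd.grad(energy, x_noisy, create_graph=True)[0]
    target = -noise / sigma**2                  # score of q(x_noisy | x)
    return 0.5 * ((score - target) ** 2).sum(dim=-1).mean()
```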

CLIP-GCD: Simple Language Guided Generalized Category Discovery

May 17, 2023
Rabah Ouldnoughi, Chia-Wen Kuo, Zsolt Kira

Generalized Category Discovery (GCD) requires a model to both classify known categories and cluster unknown categories in unlabeled data. Prior methods leveraged self-supervised pre-training combined with supervised fine-tuning on the labeled data, followed by simple clustering methods. In this paper, we posit that such methods are still prone to poor performance on out-of-distribution categories, and do not leverage a key ingredient: semantic relationships between object categories. We therefore propose to leverage multi-modal (vision and language) models, in two complementary ways. First, we establish a strong baseline by replacing uni-modal features with CLIP, inspired by its zero-shot performance. Second, we propose a novel retrieval-based mechanism that leverages CLIP's aligned vision-language representations by mining text descriptions from a text corpus for the labeled and unlabeled sets. We specifically use the alignment between CLIP's visual encoding of the image and textual encoding of the corpus to retrieve top-k relevant pieces of text and incorporate their embeddings to perform joint image+text semi-supervised clustering. We perform rigorous experimentation and ablations (including on where to retrieve from, how much to retrieve, and how to combine information), and validate our results on several datasets including out-of-distribution domains, demonstrating state-of-the-art results.
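
A minimal sketch of the retrieval-and-fusion step, assuming CLIP image embeddings and corpus text embeddings have already been computed; averaging the top-k retrieved text embeddings and concatenating them with the image embedding is an illustrative fusion choice, not necessarily the paper's exact combination rule:

```python
import torch
import torch.nn.functional as F

def retrieve_and_fuse(image_embs, corpus_embs, k=5):
    """Given CLIP image embeddings (N, D) and corpus text embeddings (M, D),
    retrieve the top-k texts per image by cosine similarity and average their
    embeddings into a fused image+text feature for clustering (sketch)."""
    image_embs = F.normalize(image_embs, dim=-1)
    corpus_embs = F.normalize(corpus_embs, dim=-1)
    sims = image_embs @ corpus_embs.t()                 # (N, M) cosine similarities
    topk = sims.topk(k, dim=-1).indices                 # (N, k) retrieved text indices
    text_feat = corpus_embs[topk].mean(dim=1)           # (N, D) averaged text embedding
    return torch.cat([image_embs, text_feat], dim=-1)   # (N, 2D) feature for semi-supervised clustering
```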

We Need to Talk: Identifying and Overcoming Communication-Critical Scenarios for Self-Driving

May 07, 2023
Nathaniel Moore Glaser, Zsolt Kira

In this work, we consider the task of collision-free trajectory planning for connected self-driving vehicles. We specifically consider communication-critical situations: situations where single-agent systems have blind spots that require multi-agent collaboration. To identify such situations, we propose a method which (1) simulates multi-agent perspectives from real self-driving datasets, (2) finds scenarios that are challenging for isolated agents, and (3) augments scenarios with adversarial obstructions. To overcome these challenges, we propose to extend costmap-based trajectory evaluation to a distributed multi-agent setting. We demonstrate that our bandwidth-efficient, uncertainty-aware method reduces collision rates by up to 62.5% compared to single-agent baselines.

* Submitted to ICRA 2023 Workshop on Collaborative Perception 
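
A crude sketch of distributed costmap-based trajectory evaluation: each agent contributes a local costmap with a confidence map, the maps are fused, and candidate trajectories are scored against the fused map. The fusion rule, confidence weighting, and grid resolution below are assumptions, not the paper's method:

```python
import numpy as np

def fuse_costmaps(costmaps, confidences):
    """Confidence-weighted fusion of per-agent costmaps of shape (H, W); a
    simple stand-in for bandwidth-efficient, uncertainty-aware sharing."""
    costmaps = np.stack(costmaps)            # (A, H, W)
    confidences = np.stack(confidences)      # (A, H, W)
    w = confidences / (confidences.sum(axis=0, keepdims=True) + 1e-6)
    return (w * costmaps).sum(axis=0)        # (H, W) fused costmap

def score_trajectory(fused_map, trajectory, resolution=0.1):
    """Sum the fused cost along a candidate trajectory given as (T, 2) x-y
    waypoints in metres; lower scores indicate safer trajectories."""
    cells = (trajectory / resolution).astype(int)
    cells = np.clip(cells, 0, np.array(fused_map.shape)[::-1] - 1)
    return float(fused_map[cells[:, 1], cells[:, 0]].sum())
```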