Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Alireza Fathi

Improving Image Recognition by Retrieving from Web-Scale Image-Text Data

Apr 11, 2023

Ahmet Iscen, Alireza Fathi, Cordelia Schmid

Abstract:Retrieval augmented models are becoming increasingly popular for computer vision tasks after their recent success in NLP problems. The goal is to enhance the recognition capabilities of the model by retrieving similar examples for the visual input from an external memory set. In this work, we introduce an attention-based memory module, which learns the importance of each retrieved example from the memory. Compared to existing approaches, our method removes the influence of the irrelevant retrieved examples, and retains those that are beneficial to the input query. We also thoroughly study various ways of constructing the memory dataset. Our experiments show the benefit of using a massive-scale memory dataset of 1B image-text pairs, and demonstrate the performance of different memory representations. We evaluate our method in three different classification tasks, namely long-tailed recognition, learning with noisy labels, and fine-grained classification, and show that it achieves state-of-the-art accuracies in ImageNet-LT, Places-LT and Webvision datasets.

* Accepted to CVPR 2023

Via

Access Paper or Ask Questions

Learning Object-Centric Neural Scattering Functions for Free-viewpoint Relighting and Scene Composition

Mar 10, 2023

Hong-Xing Yu, Michelle Guo, Alireza Fathi, Yen-Yu Chang, Eric Ryan Chan, Ruohan Gao, Thomas Funkhouser, Jiajun Wu

Figure 1 for Learning Object-Centric Neural Scattering Functions for Free-viewpoint Relighting and Scene Composition

Figure 2 for Learning Object-Centric Neural Scattering Functions for Free-viewpoint Relighting and Scene Composition

Figure 3 for Learning Object-Centric Neural Scattering Functions for Free-viewpoint Relighting and Scene Composition

Figure 4 for Learning Object-Centric Neural Scattering Functions for Free-viewpoint Relighting and Scene Composition

Abstract:Photorealistic object appearance modeling from 2D images is a constant topic in vision and graphics. While neural implicit methods (such as Neural Radiance Fields) have shown high-fidelity view synthesis results, they cannot relight the captured objects. More recent neural inverse rendering approaches have enabled object relighting, but they represent surface properties as simple BRDFs, and therefore cannot handle translucent objects. We propose Object-Centric Neural Scattering Functions (OSFs) for learning to reconstruct object appearance from only images. OSFs not only support free-viewpoint object relighting, but also can model both opaque and translucent objects. While accurately modeling subsurface light transport for translucent objects can be highly complex and even intractable for neural methods, OSFs learn to approximate the radiance transfer from a distant light to an outgoing direction at any spatial location. This approximation avoids explicitly modeling complex subsurface scattering, making learning a neural implicit model tractable. Experiments on real and synthetic data show that OSFs accurately reconstruct appearances for both opaque and translucent objects, allowing faithful free-viewpoint relighting as well as scene composition.

* Journal extension of arXiv:2012.08503. The first two authors contributed equally to this work

Via

Access Paper or Ask Questions

REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory

Dec 10, 2022

Ziniu Hu, Ahmet Iscen, Chen Sun, Zirui Wang, Kai-Wei Chang, Yizhou Sun, Cordelia Schmid, David A. Ross, Alireza Fathi

Figure 1 for REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory

Figure 2 for REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory

Figure 3 for REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory

Figure 4 for REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory

Abstract:In this paper, we propose an end-to-end Retrieval-Augmented Visual Language Model (REVEAL) that learns to encode world knowledge into a large-scale memory, and to retrieve from it to answer knowledge-intensive queries. REVEAL consists of four key components: the memory, the encoder, the retriever and the generator. The large-scale memory encodes various sources of multimodal world knowledge (e.g. image-text pairs, question answering pairs, knowledge graph triplets, etc) via a unified encoder. The retriever finds the most relevant knowledge entries in the memory, and the generator fuses the retrieved knowledge with the input query to produce the output. A key novelty in our approach is that the memory, encoder, retriever and generator are all pre-trained end-to-end on a massive amount of data. Furthermore, our approach can use a diverse set of multimodal knowledge sources, which is shown to result in significant gains. We show that REVEAL achieves state-of-the-art results on visual question answering and image captioning.

Via

Access Paper or Ask Questions

A Memory Transformer Network for Incremental Learning

Oct 10, 2022

Ahmet Iscen, Thomas Bird, Mathilde Caron, Alireza Fathi, Cordelia Schmid

Figure 1 for A Memory Transformer Network for Incremental Learning

Figure 2 for A Memory Transformer Network for Incremental Learning

Figure 3 for A Memory Transformer Network for Incremental Learning

Figure 4 for A Memory Transformer Network for Incremental Learning

Abstract:We study class-incremental learning, a training setup in which new classes of data are observed over time for the model to learn from. Despite the straightforward problem formulation, the naive application of classification models to class-incremental learning results in the "catastrophic forgetting" of previously seen classes. One of the most successful existing methods has been the use of a memory of exemplars, which overcomes the issue of catastrophic forgetting by saving a subset of past data into a memory bank and utilizing it to prevent forgetting when training future tasks. In our paper, we propose to enhance the utilization of this memory bank: we not only use it as a source of additional training data like existing works but also integrate it in the prediction process explicitly.Our method, the Memory Transformer Network (MTN), learns how to combine and aggregate the information from the nearest neighbors in the memory with a transformer to make more accurate predictions. We conduct extensive experiments and ablations to evaluate our approach. We show that MTN achieves state-of-the-art performance on the challenging ImageNet-1k and Google-Landmarks-1k incremental learning benchmarks.

Via

Access Paper or Ask Questions

im2nerf: Image to Neural Radiance Field in the Wild

Sep 08, 2022

Lu Mi, Abhijit Kundu, David Ross, Frank Dellaert, Noah Snavely, Alireza Fathi

Figure 1 for im2nerf: Image to Neural Radiance Field in the Wild

Figure 2 for im2nerf: Image to Neural Radiance Field in the Wild

Figure 3 for im2nerf: Image to Neural Radiance Field in the Wild

Figure 4 for im2nerf: Image to Neural Radiance Field in the Wild

Abstract:We propose im2nerf, a learning framework that predicts a continuous neural object representation given a single input image in the wild, supervised by only segmentation output from off-the-shelf recognition methods. The standard approach to constructing neural radiance fields takes advantage of multi-view consistency and requires many calibrated views of a scene, a requirement that cannot be satisfied when learning on large-scale image data in the wild. We take a step towards addressing this shortcoming by introducing a model that encodes the input image into a disentangled object representation that contains a code for object shape, a code for object appearance, and an estimated camera pose from which the object image is captured. Our model conditions a NeRF on the predicted object representation and uses volume rendering to generate images from novel views. We train the model end-to-end on a large collection of input images. As the model is only provided with single-view images, the problem is highly under-constrained. Therefore, in addition to using a reconstruction loss on the synthesized input view, we use an auxiliary adversarial loss on the novel rendered views. Furthermore, we leverage object symmetry and cycle camera pose consistency. We conduct extensive quantitative and qualitative experiments on the ShapeNet dataset as well as qualitative experiments on Open Images dataset. We show that in all cases, im2nerf achieves the state-of-the-art performance for novel view synthesis from a single-view unposed image in the wild.

* 12 pages, 8 figures, 4 tables

Via

Access Paper or Ask Questions

Panoptic Neural Fields: A Semantic Object-Aware Neural Scene Representation

May 09, 2022

Abhijit Kundu, Kyle Genova, Xiaoqi Yin, Alireza Fathi, Caroline Pantofaru, Leonidas Guibas, Andrea Tagliasacchi, Frank Dellaert, Thomas Funkhouser

Figure 1 for Panoptic Neural Fields: A Semantic Object-Aware Neural Scene Representation

Figure 2 for Panoptic Neural Fields: A Semantic Object-Aware Neural Scene Representation

Figure 3 for Panoptic Neural Fields: A Semantic Object-Aware Neural Scene Representation

Figure 4 for Panoptic Neural Fields: A Semantic Object-Aware Neural Scene Representation

Abstract:We present Panoptic Neural Fields (PNF), an object-aware neural scene representation that decomposes a scene into a set of objects (things) and background (stuff). Each object is represented by an oriented 3D bounding box and a multi-layer perceptron (MLP) that takes position, direction, and time and outputs density and radiance. The background stuff is represented by a similar MLP that additionally outputs semantic labels. Each object MLPs are instance-specific and thus can be smaller and faster than previous object-aware approaches, while still leveraging category-specific priors incorporated via meta-learned initialization. Our model builds a panoptic radiance field representation of any scene from just color images. We use off-the-shelf algorithms to predict camera poses, object tracks, and 2D image semantic segmentations. Then we jointly optimize the MLP weights and bounding box parameters using analysis-by-synthesis with self-supervision from color images and pseudo-supervision from predicted semantic segmentations. During experiments with real-world dynamic scenes, we find that our model can be used effectively for several tasks like novel view synthesis, 2D panoptic segmentation, 3D scene editing, and multiview depth prediction.

* CVPR 2022 paper. See project page at https://abhijitkundu.info/projects/pnf

Via

Access Paper or Ask Questions

PreTraM: Self-Supervised Pre-training via Connecting Trajectory and Map

Apr 21, 2022

Chenfeng Xu, Tian Li, Chen Tang, Lingfeng Sun, Kurt Keutzer, Masayoshi Tomizuka, Alireza Fathi, Wei Zhan

Figure 1 for PreTraM: Self-Supervised Pre-training via Connecting Trajectory and Map

Figure 2 for PreTraM: Self-Supervised Pre-training via Connecting Trajectory and Map

Figure 3 for PreTraM: Self-Supervised Pre-training via Connecting Trajectory and Map

Figure 4 for PreTraM: Self-Supervised Pre-training via Connecting Trajectory and Map

Abstract:Deep learning has recently achieved significant progress in trajectory forecasting. However, the scarcity of trajectory data inhibits the data-hungry deep-learning models from learning good representations. While mature representation learning methods exist in computer vision and natural language processing, these pre-training methods require large-scale data. It is hard to replicate these approaches in trajectory forecasting due to the lack of adequate trajectory data (e.g., 34K samples in the nuScenes dataset). To work around the scarcity of trajectory data, we resort to another data modality closely related to trajectories-HD-maps, which is abundantly provided in existing datasets. In this paper, we propose PreTraM, a self-supervised pre-training scheme via connecting trajectories and maps for trajectory forecasting. Specifically, PreTraM consists of two parts: 1) Trajectory-Map Contrastive Learning, where we project trajectories and maps to a shared embedding space with cross-modal contrastive learning, and 2) Map Contrastive Learning, where we enhance map representation with contrastive learning on large quantities of HD-maps. On top of popular baselines such as AgentFormer and Trajectron++, PreTraM boosts their performance by 5.5% and 6.9% relatively in FDE-10 on the challenging nuScenes dataset. We show that PreTraM improves data efficiency and scales well with model size.

* The code is available at https://github.com/chenfengxu714/PreTraM.git

Via

Access Paper or Ask Questions

Object-Centric Neural Scene Rendering

Dec 15, 2020

Michelle Guo, Alireza Fathi, Jiajun Wu, Thomas Funkhouser

Figure 1 for Object-Centric Neural Scene Rendering

Figure 2 for Object-Centric Neural Scene Rendering

Figure 3 for Object-Centric Neural Scene Rendering

Figure 4 for Object-Centric Neural Scene Rendering

Abstract:We present a method for composing photorealistic scenes from captured images of objects. Our work builds upon neural radiance fields (NeRFs), which implicitly model the volumetric density and directionally-emitted radiance of a scene. While NeRFs synthesize realistic pictures, they only model static scenes and are closely tied to specific imaging conditions. This property makes NeRFs hard to generalize to new scenarios, including new lighting or new arrangements of objects. Instead of learning a scene radiance field as a NeRF does, we propose to learn object-centric neural scattering functions (OSFs), a representation that models per-object light transport implicitly using a lighting- and view-dependent neural network. This enables rendering scenes even when objects or lights move, without retraining. Combined with a volumetric path tracing procedure, our framework is capable of rendering both intra- and inter-object light transport effects including occlusions, specularities, shadows, and indirect illumination. We evaluate our approach on scene composition and show that it generalizes to novel illumination conditions, producing photorealistic, physically accurate renderings of multi-object scenes.

* Summary Video: https://youtu.be/NtR7xgxSL1U Project Webpage: https://shellguo.com/osf

Via

Access Paper or Ask Questions

Multi-Frame to Single-Frame: Knowledge Distillation for 3D Object Detection

Sep 24, 2020

Yue Wang, Alireza Fathi, Jiajun Wu, Thomas Funkhouser, Justin Solomon

Figure 1 for Multi-Frame to Single-Frame: Knowledge Distillation for 3D Object Detection

Figure 2 for Multi-Frame to Single-Frame: Knowledge Distillation for 3D Object Detection

Figure 3 for Multi-Frame to Single-Frame: Knowledge Distillation for 3D Object Detection

Figure 4 for Multi-Frame to Single-Frame: Knowledge Distillation for 3D Object Detection

Abstract:A common dilemma in 3D object detection for autonomous driving is that high-quality, dense point clouds are only available during training, but not testing. We use knowledge distillation to bridge the gap between a model trained on high-quality inputs at training time and another tested on low-quality inputs at inference time. In particular, we design a two-stage training pipeline for point cloud object detection. First, we train an object detection model on dense point clouds, which are generated from multiple frames using extra information only available at training time. Then, we train the model's identical counterpart on sparse single-frame point clouds with consistency regularization on features from both models. We show that this procedure improves performance on low-quality data during testing, without additional overhead.

* The Workshop on Perception for Autonomous Driving at ECCV2020

Via

Access Paper or Ask Questions

Pillar-based Object Detection for Autonomous Driving

Jul 26, 2020

Yue Wang, Alireza Fathi, Abhijit Kundu, David Ross, Caroline Pantofaru, Thomas Funkhouser, Justin Solomon

Figure 1 for Pillar-based Object Detection for Autonomous Driving

Figure 2 for Pillar-based Object Detection for Autonomous Driving

Figure 3 for Pillar-based Object Detection for Autonomous Driving

Figure 4 for Pillar-based Object Detection for Autonomous Driving

Abstract:We present a simple and flexible object detection framework optimized for autonomous driving. Building on the observation that point clouds in this application are extremely sparse, we propose a practical pillar-based approach to fix the imbalance issue caused by anchors. In particular, our algorithm incorporates a cylindrical projection into multi-view feature learning, predicts bounding box parameters per pillar rather than per point or per anchor, and includes an aligned pillar-to-point projection module to improve the final prediction. Our anchor-free approach avoids hyperparameter search associated with past methods, simplifying 3D object detection while significantly improving upon state-of-the-art.

* Accepted to ECCV2020

Via

Access Paper or Ask Questions