
Deva Ramanan


Language Models as Black-Box Optimizers for Vision-Language Models

Sep 12, 2023
Samuel Yu, Shihong Liu, Zhiqiu Lin, Deepak Pathak, Deva Ramanan

Vision-language models (VLMs) pre-trained on web-scale datasets have demonstrated remarkable capabilities across a variety of vision and multimodal tasks. Currently, fine-tuning methods for VLMs mainly operate in a white-box setting, requiring access to model parameters for backpropagation. However, many VLMs rely on proprietary data and are not open-source, which restricts the use of white-box approaches for fine-tuning. Given that popular private large language models (LLMs) like ChatGPT still offer a language-based user interface, we aim to develop a novel fine-tuning approach for VLMs through natural language prompts, thereby avoiding the need to access model parameters, feature embeddings, or output logits. In this setup, we propose employing chat-based LLMs as black-box optimizers to search for the best text prompt on the illustrative task of few-shot image classification using CLIP. Specifically, we adopt an automatic "hill-climbing" procedure that converges on an effective prompt by evaluating the accuracy of current prompts and asking LLMs to refine them based on textual feedback, all within a conversational process without a human in the loop. In a challenging 1-shot learning setup, our simple approach surpasses the white-box continuous prompting method CoOp by an average of 1.5% across 11 datasets including ImageNet. Our approach also outperforms OpenAI's manually crafted prompts and is more efficient than other black-box methods like iterative APE. Additionally, we highlight the advantage of conversational feedback that incorporates both positive and negative prompts, suggesting that LLMs can utilize the implicit "gradient" direction in textual feedback for a more efficient search. Lastly, we find that the text prompts generated through our strategy are not only more interpretable but also transfer well across different CLIP architectures in a black-box manner.
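A minimal sketch of the conversational hill-climbing loop described above, assuming a hypothetical `chat_llm()` that proposes refined prompts and a `few_shot_accuracy()` evaluator over CLIP; both are placeholders, not the paper's released interface.

```python
# Hypothetical black-box prompt search via LLM "hill climbing".
# chat_llm() and few_shot_accuracy() are placeholder stand-ins, not the
# authors' actual implementation.
import random

def few_shot_accuracy(prompt: str) -> float:
    """Stand-in evaluator: score a prompt template on a few-shot CLIP task."""
    random.seed(hash(prompt) % (2**32))
    return random.uniform(0.4, 0.8)

def chat_llm(feedback: str) -> list[str]:
    """Stand-in for a chat LLM that proposes refined prompt templates."""
    return [f"A photo of a {{}}, variant {i}." for i in range(3)]

def hill_climb(seed_prompts, iterations=5, keep=5):
    scored = [(few_shot_accuracy(p), p) for p in seed_prompts]
    for _ in range(iterations):
        scored.sort(reverse=True)
        best, worst = scored[:keep], scored[-keep:]
        # Textual "gradient": show the LLM both positive and negative prompts.
        feedback = (
            "High-accuracy prompts:\n" + "\n".join(p for _, p in best) +
            "\nLow-accuracy prompts:\n" + "\n".join(p for _, p in worst) +
            "\nPropose improved prompt templates."
        )
        for candidate in chat_llm(feedback):
            scored.append((few_shot_accuracy(candidate), candidate))
    return max(scored)[1]

print(hill_climb(["A photo of a {}."]))
```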


Dynamic 3D Gaussians: Tracking by Persistent Dynamic View Synthesis

Aug 18, 2023
Jonathon Luiten, Georgios Kopanas, Bastian Leibe, Deva Ramanan


We present a method that simultaneously addresses the tasks of dynamic scene novel-view synthesis and six degree-of-freedom (6-DOF) tracking of all dense scene elements. We follow an analysis-by-synthesis framework, inspired by recent work that models scenes as a collection of 3D Gaussians which are optimized to reconstruct input images via differentiable rendering. To model dynamic scenes, we allow Gaussians to move and rotate over time while enforcing that they have persistent color, opacity, and size. By regularizing the Gaussians' motion and rotation with local-rigidity constraints, we show that our Dynamic 3D Gaussians correctly model the same area of physical space over time, including the rotation of that space. Dense 6-DOF tracking and dynamic reconstruction emerge naturally from persistent dynamic view synthesis, without requiring any correspondence or flow as input. We demonstrate a large number of downstream applications enabled by our representation, including first-person view synthesis, dynamic compositional scene synthesis, and 4D video editing.
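A minimal sketch of a local-rigidity regularizer of the kind described above, assuming per-timestep Gaussian centers and rotations are available as tensors; the neighbor selection and loss weighting are illustrative, not the paper's exact formulation.

```python
# Illustrative local-rigidity loss for dynamic 3D Gaussians: each Gaussian's
# offsets to its neighbors, expressed in its own local frame, should stay
# constant from one timestep to the next.
import torch

def local_rigidity_loss(centers_t, centers_t1, rot_t, rot_t1, k=8):
    """centers_*: (N, 3) Gaussian centers at times t and t+1.
    rot_*: (N, 3, 3) rotation matrices at times t and t+1."""
    # k-nearest neighbors at time t (excluding self).
    dists = torch.cdist(centers_t, centers_t)
    knn = dists.topk(k + 1, largest=False).indices[:, 1:]      # (N, k)

    offsets_t = centers_t[knn] - centers_t[:, None, :]          # (N, k, 3)
    offsets_t1 = centers_t1[knn] - centers_t1[:, None, :]       # (N, k, 3)

    # Express offsets in each Gaussian's local frame: R^T @ offset.
    local_t = torch.einsum('nij,nki->nkj', rot_t, offsets_t)
    local_t1 = torch.einsum('nij,nki->nkj', rot_t1, offsets_t1)

    return (local_t - local_t1).norm(dim=-1).mean()

N = 100
loss = local_rigidity_loss(torch.randn(N, 3), torch.randn(N, 3),
                           torch.eye(3).expand(N, 3, 3),
                           torch.eye(3).expand(N, 3, 3))
print(loss)
```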


Learning Lightweight Object Detectors via Multi-Teacher Progressive Distillation

Aug 17, 2023
Shengcao Cao, Mengtian Li, James Hays, Deva Ramanan, Yu-Xiong Wang, Liang-Yan Gui


Resource-constrained perception systems such as edge computing and vision-for-robotics require vision models to be both accurate and lightweight in computation and memory usage. While knowledge distillation is a proven strategy to enhance the performance of lightweight classification models, its application to structured outputs like object detection and instance segmentation remains complicated, due to the variability in outputs and the complex internal network modules involved in the distillation process. In this paper, we propose a simple yet surprisingly effective sequential approach to knowledge distillation that progressively transfers the knowledge of a set of teacher detectors to a given lightweight student. To distill knowledge from a highly accurate but complex teacher model, we construct a sequence of teachers to help the student gradually adapt. Our progressive strategy can be easily combined with existing detection distillation mechanisms to consistently maximize student performance in various settings. To the best of our knowledge, we are the first to successfully distill knowledge from Transformer-based teacher detectors to convolution-based students, boosting the performance of a ResNet-50 based RetinaNet from 36.5% to 42.0% AP and of Mask R-CNN from 38.2% to 42.5% AP on the MS COCO benchmark.
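A minimal sketch of the progressive, teacher-by-teacher distillation schedule, assuming a generic output-matching loss; the models, data, and loss below are toy placeholders rather than the paper's detectors or its specific distillation mechanism.

```python
# Illustrative progressive distillation: the student is distilled against a
# sequence of increasingly strong teachers rather than the strongest teacher
# directly. All models and losses are toy stand-ins.
import torch
import torch.nn as nn

def distill_one_stage(student, teacher, data, epochs=1, lr=1e-3):
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    teacher.eval()
    for _ in range(epochs):
        for x in data:
            with torch.no_grad():
                target = teacher(x)                 # teacher outputs as targets
            loss = nn.functional.mse_loss(student(x), target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student

# Toy stand-ins: teachers ordered from weakest to strongest.
teachers = [nn.Linear(16, 4), nn.Linear(16, 4), nn.Linear(16, 4)]
student = nn.Linear(16, 4)
data = [torch.randn(8, 16) for _ in range(10)]

for stage, teacher in enumerate(teachers):
    student = distill_one_stage(student, teacher, data)
    print(f"finished distillation stage {stage}")
```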

* ICML 2023 

An Empirical Analysis of Range for 3D Object Detection

Aug 08, 2023
Neehar Peri, Mengtian Li, Benjamin Wilson, Yu-Xiong Wang, James Hays, Deva Ramanan


LiDAR-based 3D detection plays a vital role in autonomous navigation. Surprisingly, although autonomous vehicles (AVs) must detect both near-field objects (for collision avoidance) and far-field objects (for longer-term planning), contemporary benchmarks focus only on near-field 3D detection. In this paper, we present an empirical analysis of far-field 3D detection using the long-range detection dataset Argoverse 2.0 to better understand the problem, and share the following insight: near-field LiDAR measurements are dense and optimally encoded by small voxels, while far-field measurements are sparse and are better encoded with large voxels. We exploit this observation to build a collection of range experts tuned for near-vs-far field detection, and propose simple techniques to efficiently ensemble models for long-range detection that improve efficiency by 33% and boost accuracy by 3.2% CDS.
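A minimal sketch of range-based ensembling of near/far experts, assuming each expert returns boxes with ego-frame (x, y, z) centers plus a score; the range threshold and box layout are assumptions, not the paper's configuration.

```python
# Illustrative near/far range ensembling: keep the near-range expert's
# detections inside a range threshold and the far-range expert's outside it.
import numpy as np

def ensemble_by_range(near_boxes, far_boxes, range_threshold=50.0):
    """Each *_boxes array has rows [x, y, z, score] in ego coordinates."""
    def ego_range(boxes):
        return np.linalg.norm(boxes[:, :2], axis=1)  # planar distance to ego

    near_kept = near_boxes[ego_range(near_boxes) < range_threshold]
    far_kept = far_boxes[ego_range(far_boxes) >= range_threshold]
    return np.concatenate([near_kept, far_kept], axis=0)

near = np.array([[10.0, 5.0, 0.0, 0.9], [80.0, 0.0, 0.0, 0.4]])
far = np.array([[12.0, 3.0, 0.0, 0.5], [90.0, 10.0, 0.0, 0.8]])
print(ensemble_by_range(near, far))
```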

* Accepted to ICCV 2023 Workshop - Robustness and Reliability of Autonomous Vehicles in the Open-World 

Thinking Like an Annotator: Generation of Dataset Labeling Instructions

Jun 24, 2023
Nadine Chang, Francesco Ferroni, Michael J. Tarr, Martial Hebert, Deva Ramanan

Large-scale datasets are essential to modern-day deep learning. Advocates argue that understanding these methods requires dataset transparency (e.g., "dataset curation, motivation, composition, collection process, etc."). However, almost no one has suggested releasing the detailed definitions and visual category examples provided to annotators - information critical to understanding the structure of the annotations present in each dataset. These labels are at the heart of public datasets, yet few datasets include the instructions that were used to generate them. We introduce a new task, Labeling Instruction Generation, to address missing publicly available labeling instructions. In Labeling Instruction Generation, we take a reasonably annotated dataset and: 1) generate a set of examples that are visually representative of each category in the dataset; 2) provide a text label that corresponds to each of the examples. We introduce a framework that requires no model training to solve this task and includes a newly created rapid retrieval system that leverages a large, pre-trained vision and language model. This framework acts as a proxy for human annotators and can help to both generate a final labeling instruction set and evaluate its quality. Our framework generates multiple diverse visual and text representations of dataset categories. The optimized instruction set outperforms our strongest baseline across 5 folds by 7.06 mAP for NuImages and 12.9 mAP for COCO.
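A minimal sketch of one way to retrieve visually representative exemplars per category using embeddings from a pre-trained vision-language model; the embedding source and selection rule are assumptions, not the paper's retrieval system.

```python
# Illustrative exemplar retrieval: for each category, rank dataset images by
# cosine similarity between precomputed image embeddings and a text embedding
# of the category name. Embeddings are random placeholders here.
import numpy as np

def top_exemplars(image_embeds, text_embed, k=5):
    """image_embeds: (N, D) array; text_embed: (D,) array."""
    img = image_embeds / np.linalg.norm(image_embeds, axis=1, keepdims=True)
    txt = text_embed / np.linalg.norm(text_embed)
    sims = img @ txt
    return np.argsort(-sims)[:k]          # indices of the k closest images

rng = np.random.default_rng(0)
image_embeds = rng.normal(size=(1000, 512))   # stand-in for CLIP-style features
category_embed = rng.normal(size=512)
print(top_exemplars(image_embeds, category_embed))
```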


VisualGPTScore: Visio-Linguistic Reasoning with Multimodal Generative Pre-Training Scores

Jun 02, 2023
Zhiqiu Lin, Xinyue Chen, Deepak Pathak, Pengchuan Zhang, Deva Ramanan


Vision-language models (VLMs) discriminatively pre-trained with contrastive image-text matching losses such as $P(\text{match}|\text{text}, \text{image})$ have been criticized for lacking compositional understanding. This means they might output similar scores even if the original caption is rearranged into a different semantic statement. To address this, we propose to use the ${\bf V}$isual ${\bf G}$enerative ${\bf P}$re-${\bf T}$raining Score (${\bf VisualGPTScore}$) of $P(\text{text}|\text{image})$, a $\textit{multimodal generative}$ score that captures the likelihood of a text caption conditioned on an image using an image-conditioned language model. Contrary to the belief that VLMs are mere bag-of-words models, our off-the-shelf VisualGPTScore demonstrates top-tier performance on recently proposed image-text retrieval benchmarks like ARO and Crepe that assess compositional reasoning. Furthermore, we factorize VisualGPTScore into a product of the $\textit{marginal}$ P(text) and the $\textit{Pointwise Mutual Information}$ (PMI). This helps to (a) diagnose datasets with strong language bias, and (b) debias results on other benchmarks like Winoground using an information-theoretic framework. VisualGPTScore provides valuable insights and serves as a strong baseline for future evaluation of visio-linguistic compositionality.
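A minimal sketch of the factorization in log space, log P(text|image) = log P(text) + PMI(text, image), using placeholder log-probabilities rather than actual model outputs.

```python
# Illustrative decomposition of VisualGPTScore in log space:
#   log P(text | image) = log P(text) + PMI(text, image).
# The log-probabilities below are placeholders, not model outputs.
import math

def decompose(log_p_text_given_image: float, log_p_text: float):
    pmi = log_p_text_given_image - log_p_text
    return {"log_marginal": log_p_text,
            "pmi": pmi,
            "reconstructed": log_p_text + pmi}

# Two candidate captions for the same image: the second is less likely a
# priori (lower marginal), so ranking by PMI rather than the raw conditional
# score removes that language-prior advantage.
print(decompose(log_p_text_given_image=math.log(0.02), log_p_text=math.log(0.05)))
print(decompose(log_p_text_given_image=math.log(0.015), log_p_text=math.log(0.001)))
```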

* Website: https://linzhiqiu.github.io/papers/visual_gpt_score/ Code: https://github.com/linzhiqiu/visual_gpt_score/ 

ZeroFlow: Fast Zero Label Scene Flow via Distillation

May 23, 2023
Kyle Vedder, Neehar Peri, Nathaniel Chodosh, Ishan Khatri, Eric Eaton, Dinesh Jayaraman, Yang Liu, Deva Ramanan, James Hays


Scene flow estimation is the task of describing the 3D motion field between temporally successive point clouds. State-of-the-art methods use strong priors and test-time optimization techniques, but require on the order of tens of seconds for large-scale point clouds, making them unusable as computer vision primitives for real-time applications such as open-world object detection. Feed-forward methods are considerably faster, running on the order of tens to hundreds of milliseconds for large-scale point clouds, but require expensive human supervision. To address both limitations, we propose Scene Flow via Distillation, a simple distillation framework that uses a label-free optimization method to produce pseudo-labels to supervise a feed-forward model. Our instantiation of this framework, ZeroFlow, produces scene flow estimates in real time on large-scale point clouds at quality competitive with state-of-the-art methods while using zero human labels. Notably, at test time ZeroFlow is over 1000$\times$ faster than label-free state-of-the-art optimization-based methods on large-scale point clouds and over 1000$\times$ cheaper to train on unlabeled data compared to the cost of human annotation of that data. To facilitate research reuse, we release our code, trained model weights, and high-quality pseudo-labels for the Argoverse 2 and Waymo Open datasets.
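A minimal sketch of the distillation recipe (a label-free optimizer produces pseudo flow labels offline, a feed-forward network regresses them); the optimizer, network, and loss here are toy stand-ins rather than the released ZeroFlow pipeline.

```python
# Illustrative "scene flow via distillation": an expensive label-free
# optimizer produces pseudo flow labels, then a fast feed-forward model is
# trained to regress them. All components are toy stand-ins.
import torch
import torch.nn as nn

def pseudo_label_flow(pc_t, pc_t1, steps=50, lr=0.05):
    """Toy label-free optimizer: per-point flow minimizing distance from the
    warped cloud pc_t + flow to its nearest neighbor in pc_t1."""
    flow = torch.zeros_like(pc_t, requires_grad=True)
    opt = torch.optim.Adam([flow], lr=lr)
    for _ in range(steps):
        warped = pc_t + flow
        nearest = torch.cdist(warped, pc_t1).min(dim=1).values
        loss = nearest.mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return flow.detach()

# Offline: generate pseudo-labels for a few point-cloud pairs.
pairs = [(torch.randn(256, 3), torch.randn(256, 3)) for _ in range(4)]
pseudo = [pseudo_label_flow(a, b) for a, b in pairs]

# Online: train a fast feed-forward regressor on the pseudo-labels.
model = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 3))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for (a, b), flow in zip(pairs, pseudo):
    pred = model(torch.cat([a, b], dim=1))   # toy input: concatenated points
    loss = nn.functional.mse_loss(pred, flow)
    opt.zero_grad()
    loss.backward()
    opt.step()
print("distilled feed-forward model trained on pseudo-labels")
```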

* 9 pages, 4 pages of Supplemental 

WEDGE: A multi-weather autonomous driving dataset built from generative vision-language models

May 12, 2023
Aboli Marathe, Deva Ramanan, Rahee Walambe, Ketan Kotecha


The open road poses many challenges to autonomous perception, including poor visibility from extreme weather conditions. Models trained on good-weather datasets frequently fail at detection in these out-of-distribution settings. To aid adversarial robustness in perception, we introduce WEDGE (WEather images by DALL-E GEneration): a synthetic dataset generated with a vision-language generative model via prompting. WEDGE consists of 3360 images in 16 extreme weather conditions, manually annotated with 16513 bounding boxes, supporting research in the tasks of weather classification and 2D object detection. We analyze WEDGE from several research standpoints and verify its effectiveness for extreme-weather autonomous perception. We establish baseline performance for classification and detection with 53.87% test accuracy and 45.41 mAP. Most importantly, WEDGE can be used to fine-tune state-of-the-art detectors, improving SOTA performance on real-world weather benchmarks (such as DAWN) by 4.48 AP for well-generated classes like trucks. WEDGE has been collected under OpenAI's terms of use and is released for public use under the CC BY-NC-SA 4.0 license. The repository for this work and dataset is available at https://infernolia.github.io/WEDGE.
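A minimal sketch of prompt-driven multi-weather image generation, with a placeholder `generate_image()` standing in for whatever text-to-image model is used; the prompt template, class list, and weather list are illustrative rather than WEDGE's actual generation protocol.

```python
# Illustrative multi-weather synthetic data generation via prompting.
# generate_image() is a placeholder for a text-to-image model or API call.
from itertools import product

WEATHER = ["heavy rain", "dense fog", "snowstorm", "dust storm"]
CLASSES = ["truck", "car", "pedestrian"]

def generate_image(prompt: str) -> bytes:
    """Placeholder: call a text-to-image model and return encoded image bytes."""
    return f"<image for: {prompt}>".encode()

def build_weather_dataset(images_per_prompt=2):
    dataset = []
    for weather, cls in product(WEATHER, CLASSES):
        prompt = f"a photo of a {cls} on a road in {weather}"
        for _ in range(images_per_prompt):
            dataset.append({"prompt": prompt, "weather": weather,
                            "category": cls, "image": generate_image(prompt)})
    return dataset

print(len(build_weather_dataset()))   # 4 weathers x 3 classes x 2 images = 24
```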

* Accepted in Vision Datasets Understanding at CVPR 2023 

Reconstructing Animatable Categories from Videos

May 10, 2023
Gengshan Yang, Chaoyang Wang, N Dinesh Reddy, Deva Ramanan


Building animatable 3D models is challenging due to the need for 3D scans, laborious registration, and manual rigging, which are difficult to scale to arbitrary categories. Recently, differentiable rendering has provided a pathway to obtain high-quality 3D models from monocular videos, but these are limited to rigid categories or single instances. We present RAC, a method that builds category-level 3D models from monocular videos while disentangling variation over instances from motion over time. Three key ideas are introduced to solve this problem: (1) specializing a skeleton to instances via optimization, (2) a method for latent space regularization that encourages shared structure across a category while maintaining instance details, and (3) using 3D background models to disentangle objects from the background. We show that 3D models of humans, cats, and dogs can be learned from 50-100 internet videos.
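A minimal sketch of one plausible form of the latent-space regularization mentioned in idea (2): an L2 pull of per-instance codes toward their category mean, so shared structure dominates while instance detail survives in the residual. This is an illustrative stand-in, not RAC's actual regularizer.

```python
# Illustrative latent regularization: keep per-instance codes close to a
# shared category code so most structure is explained by the category, and
# only fine instance detail lives in the residual. Not RAC's exact loss.
import torch

def code_regularizer(instance_codes: torch.Tensor, weight: float = 0.1):
    """instance_codes: (num_instances, code_dim) learnable shape codes."""
    category_mean = instance_codes.mean(dim=0, keepdim=True)
    residual = instance_codes - category_mean
    return weight * residual.pow(2).sum(dim=1).mean()

codes = torch.randn(20, 32, requires_grad=True)   # 20 instances, 32-dim codes
loss = code_regularizer(codes)
loss.backward()
print(loss.item())
```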

* Project page: https://gengshan-y.github.io/rac-www/ 

Joint Metrics Matter: A Better Standard for Trajectory Forecasting

May 10, 2023
Erica Weng, Hana Hoshino, Deva Ramanan, Kris Kitani


Multi-modal trajectory forecasting methods are commonly evaluated using single-agent metrics (marginal metrics), such as minimum Average Displacement Error (ADE) and Final Displacement Error (FDE), which fail to capture the joint performance of multiple interacting agents. Focusing only on marginal metrics can lead to unnatural predictions, such as colliding trajectories or diverging trajectories for people who are clearly walking together as a group. Consequently, methods optimized for marginal metrics lead to overly optimistic estimates of performance, which is detrimental to progress in trajectory forecasting research. In response to the limitations of marginal metrics, we present the first comprehensive evaluation of state-of-the-art (SOTA) trajectory forecasting methods with respect to multi-agent metrics (joint metrics): JADE, JFDE, and collision rate. We demonstrate the importance of joint metrics as opposed to marginal metrics with quantitative evidence and qualitative examples drawn from the ETH / UCY and Stanford Drone datasets. We introduce a new loss function incorporating joint metrics that, when applied to a SOTA trajectory forecasting method, achieves a 7% improvement in JADE / JFDE on the ETH / UCY datasets with respect to the previous SOTA. Our results also indicate that optimizing for joint metrics naturally leads to an improvement in interaction modeling, as evidenced by a 16% decrease in mean collision rate on the ETH / UCY datasets with respect to the previous SOTA.
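A minimal sketch contrasting marginal and joint ADE over K predicted samples, assuming predictions of shape (K, A, T, 2); the paper's exact JADE conventions (e.g., averaging details) may differ from this toy version.

```python
# Illustrative marginal vs. joint displacement error over K sampled futures:
# marginal minADE lets each agent pick its own best sample, while joint ADE
# forces one shared sample index for the whole scene.
import numpy as np

def displacement(pred, gt):
    """pred: (K, A, T, 2) samples; gt: (A, T, 2). Returns (K, A) per-agent ADE."""
    return np.linalg.norm(pred - gt[None], axis=-1).mean(axis=-1)

def marginal_min_ade(pred, gt):
    return displacement(pred, gt).min(axis=0).mean()      # best sample per agent

def joint_min_ade(pred, gt):
    return displacement(pred, gt).mean(axis=1).min()      # one best sample per scene

rng = np.random.default_rng(0)
K, A, T = 20, 3, 12
gt = rng.normal(size=(A, T, 2))
pred = gt[None] + 0.3 * rng.normal(size=(K, A, T, 2))
print(marginal_min_ade(pred, gt), joint_min_ade(pred, gt))
# Joint ADE is always >= marginal minADE, since it cannot mix samples per agent.
```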

* 10 figures 