Bharath Hariharan

Cornell University

Visual Prompt Tuning

Mar 23, 2022

Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, Ser-Nam Lim

Abstract: The current modus operandi in adapting pre-trained models involves updating all the backbone parameters, i.e., full fine-tuning. This paper introduces Visual Prompt Tuning (VPT) as an efficient and effective alternative to full fine-tuning for large-scale Transformer models in vision. Taking inspiration from recent advances in efficiently tuning large language models, VPT introduces only a small number of trainable parameters (less than 1% of model parameters) in the input space while keeping the model backbone frozen. Via extensive experiments on a wide variety of downstream recognition tasks, we show that VPT achieves significant performance gains compared to other parameter-efficient tuning protocols. Most importantly, VPT even outperforms full fine-tuning in many cases across model capacities and training data scales, while reducing per-task storage cost.

* Tech report
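
To make the "less than 1% of model parameters" figure concrete, here is a minimal sketch of the idea as the abstract describes it: learnable prompt tokens are placed in the input sequence while the backbone stays frozen. The function names, the token sizes, and the roughly 86M-parameter backbone are assumptions chosen for illustration and are not taken from the paper's code; the `layers_with_prompts` argument is a hypothetical knob for the case where prompts are added at more than one layer.

```python
# Illustrative sketch only: names and sizes are assumptions, not the paper's code.

def prepend_prompts(patch_tokens, prompt_tokens):
    """Return the input sequence with learnable prompt tokens placed before the patch tokens."""
    return prompt_tokens + patch_tokens


def trainable_fraction(num_prompts, embed_dim, backbone_params, layers_with_prompts=1):
    """Fraction of all parameters that are trainable prompts (any task head is ignored)."""
    prompt_params = num_prompts * embed_dim * layers_with_prompts
    return prompt_params / (backbone_params + prompt_params)


# Toy sequence: 3 patch tokens and 2 prompt tokens, each a 4-dimensional vector.
patches = [[0.0] * 4 for _ in range(3)]
prompts = [[1.0] * 4 for _ in range(2)]
sequence = prepend_prompts(patches, prompts)
print(len(sequence), "tokens enter the frozen backbone")

# Assumed sizes: 768-dim tokens, 50 prompts, about 86M frozen backbone parameters.
fraction = trainable_fraction(num_prompts=50, embed_dim=768, backbone_params=86_000_000)
print(f"trainable prompt fraction: {fraction:.4%}")
```

Under these assumed sizes the prompts stay well below the 1% budget, even if the same number of prompts were added at each of 12 layers.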

Hindsight is 20/20: Leveraging Past Traversals to Aid 3D Perception

Mar 22, 2022

Yurong You, Katie Z Luo, Xiangyu Chen, Junan Chen, Wei-Lun Chao, Wen Sun, Bharath Hariharan, Mark Campbell, Kilian Q. Weinberger

Abstract: Self-driving cars must detect vehicles, pedestrians, and other traffic participants accurately to operate safely. Small, far-away, or highly occluded objects are particularly challenging because there is limited information in the LiDAR point clouds for detecting them. To address this challenge, we leverage valuable information from the past: in particular, data collected in past traversals of the same scene. We posit that these past data, which are typically discarded, provide rich contextual information for disambiguating the above-mentioned challenging cases. To this end, we propose a novel, end-to-end trainable Hindsight framework to extract this contextual information from past traversals and store it in an easy-to-query data structure, which can then be leveraged to aid future 3D object detection of the same scene. We show that this framework is compatible with most modern 3D detection architectures and can substantially improve their average precision on multiple autonomous driving datasets, most notably by more than 300% on the challenging cases.

* Accepted by ICLR 2022. Code is available at https://github.com/YurongYou/Hindsight

Unsupervised Semantic Segmentation by Distilling Feature Correspondences

Mar 16, 2022

Mark Hamilton, Zhoutong Zhang, Bharath Hariharan, Noah Snavely, William T. Freeman

Abstract: Unsupervised semantic segmentation aims to discover and localize semantically meaningful categories within image corpora without any form of annotation. To solve this task, algorithms must produce features for every pixel that are both semantically meaningful and compact enough to form distinct clusters. Unlike previous works, which achieve this with a single end-to-end framework, we propose to separate feature learning from cluster compactification. Empirically, we show that current unsupervised feature learning frameworks already generate dense features whose correlations are semantically consistent. This observation motivates us to design STEGO (Self-supervised Transformer with Energy-based Graph Optimization), a novel framework that distills unsupervised features into high-quality discrete semantic labels. At the core of STEGO is a novel contrastive loss function that encourages features to form compact clusters while preserving their relationships across the corpora. STEGO yields a significant improvement over the prior state of the art, on both the CocoStuff (+14 mIoU) and Cityscapes (+9 mIoU) semantic segmentation challenges.

Orientation-Discriminative Feature Representation for Decentralized Pedestrian Tracking

Feb 26, 2022

Vikram Shree, Carlos Diaz-Ruiz, Chang Liu, Bharath Hariharan, Mark Campbell

Abstract: This paper focuses on the problem of decentralized pedestrian tracking using a sensor network. Traditional works on pedestrian tracking usually use a centralized framework, which becomes less practical for robotic applications due to limited communication bandwidth. Our paper proposes a communication-efficient, orientation-discriminative feature representation that characterizes pedestrian appearance information and can be shared among sensors. Building upon that representation, our work develops a cross-sensor track association approach to achieve decentralized tracking. Extensive evaluations are conducted on publicly available datasets, and the results show that our proposed approach leads to improved performance in multi-sensor tracking.

* 8 pages, 4 figures, submitted to IEEE/RSJ International Conference on Intelligent Robots and Systems

Stay Positive: Non-Negative Image Synthesis for Augmented Reality

Feb 01, 2022

Katie Luo, Guandao Yang, Wenqi Xian, Harald Haraldsson, Bharath Hariharan, Serge Belongie

Abstract: In applications such as optical see-through and projector augmented reality, producing images amounts to solving non-negative image generation, where one can only add light to an existing image. Most image generation methods, however, are ill-suited to this problem setting, as they assume that one can assign an arbitrary color to each pixel. In fact, naive application of existing methods fails even in simple domains such as MNIST digits, since one cannot create darker pixels by adding light. We know, however, that the human visual system can be fooled by optical illusions involving certain spatial configurations of brightness and contrast. Our key insight is that one can leverage this behavior to produce high-quality images with negligible artifacts. For example, we can create the illusion of darker patches by brightening surrounding pixels. We propose a novel optimization procedure to produce images that satisfy both semantic and non-negativity constraints. Our approach can incorporate existing state-of-the-art methods, and exhibits strong performance in a variety of tasks including image-to-image translation and style transfer.

* Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 10050-10060
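
The non-negativity constraint in this abstract can be illustrated with a few lines of code. The sketch below shows only the naive baseline that the abstract says fails (clamping the required light at zero), not the paper's optimization procedure; the function names and the grayscale pixel values are assumptions made for illustration.

```python
# Illustrative sketch of the constraint only; not the paper's method.

def additive_light(background, target):
    """Light to add per pixel so that background + added approaches target.

    A display that can only add light cannot emit a negative amount,
    so any negative requirement is clamped to zero.
    """
    return [max(t - b, 0.0) for b, t in zip(background, target)]


def displayed(background, added):
    """What the viewer sees: existing light plus the added light."""
    return [b + a for b, a in zip(background, added)]


# Assumed grayscale intensities in [0, 1] for three pixels.
background = [0.2, 0.5, 0.8]
target = [0.6, 0.5, 0.3]

added = additive_light(background, target)
shown = displayed(background, added)
for b, t, s in zip(background, target, shown):
    status = "reachable" if abs(s - t) < 1e-9 else "too bright, cannot darken"
    print(f"background={b:.1f} target={t:.1f} shown={s:.1f} ({status})")
```

The third pixel stays at the background intensity because its target is darker than what is already there, which is the failure case the abstract describes.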

Field-Guide-Inspired Zero-Shot Learning

Aug 24, 2021

Utkarsh Mall, Bharath Hariharan, Kavita Bala

Abstract: Modern recognition systems require large amounts of supervision to achieve high accuracy. Adapting to new domains requires significant data from experts, which is onerous and can become too expensive. Zero-shot learning requires an annotated set of attributes for a novel category. Annotating the full set of attributes for a novel category proves to be a tedious and expensive task in deployment. This is especially the case when the recognition domain is an expert domain. We introduce a new field-guide-inspired approach to zero-shot annotation where the learner model interactively asks for the most useful attributes that define a class. We evaluate our method on classification benchmarks with attribute annotations like CUB, SUN, and AWA2 and show that our model achieves the performance of a model with full annotations at the cost of significantly fewer annotations. Since the time of experts is precious, decreasing annotation cost can be very valuable for real-world deployment.

* Accepted to ICCV 2021

Extreme Rotation Estimation using Dense Correlation Volumes

Apr 28, 2021

Ruojin Cai, Bharath Hariharan, Noah Snavely, Hadar Averbuch-Elor

Abstract: We present a technique for estimating the relative 3D rotation of an RGB image pair in an extreme setting, where the images have little or no overlap. We observe that, even when images do not overlap, there may be rich hidden cues as to their geometric relationship, such as light source directions, vanishing points, and symmetries present in the scene. We propose a network design that can automatically learn such implicit cues by comparing all pairs of points between the two input images. Our method therefore constructs dense feature correlation volumes and processes these to predict relative 3D rotations. Our predictions are formed over a fine-grained discretization of rotations, bypassing difficulties associated with regressing 3D rotations. We demonstrate our approach on a large variety of extreme RGB image pairs, including indoor and outdoor images captured under different lighting conditions and geographic locations. Our evaluation shows that our model can successfully estimate relative rotations among non-overlapping images without compromising performance over overlapping image pairs.

* Published in CVPR 2021; Project page: https://ruojincai.github.io/ExtremeRotation/
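
The phrase "comparing all pairs of points between the two input images" can be sketched as an all-pairs dot product between per-location feature vectors. The code below is a minimal illustration under that assumption; the function names, the use of a plain dot product as the similarity, and the toy feature values are hypothetical and are not taken from the paper or its project page.

```python
# Illustrative sketch only: a dense all-pairs correlation between two feature maps.

def dot(u, v):
    """Dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def correlation_volume(features_a, features_b):
    """Compare every location of image A with every location of image B.

    Each input is a list of feature vectors, one per spatial location
    (a flattened H x W grid). The result has one row per location in A
    and one column per location in B.
    """
    return [[dot(fa, fb) for fb in features_b] for fa in features_a]


# Toy feature maps: 2 locations in image A, 3 locations in image B, 2 channels each.
features_a = [[1.0, 0.0], [0.0, 1.0]]
features_b = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]

volume = correlation_volume(features_a, features_b)
for row in volume:
    print(row)
```

In the setting the abstract describes, a volume like this would then be processed by a network that predicts a rotation class; that stage is not sketched here.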

PiCIE: Unsupervised Semantic Segmentation using Invariance and Equivariance in Clustering

Mar 30, 2021

Jang Hyun Cho, Utkarsh Mall, Kavita Bala, Bharath Hariharan

Abstract: We present a new framework for semantic segmentation without annotations via clustering. Off-the-shelf clustering methods are limited to curated, single-label, and object-centric images, yet real-world data are predominantly uncurated, multi-label, and scene-centric. We extend clustering from images to pixels and assign separate cluster membership to different instances within each image. However, solely relying on pixel-wise feature similarity fails to learn high-level semantic concepts and overfits to low-level visual cues. We propose a method to incorporate geometric consistency as an inductive bias to learn invariance and equivariance for photometric and geometric variations. With our novel learning objective, our framework can learn high-level semantic concepts. Our method, PiCIE (Pixel-level feature Clustering using Invariance and Equivariance), is the first method capable of segmenting both things and stuff categories without any hyperparameter tuning or task-specific pre-processing. Our method largely outperforms existing baselines on COCO and Cityscapes with +17.5 Acc. and +4.5 mIoU. We show that PiCIE gives a better initialization for standard supervised training. The code is available at https://github.com/janghyuncho/PiCIE.

* CVPR 2021

Exploiting Playbacks in Unsupervised Domain Adaptation for 3D Object Detection

Mar 26, 2021

Yurong You, Carlos Andres Diaz-Ruiz, Yan Wang, Wei-Lun Chao, Bharath Hariharan, Mark Campbell, Kilian Q Weinberger

Abstract: Self-driving cars must detect other vehicles and pedestrians in 3D to plan safe routes and avoid collisions. State-of-the-art 3D object detectors, based on deep learning, have shown promising accuracy but are prone to overfitting to domain idiosyncrasies, making them fail in new environments -- a serious problem if autonomous vehicles are meant to operate freely. In this paper, we propose a novel learning approach that drastically reduces this gap by fine-tuning the detector on pseudo-labels in the target domain, which our method generates while the vehicle is parked, based on replays of previously recorded driving sequences. In these replays, objects are tracked over time, and detections are interpolated and extrapolated -- crucially, leveraging future information to catch hard cases. We show, on five autonomous driving datasets, that fine-tuning the object detector on these pseudo-labels substantially reduces the domain gap to new driving environments, yielding drastic improvements in accuracy and detection reliability.

* RAL-IROS 2021 submission
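
The step "detections are interpolated" along a track can be sketched as filling in frames where the detector missed an object that it found both earlier and later. The code below assumes linear interpolation of 2D box centers, which is a simplification chosen for illustration; the function name, the track format, and the motion model are hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: linear interpolation of missed detections within one track.

def interpolate_track(track):
    """Fill frames that lie between two observed detections.

    `track` maps a frame index to the (x, y) center of a detection.
    Frames between two observed frames get a linearly interpolated center,
    so a later ("future") detection helps recover earlier missed ones.
    """
    frames = sorted(track)
    filled = dict(track)
    for start, end in zip(frames, frames[1:]):
        (x0, y0), (x1, y1) = track[start], track[end]
        for frame in range(start + 1, end):
            t = (frame - start) / (end - start)
            filled[frame] = (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
    return filled


# The object is detected at frames 0 and 4 but missed at frames 1 to 3.
observed = {0: (0.0, 0.0), 4: (8.0, 4.0)}
pseudo_labels = interpolate_track(observed)
for frame in sorted(pseudo_labels):
    print(frame, pseudo_labels[frame])
```

Filled-in positions of this kind could serve as pseudo-labels for fine-tuning; extrapolation beyond the first and last observed frames is not covered by this sketch.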

Fine-Grained Few-Shot Classification with Feature Map Reconstruction Networks

Dec 02, 2020

Davis Wertheimer, Luming Tang, Bharath Hariharan

Abstract: In this paper, we reformulate few-shot classification as a reconstruction problem in latent space. The ability of the network to reconstruct a query feature map from support features of a given class predicts membership of the query in that class. We introduce a novel mechanism for few-shot classification by regressing directly from support features to query features in closed form, without introducing any new modules or large-scale learnable parameters. The resulting Feature Map Reconstruction Networks are both more performant and more computationally efficient than previous approaches. We demonstrate consistent and significant accuracy gains on four fine-grained benchmarks with varying neural architectures. Our model is also competitive on the non-fine-grained mini-ImageNet benchmark with minimal bells and whistles.
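
One closed-form regression that fits the description "regressing directly from support features to query features" is ridge regression, and the sketch below assumes that choice. A query vector is reconstructed as a weighted sum of each class's support vectors, and the class with the smallest reconstruction error is predicted. The function names, the regularizer `lam`, and the toy features are assumptions for illustration; the sketch uses a single query vector rather than a full feature map and omits any learned components.

```python
# Illustrative sketch only: classification by ridge-regression reconstruction error.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row_a[:] + row_b[:] for row_a, row_b in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [row[n:] for row in m]


def reconstruct(query, support, lam):
    """Closed-form ridge reconstruction of `query` from the rows of `support`."""
    gram = [[dot(si, sj) + (lam if i == j else 0.0) for j, sj in enumerate(support)]
            for i, si in enumerate(support)]
    rhs = [[dot(s, query)] for s in support]
    weights = [row[0] for row in solve(gram, rhs)]
    return [sum(w * s[c] for w, s in zip(weights, support)) for c in range(len(query))]


def classify(query, supports_by_class, lam=0.1):
    """Return the label with the lowest squared reconstruction error, plus all errors."""
    errors = {}
    for label, support in supports_by_class.items():
        estimate = reconstruct(query, support, lam)
        errors[label] = sum((q - e) ** 2 for q, e in zip(query, estimate))
    return min(errors, key=errors.get), errors


# Toy 3-dimensional features: class "A" has two support vectors, class "B" has one.
supports = {"A": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], "B": [[0.0, 0.0, 1.0]]}
label, errors = classify([0.9, 0.1, 0.0], supports)
print(label, errors)
```

The regularizer keeps the linear system well conditioned when a class has few or nearly collinear support features, which is one reason a ridge term is a natural fit for the few-shot setting.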
