Alert button
Picture for Sudheendra Vijayanarasimhan

Sudheendra Vijayanarasimhan

Alert button

$IC^3$: Image Captioning by Committee Consensus

Feb 16, 2023
David M. Chan, Austin Myers, Sudheendra Vijayanarasimhan, David A. Ross, John Canny

Figure 1 for $IC^3$: Image Captioning by Committee Consensus
Figure 2 for $IC^3$: Image Captioning by Committee Consensus
Figure 3 for $IC^3$: Image Captioning by Committee Consensus
Figure 4 for $IC^3$: Image Captioning by Committee Consensus

If you ask a human to describe an image, they might do so in a thousand different ways. Traditionally, image captioning models are trained to approximate the reference distribution of image captions, however, doing so encourages captions that are viewpoint-impoverished. Such captions often focus on only a subset of the possible details, while ignoring potentially useful information in the scene. In this work, we introduce a simple, yet novel, method: "Image Captioning by Committee Consensus" ($IC^3$), designed to generate a single caption that captures high-level details from several viewpoints. Notably, humans rate captions produced by $IC^3$ at least as helpful as baseline SOTA models more than two thirds of the time, and $IC^3$ captions can improve the performance of SOTA automated recall systems by up to 84%, indicating significant material improvements over existing SOTA approaches for visual description. Our code is publicly available at https://github.com/DavidMChan/caption-by-committee

Viaarxiv icon

Open-Vocabulary Temporal Action Detection with Off-the-Shelf Image-Text Features

Dec 20, 2022
Vivek Rathod, Bryan Seybold, Sudheendra Vijayanarasimhan, Austin Myers, Xiuye Gu, Vighnesh Birodkar, David A. Ross

Figure 1 for Open-Vocabulary Temporal Action Detection with Off-the-Shelf Image-Text Features
Figure 2 for Open-Vocabulary Temporal Action Detection with Off-the-Shelf Image-Text Features
Figure 3 for Open-Vocabulary Temporal Action Detection with Off-the-Shelf Image-Text Features
Figure 4 for Open-Vocabulary Temporal Action Detection with Off-the-Shelf Image-Text Features

Detecting actions in untrimmed videos should not be limited to a small, closed set of classes. We present a simple, yet effective strategy for open-vocabulary temporal action detection utilizing pretrained image-text co-embeddings. Despite being trained on static images rather than videos, we show that image-text co-embeddings enable openvocabulary performance competitive with fully-supervised models. We show that the performance can be further improved by ensembling the image-text features with features encoding local motion, like optical flow based features, or other modalities, like audio. In addition, we propose a more reasonable open-vocabulary evaluation setting for the ActivityNet data set, where the category splits are based on similarity rather than random assignment.

Viaarxiv icon

Distribution Aware Metrics for Conditional Natural Language Generation

Sep 29, 2022
David M Chan, Yiming Ni, David A Ross, Sudheendra Vijayanarasimhan, Austin Myers, John Canny

Figure 1 for Distribution Aware Metrics for Conditional Natural Language Generation
Figure 2 for Distribution Aware Metrics for Conditional Natural Language Generation
Figure 3 for Distribution Aware Metrics for Conditional Natural Language Generation
Figure 4 for Distribution Aware Metrics for Conditional Natural Language Generation

Traditional automated metrics for evaluating conditional natural language generation use pairwise comparisons between a single generated text and the best-matching gold-standard ground truth text. When multiple ground truths are available, scores are aggregated using an average or max operation across references. While this approach works well when diversity in the ground truth data (i.e. dispersion of the distribution of conditional texts) can be ascribed to noise, such as in automated speech recognition, it does not allow for robust evaluation in the case where diversity in the ground truths represents signal for the model. In this work we argue that existing metrics are not appropriate for domains such as visual description or summarization where ground truths are semantically diverse, and where the diversity in those captions captures useful additional information about the context. We propose a novel paradigm for multi-candidate evaluation of conditional language generation models, and a new family of metrics that compare the distributions of reference and model-generated caption sets using small sample sets of each. We demonstrate the utility of our approach with a case study in visual description: where we show that existing models optimize for single-description quality over diversity, and gain some insights into how sampling methods and temperature impact description quality and diversity.

Viaarxiv icon

What's in a Caption? Dataset-Specific Linguistic Diversity and Its Effect on Visual Description Models and Metrics

May 12, 2022
David M. Chan, Austin Myers, Sudheendra Vijayanarasimhan, David A. Ross, Bryan Seybold, John F. Canny

Figure 1 for What's in a Caption? Dataset-Specific Linguistic Diversity and Its Effect on Visual Description Models and Metrics
Figure 2 for What's in a Caption? Dataset-Specific Linguistic Diversity and Its Effect on Visual Description Models and Metrics
Figure 3 for What's in a Caption? Dataset-Specific Linguistic Diversity and Its Effect on Visual Description Models and Metrics
Figure 4 for What's in a Caption? Dataset-Specific Linguistic Diversity and Its Effect on Visual Description Models and Metrics

While there have been significant gains in the field of automated video description, the generalization performance of automated description models to novel domains remains a major barrier to using these systems in the real world. Most visual description methods are known to capture and exploit patterns in the training data leading to evaluation metric increases, but what are those patterns? In this work, we examine several popular visual description datasets, and capture, analyze, and understand the dataset-specific linguistic patterns that models exploit but do not generalize to new domains. At the token level, sample level, and dataset level, we find that caption diversity is a major driving factor behind the generation of generic and uninformative captions. We further show that state-of-the-art models even outperform held-out ground truth captions on modern metrics, and that this effect is an artifact of linguistic diversity in datasets. Understanding this linguistic diversity is key to building strong captioning models, we recommend several methods and approaches for maintaining diversity in the collection of new data, and dealing with the consequences of limited diversity when using current models and metrics.

* The 1st Workshop on Vision Datasets Understanding, IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR), 2022 
Viaarxiv icon

Active Learning for Video Description With Cluster-Regularized Ensemble Ranking

Jul 29, 2020
David M. Chan, Sudheendra Vijayanarasimhan, David A. Ross, John Canny

Figure 1 for Active Learning for Video Description With Cluster-Regularized Ensemble Ranking
Figure 2 for Active Learning for Video Description With Cluster-Regularized Ensemble Ranking
Figure 3 for Active Learning for Video Description With Cluster-Regularized Ensemble Ranking
Figure 4 for Active Learning for Video Description With Cluster-Regularized Ensemble Ranking

Automatic video captioning aims to train models to generate text descriptions for all segments in a video, however, the most effective approaches require large amounts of manual annotation which is slow and expensive. Active learning is a promising way to efficiently build a training set for video captioning tasks while reducing the need to manually label uninformative examples. In this work we both explore various active learning approaches for automatic video captioning and show that a cluster-regularized ensemble strategy provides the best active learning approach to efficiently gather training sets for video captioning. We evaluate our approaches on the MSR-VTT and LSMDC datasets using both transformer and LSTM based captioning models and show that our novel strategy can achieve high performance while using up to 60% fewer training data than the strong state of the art baselines.

Viaarxiv icon

AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions

Apr 30, 2018
Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, Jitendra Malik

Figure 1 for AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions
Figure 2 for AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions
Figure 3 for AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions
Figure 4 for AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions

This paper introduces a video dataset of spatio-temporally localized Atomic Visual Actions (AVA). The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently. The key characteristics of our dataset are: (1) the definition of atomic visual actions, rather than composite actions; (2) precise spatio-temporal annotations with possibly multiple annotations for each person; (3) exhaustive annotation of these atomic actions over 15-minute video clips; (4) people temporally linked across consecutive segments; and (5) using movies to gather a varied set of action representations. This departs from existing datasets for spatio-temporal action recognition, which typically provide sparse annotations for composite actions in short video clips. We will release the dataset publicly. AVA, with its realistic scene and action complexity, exposes the intrinsic difficulty of action recognition. To benchmark this, we present a novel approach for action localization that builds upon the current state-of-the-art methods, and demonstrates better performance on JHMDB and UCF101-24 categories. While setting a new state of the art on existing datasets, the overall results on AVA are low at 15.6% mAP, underscoring the need for developing new approaches for video understanding.

* To appear in CVPR 2018. Check dataset page https://research.google.com/ava/ for details 
Viaarxiv icon

Rethinking the Faster R-CNN Architecture for Temporal Action Localization

Apr 20, 2018
Yu-Wei Chao, Sudheendra Vijayanarasimhan, Bryan Seybold, David A. Ross, Jia Deng, Rahul Sukthankar

Figure 1 for Rethinking the Faster R-CNN Architecture for Temporal Action Localization
Figure 2 for Rethinking the Faster R-CNN Architecture for Temporal Action Localization
Figure 3 for Rethinking the Faster R-CNN Architecture for Temporal Action Localization
Figure 4 for Rethinking the Faster R-CNN Architecture for Temporal Action Localization

We propose TAL-Net, an improved approach to temporal action localization in video that is inspired by the Faster R-CNN object detection framework. TAL-Net addresses three key shortcomings of existing approaches: (1) we improve receptive field alignment using a multi-scale architecture that can accommodate extreme variation in action durations; (2) we better exploit the temporal context of actions for both proposal generation and action classification by appropriately extending receptive fields; and (3) we explicitly consider multi-stream feature fusion and demonstrate that fusing motion late is important. We achieve state-of-the-art performance for both action proposal and localization on THUMOS'14 detection benchmark and competitive performance on ActivityNet challenge.

* Accepted in CVPR 2018 
Viaarxiv icon