Visual Question Answering (VQA) has received a lot of attention over the past couple of years. A number of deep learning models have been proposed for this task. However, it has been shown that these models are heavily driven by superficial correlations in the training data and lack compositionality -- the ability to answer questions about unseen compositions of seen concepts. This compositionality is desirable and central to intelligence. In this paper, we propose a new setting for Visual Question Answering where the test question-answer pairs are compositionally novel compared to training question-answer pairs. To facilitate developing models under this setting, we present a new compositional split of the VQA v1.0 dataset, which we call Compositional VQA (C-VQA). We analyze the distribution of questions and answers in the C-VQA splits. Finally, we evaluate several existing VQA models under this new setting and show that the performances of these models degrade by a significant amount compared to the original VQA setting.
We propose a technique for producing "visual explanations" for decisions from a large class of CNN-based models, making them more transparent. Our approach - Gradient-weighted Class Activation Mapping (Grad-CAM), uses the gradients of any target concept, flowing into the final convolutional layer to produce a coarse localization map highlighting the important regions in the image for predicting the concept. Unlike previous approaches, GradCAM is applicable to a wide variety of CNN model-families: (1) CNNs with fully-connected layers (e.g. VGG), (2) CNNs used for structured outputs (e.g. captioning), (3) CNNs used in tasks with multimodal inputs (e.g. VQA) or reinforcement learning, without any architectural changes or re-training. We combine GradCAM with fine-grained visualizations to create a high-resolution class-discriminative visualization and apply it to off-the-shelf image classification, captioning, and visual question answering (VQA) models, including ResNet-based architectures. In the context of image classification models, our visualizations (a) lend insights into their failure modes (showing that seemingly unreasonable predictions have reasonable explanations), (b) are robust to adversarial images, (c) outperform previous methods on weakly-supervised localization, (d) are more faithful to the underlying model and (e) help achieve generalization by identifying dataset bias. For captioning and VQA, our visualizations show that even non-attention based models can localize inputs. Finally, we conduct human studies to measure if GradCAM explanations help users establish trust in predictions from deep networks and show that GradCAM helps untrained users successfully discern a "stronger" deep network from a "weaker" one. Our code is available at https://github.com/ramprs/grad-cam. A demo and a video of the demo can be found at http://gradcam.cloudcv.org and youtu.be/COjUB9Izk6E.
We propose a technique for making Convolutional Neural Network (CNN)-based models more transparent by visualizing input regions that are 'important' for predictions -- or visual explanations. Our approach, called Gradient-weighted Class Activation Mapping (Grad-CAM), uses class-specific gradient information to localize important regions. These localizations are combined with existing pixel-space visualizations to create a novel high-resolution and class-discriminative visualization called Guided Grad-CAM. These methods help better understand CNN-based models, including image captioning and visual question answering (VQA) models. We evaluate our visual explanations by measuring their ability to discriminate between classes, to inspire trust in humans, and their correlation with occlusion maps. Grad-CAM provides a new way to understand CNN-based models. We have released code, an online demo hosted on CloudCV, and a full version of this extended abstract.
A number of recent works have proposed attention models for Visual Question Answering (VQA) that generate spatial maps highlighting image regions relevant to answering the question. In this paper, we argue that in addition to modeling "where to look" or visual attention, it is equally important to model "what words to listen to" or question attention. We present a novel co-attention model for VQA that jointly reasons about image and question attention. In addition, our model reasons about the question (and consequently the image via the co-attention mechanism) in a hierarchical fashion via a novel 1-dimensional convolution neural networks (CNN). Our model improves the state-of-the-art on the VQA dataset from 60.3% to 60.5%, and from 61.6% to 63.3% on the COCO-QA dataset. By using ResNet, the performance is further improved to 62.1% for VQA and 65.4% for COCO-QA.
Temporal common sense has applications in AI tasks such as QA, multi-document summarization, and human-AI communication. We propose the task of sequencing -- given a jumbled set of aligned image-caption pairs that belong to a story, the task is to sort them such that the output sequence forms a coherent story. We present multiple approaches, via unary (position) and pairwise (order) predictions, and their ensemble-based combinations, achieving strong results on this task. We use both text-based and image-based features, which depict complementary improvements. Using qualitative examples, we demonstrate that our models have learnt interesting aspects of temporal common sense.
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring real-world scenarios, such as helping the visually impaired, both the questions and answers are open-ended. Visual questions selectively target different areas of an image, including background details and underlying context. As a result, a system that succeeds at VQA typically needs a more detailed understanding of the image and complex reasoning than a system producing generic image captions. Moreover, VQA is amenable to automatic evaluation, since many open-ended answers contain only a few words or a closed set of answers that can be provided in a multiple-choice format. We provide a dataset containing ~0.25M images, ~0.76M questions, and ~10M answers (www.visualqa.org), and discuss the information it provides. Numerous baselines and methods for VQA are provided and compared with human performance. Our VQA demo is available on CloudCV (http://cloudcv.org/vqa).
Recently, a number of deep-learning based models have been proposed for the task of Visual Question Answering (VQA). The performance of most models is clustered around 60-70%. In this paper we propose systematic methods to analyze the behavior of these models as a first step towards recognizing their strengths and weaknesses, and identifying the most fruitful directions for progress. We analyze two models, one each from two major classes of VQA models -- with-attention and without-attention and show the similarities and differences in the behavior of these models. We also analyze the winning entry of the VQA Challenge 2016. Our behavior analysis reveals that despite recent progress, today's VQA models are "myopic" (tend to fail on sufficiently novel instances), often "jump to conclusions" (converge on a predicted answer after 'listening' to just half the question), and are "stubborn" (do not change their answers across images).
Visual Question Answering (VQA) is the task of answering natural-language questions about images. We introduce the novel problem of determining the relevance of questions to images in VQA. Current VQA models do not reason about whether a question is even related to the given image (e.g. What is the capital of Argentina?) or if it requires information from external resources to answer correctly. This can break the continuity of a dialogue in human-machine interaction. Our approaches for determining relevance are composed of two stages. Given an image and a question, (1) we first determine whether the question is visual or not, (2) if visual, we determine whether the question is relevant to the given image or not. Our approaches, based on LSTM-RNNs, VQA model uncertainty, and caption-question similarity, are able to outperform strong baselines on both relevance tasks. We also present human studies showing that VQA models augmented with such question relevance reasoning are perceived as more intelligent, reasonable, and human-like.
Computer vision methods that quantify the perception of urban environment are increasingly being used to study the relationship between a city's physical appearance and the behavior and health of its residents. Yet, the throughput of current methods is too limited to quantify the perception of cities across the world. To tackle this challenge, we introduce a new crowdsourced dataset containing 110,988 images from 56 cities, and 1,170,000 pairwise comparisons provided by 81,630 online volunteers along six perceptual attributes: safe, lively, boring, wealthy, depressing, and beautiful. Using this data, we train a Siamese-like convolutional neural architecture, which learns from a joint classification and ranking loss, to predict human judgments of pairwise image comparisons. Our results show that crowdsourcing combined with neural networks can produce urban perception data at the global scale.
Deep neural networks have shown striking progress and obtained state-of-the-art results in many AI research fields in the recent years. However, it is often unsatisfying to not know why they predict what they do. In this paper, we address the problem of interpreting Visual Question Answering (VQA) models. Specifically, we are interested in finding what part of the input (pixels in images or words in questions) the VQA model focuses on while answering the question. To tackle this problem, we use two visualization techniques -- guided backpropagation and occlusion -- to find important words in the question and important regions in the image. We then present qualitative and quantitative analyses of these importance maps. We found that even without explicit attention mechanisms, VQA models may sometimes be implicitly attending to relevant regions in the image, and often to appropriate words in the question.