Robik Shrestha

Answering Questions about Data Visualizations using Efficient Bimodal Fusion

Aug 05, 2019

Kushal Kafle, Robik Shrestha, Brian Price, Scott Cohen, Christopher Kanan

Abstract: Chart question answering (CQA) is a newly proposed visual question answering (VQA) task where an algorithm must answer questions about data visualizations, e.g., bar charts, pie charts, and line graphs. CQA requires capabilities that natural-image VQA algorithms lack: fine-grained measurements, optical character recognition, and handling out-of-vocabulary words in both questions and answers. Without modifications, state-of-the-art VQA algorithms perform poorly on this task. Here, we propose a novel CQA algorithm called parallel recurrent fusion of image and language (PReFIL). PReFIL first learns bimodal embeddings by fusing question and image features and then intelligently aggregates these learned embeddings to answer the given question. Despite its simplicity, PReFIL greatly surpasses state-of-the-art systems and human baselines on both the FigureQA and DVQA datasets. Additionally, we demonstrate that PReFIL can be used to reconstruct tables by asking a series of questions about a chart.

Challenges and Prospects in Vision and Language Research

May 24, 2019

Kushal Kafle, Robik Shrestha, Christopher Kanan

Abstract: Language-grounded image understanding tasks have often been proposed as a method for evaluating progress in artificial intelligence. Ideally, these tasks should test a plethora of capabilities that integrate computer vision, reasoning, and natural language understanding. However, rather than behaving as visual Turing tests, recent studies have demonstrated that state-of-the-art systems are achieving good performance through flaws in datasets and evaluation procedures. We review the current state of affairs and outline a path forward.

Answer Them All! Toward Universal Visual Question Answering Models

Apr 05, 2019

Robik Shrestha, Kushal Kafle, Christopher Kanan

Abstract: Visual Question Answering (VQA) research is split into two camps: the first focuses on VQA datasets that require natural image understanding and the second focuses on synthetic datasets that test reasoning. A good VQA algorithm should be capable of both, but only a few VQA algorithms are tested in this manner. We compare five state-of-the-art VQA algorithms across eight VQA datasets covering both domains. To make the comparison fair, all of the models are standardized as much as possible, e.g., they use the same visual features, answer vocabularies, etc. We find that methods do not generalize across the two domains. To address this problem, we propose a new VQA algorithm that rivals or exceeds the state-of-the-art for both domains.

* 8 pages
