Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Marcus Rohrbach

TextCaps: a Dataset for Image Captioning with Reading Comprehension

Mar 24, 2020

Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, Amanpreet Singh

Figure 1 for TextCaps: a Dataset for Image Captioning with Reading Comprehension

Figure 2 for TextCaps: a Dataset for Image Captioning with Reading Comprehension

Figure 3 for TextCaps: a Dataset for Image Captioning with Reading Comprehension

Figure 4 for TextCaps: a Dataset for Image Captioning with Reading Comprehension

Abstract:Image descriptions can help visually impaired people to quickly understand the image content. While we made significant progress in automatically describing images and optical character recognition, current approaches are unable to include written text in their descriptions, although text is omnipresent in human environments and frequently critical to understand our surroundings. To study how to comprehend text in the context of an image we collect a novel dataset, TextCaps, with 145k captions for 28k images. Our dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase, requiring spatial, semantic, and visual reasoning between multiple text tokens and visual entities, such as objects. We study baselines and adapt existing approaches to this new task, which we refer to as image captioning with reading comprehension. Our analysis with automatic and human studies shows that our new TextCaps dataset provides many new technical challenges over previous datasets.

* https://textvqa.org/textcaps

Via

Access Paper or Ask Questions

Adversarial Continual Learning

Mar 21, 2020

Sayna Ebrahimi, Franziska Meier, Roberto Calandra, Trevor Darrell, Marcus Rohrbach

Figure 1 for Adversarial Continual Learning

Figure 2 for Adversarial Continual Learning

Figure 3 for Adversarial Continual Learning

Figure 4 for Adversarial Continual Learning

Abstract:Continual learning aims to learn new tasks without forgetting previously learned ones. We hypothesize that representations learned to solve each task in a sequence have a shared structure while containing some task-specific properties. We show that shared features are significantly less prone to forgetting and propose a novel hybrid continual learning framework that learns a disjoint representation for task-invariant and task-specific features required to solve a sequence of tasks. Our model combines architecture growth to prevent forgetting of task-specific skills and an experience replay approach to preserve shared skills. We demonstrate our hybrid approach is effective in avoiding forgetting and show it is superior to both architecture-based and memory-based approaches on class incrementally learning of a single dataset as well as a sequence of multiple datasets in image classification. Our code is available at \url{https://github.com/facebookresearch/Adversarial-Continual-Learning}.

Via

Access Paper or Ask Questions

In Defense of Grid Features for Visual Question Answering

Jan 10, 2020

Huaizu Jiang, Ishan Misra, Marcus Rohrbach, Erik Learned-Miller, Xinlei Chen

Figure 1 for In Defense of Grid Features for Visual Question Answering

Figure 2 for In Defense of Grid Features for Visual Question Answering

Figure 3 for In Defense of Grid Features for Visual Question Answering

Figure 4 for In Defense of Grid Features for Visual Question Answering

Abstract:Popularized as 'bottom-up' attention, bounding box (or region) based visual features have recently surpassed vanilla grid-based convolutional features as the de facto standard for vision and language tasks like visual question answering (VQA). However, it is not clear whether the advantages of regions (e.g. better localization) are the key reasons for the success of bottom-up attention. In this paper, we revisit grid features for VQA and find they can work surprisingly well-running more than an order of magnitude faster with the same accuracy. Through extensive experiments, we verify that this observation holds true across different VQA models, datasets, and generalizes well to other tasks like image captioning. As grid features make the model design and training process much simpler, this enables us to train them end-to-end and also use a more flexible network design. We learn VQA models end-to-end, from pixels directly to answers, and show that strong performance is achievable without using any region annotations in pre-training. We hope our findings help further improve the scientific understanding and the practical application of VQA. Code and features will be made available.

Via

Access Paper or Ask Questions

Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA

Dec 05, 2019

Ronghang Hu, Amanpreet Singh, Trevor Darrell, Marcus Rohrbach

Figure 1 for Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA

Figure 2 for Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA

Figure 3 for Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA

Figure 4 for Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA

Abstract:Many visual scenes contain text that carries crucial information, and it is thus essential to understand text in images for downstream reasoning tasks. For example, a deep water label on a warning sign warns people about the danger in the scene. Recent work has explored the TextVQA task that requires reading and understanding text in images to answer a question. However, existing approaches for TextVQA are mostly based on custom pairwise fusion mechanisms between a pair of two modalities and are restricted to a single prediction step by casting TextVQA as a classification task. In this work, we propose a novel model for the TextVQA task based on a multimodal transformer architecture accompanied by a rich representation for text in images. Our model naturally fuses different modalities homogeneously by embedding them into a common semantic space where self-attention is applied to model inter- and intra- modality context. Furthermore, it enables iterative answer decoding with a dynamic pointer network, allowing the model to form an answer through multi-step prediction instead of one-step classification. Our model outperforms existing approaches on three benchmark datasets for the TextVQA task by a large margin.

Via

Access Paper or Ask Questions

12-in-1: Multi-Task Vision and Language Representation Learning

Dec 05, 2019

Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, Stefan Lee

Figure 1 for 12-in-1: Multi-Task Vision and Language Representation Learning

Figure 2 for 12-in-1: Multi-Task Vision and Language Representation Learning

Figure 3 for 12-in-1: Multi-Task Vision and Language Representation Learning

Figure 4 for 12-in-1: Multi-Task Vision and Language Representation Learning

Abstract:Much of vision-and-language research focuses on a small but diverse set of independent tasks and supporting datasets often studied in isolation; however, the visually-grounded language understanding skills required for success at these tasks overlap significantly. In this work, we investigate these relationships between vision-and-language tasks by developing a large-scale, multi-task training regime. Our approach culminates in a single model on 12 datasets from four broad categories of task including visual question answering, caption-based image retrieval, grounding referring expressions, and multi-modal verification. Compared to independently trained single-task models, this represents a reduction from approximately 3 billion parameters to 270 million while simultaneously improving performance by 2.05 points on average across tasks. We use our multi-task framework to perform in-depth analysis of the effect of joint training diverse tasks. Further, we show that finetuning task-specific models from our single multi-task model can lead to further improvements, achieving performance at or above the state-of-the-art.

* Jiasen Lu and Vedanuj Goswami contributed equally to this work

Via

Access Paper or Ask Questions

Decoupling Representation and Classifier for Long-Tailed Recognition

Oct 21, 2019

Bingyi Kang, Saining Xie, Marcus Rohrbach, Zhicheng Yan, Albert Gordo, Jiashi Feng, Yannis Kalantidis

Figure 1 for Decoupling Representation and Classifier for Long-Tailed Recognition

Figure 2 for Decoupling Representation and Classifier for Long-Tailed Recognition

Figure 3 for Decoupling Representation and Classifier for Long-Tailed Recognition

Figure 4 for Decoupling Representation and Classifier for Long-Tailed Recognition

Abstract:The long-tail distribution of the visual world poses great challenges for deep learning based classification models on how to handle the class imbalance problem. Existing solutions usually involve class-balancing strategies, e.g., by loss re-weighting, data re-sampling, or transfer learning from head- to tail-classes, but most of them adhere to the scheme of jointly learning representations and classifiers. In this work, we decouple the learning procedure into representation learning and classification, and systematically explore how different balancing strategies affect them for long-tailed recognition. The findings are surprising: (1) data imbalance might not be an issue in learning high-quality representations; (2) with representations learned with the simplest instance-balanced (natural) sampling, it is also possible to achieve strong long-tailed recognition ability at little cost by adjusting only the classifier. We conduct extensive experiments and set new state-of-the-art performance on common long-tailed benchmarks like ImageNet-LT, Places-LT and iNaturalist, showing that it is possible to outperform carefully designed losses, sampling strategies, even complex modules with memory, by using a straightforward approach that decouples representation and classification.

Via

Access Paper or Ask Questions

Uncertainty-guided Continual Learning with Bayesian Neural Networks

Jun 06, 2019

Sayna Ebrahimi, Mohamed Elhoseiny, Trevor Darrell, Marcus Rohrbach

Figure 1 for Uncertainty-guided Continual Learning with Bayesian Neural Networks

Figure 2 for Uncertainty-guided Continual Learning with Bayesian Neural Networks

Figure 3 for Uncertainty-guided Continual Learning with Bayesian Neural Networks

Figure 4 for Uncertainty-guided Continual Learning with Bayesian Neural Networks

Abstract:Continual learning aims to learn new tasks without forgetting previously learned ones. This is especially challenging when one cannot access data from previous tasks and when the model has a fixed capacity. Current regularization-based continual learning algorithms need an external representation and extra computation to measure the parameters' importance. In contrast, we propose Uncertainty-guided Continual Bayesian Neural Networks (UCB), where the learning rate adapts according to the uncertainty defined in the probability distribution of the weights in networks. Uncertainty is a natural way to identify what to remember and what to change as we continually learn, allowing to mitigate catastrophic forgetting. We also show a variant of our model, which uses uncertainty for weight pruning and retains task performance after pruning by saving binary masks per tasks. We evaluate our UCB approach extensively on diverse object classification datasets with short and long sequences of tasks and report superior or on-par performance compared to existing approaches. Additionally, we show that our model does not necessarily need task information at test time, i.e. it does not presume knowledge of which task a sample belongs to.

Via

Access Paper or Ask Questions

Learning to Generate Grounded Image Captions without Localization Supervision

Jun 01, 2019

Chih-Yao Ma, Yannis Kalantidis, Ghassan AlRegib, Peter Vajda, Marcus Rohrbach, Zsolt Kira

Figure 1 for Learning to Generate Grounded Image Captions without Localization Supervision

Figure 2 for Learning to Generate Grounded Image Captions without Localization Supervision

Figure 3 for Learning to Generate Grounded Image Captions without Localization Supervision

Figure 4 for Learning to Generate Grounded Image Captions without Localization Supervision

Abstract:When generating a sentence description for an image, it frequently remains unclear how well the generated caption is grounded in the image or if the model hallucinates based on priors in the dataset and/or the language model. The most common way of relating image regions with words in caption models is through an attention mechanism over the regions that is used as input to predict the next word. The model must therefore learn to predict the attention without knowing the word it should localize. In this work, we propose a novel cyclical training regimen that forces the model to localize each word in the image after the sentence decoder generates it and then reconstruct the sentence from the localized image region(s) to match the ground-truth. The initial decoder and the proposed reconstructor share parameters during training and are learned jointly with the localizer, allowing the model to regularize the attention mechanism. Our proposed framework only requires learning one extra fully-connected layer (the localizer), a layer that can be removed at test time. We show that our model significantly improves grounding accuracy without relying on grounding supervision or introducing extra computation during inference.

Via

Access Paper or Ask Questions

Towards VQA Models That Can Read

May 13, 2019

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, Marcus Rohrbach

Figure 1 for Towards VQA Models That Can Read

Figure 2 for Towards VQA Models That Can Read

Figure 3 for Towards VQA Models That Can Read

Figure 4 for Towards VQA Models That Can Read

Abstract:Studies have shown that a dominant class of questions asked by visually impaired users on images of their surroundings involves reading text in the image. But today's VQA models can not read! Our paper takes a first step towards addressing this problem. First, we introduce a new "TextVQA" dataset to facilitate progress on this important problem. Existing datasets either have a small proportion of questions about text (e.g., the VQA dataset) or are too small (e.g., the VizWiz dataset). TextVQA contains 45,336 questions on 28,408 images that require reasoning about text to answer. Second, we introduce a novel model architecture that reads text in the image, reasons about it in the context of the image and the question, and predicts an answer which might be a deduction based on the text and the image or composed of the strings found in the image. Consequently, we call our approach Look, Read, Reason & Answer (LoRRA). We show that LoRRA outperforms existing state-of-the-art VQA models on our TextVQA dataset. We find that the gap between human performance and machine performance is significantly larger on TextVQA than on VQA 2.0, suggesting that TextVQA is well-suited to benchmark progress along directions complementary to VQA 2.0.

* CVPR 2019

Via

Access Paper or Ask Questions

Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution

Apr 30, 2019

Yunpeng Chen, Haoqi Fan, Bing Xu, Zhicheng Yan, Yannis Kalantidis, Marcus Rohrbach, Shuicheng Yan, Jiashi Feng

Figure 1 for Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution

Figure 2 for Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution

Figure 3 for Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution

Figure 4 for Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution

Abstract:In natural images, information is conveyed at different frequencies where higher frequencies are usually encoded with fine details and lower frequencies are usually encoded with global structures. Similarly, the output feature maps of a convolution layer can also be seen as a mixture of information at different frequencies. In this work, we propose to factorize the mixed feature maps by their frequencies and design a novel Octave Convolution (OctConv) operation to store and process feature maps that vary spatially "slower" at a lower spatial resolution reducing both memory and computation cost. Unlike existing multi-scale meth-ods, OctConv is formulated as a single, generic, plug-and-play convolutional unit that can be used as a direct replacement of (vanilla) convolutions without any adjustments in the network architecture. It is also orthogonal and complementary to methods that suggest better topologies or reduce channel-wise redundancy like group or depth-wise convolutions. We experimentally show that by simply replacing con-volutions with OctConv, we can consistently boost accuracy for both image and video recognition tasks, while reducing memory and computational cost. An OctConv-equipped ResNet-152 can achieve 82.9% top-1 classification accuracy on ImageNet with merely 22.2 GFLOPs.

Via

Access Paper or Ask Questions