Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jacob Devlin

Language Models for Image Captioning: The Quirks and What Works

Oct 14, 2015

Jacob Devlin, Hao Cheng, Hao Fang, Saurabh Gupta, Li Deng, Xiaodong He, Geoffrey Zweig, Margaret Mitchell

Figure 1 for Language Models for Image Captioning: The Quirks and What Works

Figure 2 for Language Models for Image Captioning: The Quirks and What Works

Figure 3 for Language Models for Image Captioning: The Quirks and What Works

Figure 4 for Language Models for Image Captioning: The Quirks and What Works

Abstract:Two recent approaches have achieved state-of-the-art results in image captioning. The first uses a pipelined process where a set of candidate words is generated by a convolutional neural network (CNN) trained on images, and then a maximum entropy (ME) language model is used to arrange these words into a coherent sentence. The second uses the penultimate activation layer of the CNN as input to a recurrent neural network (RNN) that then generates the caption sequence. In this paper, we compare the merits of these different language modeling approaches for the first time by using the same state-of-the-art CNN as input. We examine issues in the different approaches, including linguistic irregularities, caption repetition, and data set overlap. By combining key aspects of the ME and RNN methods, we achieve a new record performance over previously published results on the benchmark COCO dataset. However, the gains we see in BLEU do not translate to human judgments.

* See http://research.microsoft.com/en-us/projects/image_captioning for project information

Via

Access Paper or Ask Questions

A Survey of Current Datasets for Vision and Language Research

Aug 19, 2015

Francis Ferraro, Nasrin Mostafazadeh, Ting-Hao, Huang, Lucy Vanderwende, Jacob Devlin, Michel Galley, Margaret Mitchell

Figure 1 for A Survey of Current Datasets for Vision and Language Research

Figure 2 for A Survey of Current Datasets for Vision and Language Research

Figure 3 for A Survey of Current Datasets for Vision and Language Research

Abstract:Integrating vision and language has long been a dream in work on artificial intelligence (AI). In the past two years, we have witnessed an explosion of work that brings together vision and language from images to videos and beyond. The available corpora have played a crucial role in advancing this area of research. In this paper, we propose a set of quality metrics for evaluating and analyzing the vision & language datasets and categorize them accordingly. Our analyses show that the most recent datasets have been using more complex language and more abstract concepts, however, there are different strengths and weaknesses in each.

* To appear in EMNLP 2015, short proceedings. Dataset analysis and discussion expanded, including an initial examination into reporting bias for one of them. F.F. and N.M. contributed equally to this work

Via

Access Paper or Ask Questions

Statistical Machine Translation Features with Multitask Tensor Networks

Jun 01, 2015

Hendra Setiawan, Zhongqiang Huang, Jacob Devlin, Thomas Lamar, Rabih Zbib, Richard Schwartz, John Makhoul

Figure 1 for Statistical Machine Translation Features with Multitask Tensor Networks

Figure 2 for Statistical Machine Translation Features with Multitask Tensor Networks

Figure 3 for Statistical Machine Translation Features with Multitask Tensor Networks

Figure 4 for Statistical Machine Translation Features with Multitask Tensor Networks

Abstract:We present a three-pronged approach to improving Statistical Machine Translation (SMT), building on recent success in the application of neural networks to SMT. First, we propose new features based on neural networks to model various non-local translation phenomena. Second, we augment the architecture of the neural network with tensor layers that capture important higher-order interaction among the network units. Third, we apply multitask learning to estimate the neural network parameters jointly. Each of our proposed methods results in significant improvements that are complementary. The overall improvement is +2.7 and +1.8 BLEU points for Arabic-English and Chinese-English translation over a state-of-the-art system that already includes neural network features.

* 11 pages (9 content + 2 references), 2 figures, accepted to ACL 2015 as a long paper

Via

Access Paper or Ask Questions

Exploring Nearest Neighbor Approaches for Image Captioning

May 17, 2015

Jacob Devlin, Saurabh Gupta, Ross Girshick, Margaret Mitchell, C. Lawrence Zitnick

Figure 1 for Exploring Nearest Neighbor Approaches for Image Captioning

Figure 2 for Exploring Nearest Neighbor Approaches for Image Captioning

Figure 3 for Exploring Nearest Neighbor Approaches for Image Captioning

Figure 4 for Exploring Nearest Neighbor Approaches for Image Captioning

Abstract:We explore a variety of nearest neighbor baseline approaches for image captioning. These approaches find a set of nearest neighbor images in the training set from which a caption may be borrowed for the query image. We select a caption for the query image by finding the caption that best represents the "consensus" of the set of candidate captions gathered from the nearest neighbor images. When measured by automatic evaluation metrics on the MS COCO caption evaluation server, these approaches perform as well as many recent approaches that generate novel captions. However, human studies show that a method that generates novel captions is still preferred over the nearest neighbor approach.

Via

Access Paper or Ask Questions