Jeff Donahue

Visual Discovery at Pinterest

Mar 25, 2017

Andrew Zhai, Dmitry Kislyuk, Yushi Jing, Michael Feng, Eric Tzeng, Jeff Donahue, Yue Li Du, Trevor Darrell

Abstract:Over the past three years Pinterest has experimented with several visual search and recommendation services, including Related Pins (2014), Similar Looks (2015), Flashlight (2016) and Lens (2017). This paper presents an overview of our visual discovery engine powering these services, and shares the rationales behind our technical and product decisions such as the use of object detection and interactive user interfaces. We conclude that this visual discovery engine significantly improves engagement in both search and recommendation tasks.

Visual Search at Pinterest

Mar 08, 2017

Yushi Jing, David Liu, Dmitry Kislyuk, Andrew Zhai, Jiajing Xu, Jeff Donahue, Sarah Tavel

Abstract:We demonstrate that, with the availability of distributed computation platforms such as Amazon Web Services and open-source tools, it is possible for a small engineering team to build, launch and maintain a cost-effective, large-scale visual search system with widely available tools. We also demonstrate, through a comprehensive set of live experiments at Pinterest, that content recommendation powered by visual search improves user engagement. By sharing our implementation details and the experiences learned from launching a commercial visual search engine from scratch, we hope visual search will be more widely incorporated into today's commercial applications.

* in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015

Context Encoders: Feature Learning by Inpainting

Nov 21, 2016

Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, Alexei A. Efros

Abstract:We present an unsupervised visual feature learning algorithm driven by context-based pixel prediction. By analogy with auto-encoders, we propose Context Encoders -- a convolutional neural network trained to generate the contents of an arbitrary image region conditioned on its surroundings. In order to succeed at this task, context encoders need to both understand the content of the entire image, as well as produce a plausible hypothesis for the missing part(s). When training context encoders, we have experimented with both a standard pixel-wise reconstruction loss, as well as a reconstruction plus an adversarial loss. The latter produces much sharper results because it can better handle multiple modes in the output. We found that a context encoder learns a representation that captures not just appearance but also the semantics of visual structures. We quantitatively demonstrate the effectiveness of our learned features for CNN pre-training on classification, detection, and segmentation tasks. Furthermore, context encoders can be used for semantic inpainting tasks, either stand-alone or as initialization for non-parametric methods.

* CVPR 2016
* New results on ImageNet Generation

Data-dependent Initializations of Convolutional Neural Networks

Sep 22, 2016

Philipp Krähenbühl, Carl Doersch, Jeff Donahue, Trevor Darrell

Abstract:Convolutional Neural Networks spread through computer vision like a wildfire, impacting almost all visual tasks imaginable. Despite this, few researchers dare to train their models from scratch. Most work builds on one of a handful of ImageNet pre-trained models, and fine-tunes or adapts these for specific tasks. This is in large part due to the difficulty of properly initializing these networks from scratch. A small miscalibration of the initial weights leads to vanishing or exploding gradients, as well as poor convergence properties. In this work we present a fast and simple data-dependent initialization procedure, that sets the weights of a network such that all units in the network train at roughly the same rate, avoiding vanishing or exploding gradients. Our initialization matches the current state-of-the-art unsupervised or self-supervised pre-training methods on standard computer vision tasks, such as image classification and object detection, while being roughly three orders of magnitude faster. When combined with pre-training methods, our initialization significantly outperforms prior work, narrowing the gap between supervised and unsupervised pre-training.

* ICLR 2016

Long-term Recurrent Convolutional Networks for Visual Recognition and Description

May 31, 2016

Jeff Donahue, Lisa Anne Hendricks, Marcus Rohrbach, Subhashini Venugopalan, Sergio Guadarrama, Kate Saenko, Trevor Darrell

Abstract:Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent, or "temporally deep", are effective for tasks involving sequences, visual and otherwise. We develop a novel recurrent convolutional architecture suitable for large-scale visual learning which is end-to-end trainable, and demonstrate the value of these models on benchmark video recognition tasks, image description and retrieval problems, and video narration challenges. In contrast to current models which assume a fixed spatio-temporal receptive field or simple temporal averaging for sequential processing, recurrent convolutional models are "doubly deep" in that they can be compositional in spatial and temporal "layers". Such models may have advantages when target concepts are complex and/or training data are limited. Learning long-term dependencies is possible when nonlinearities are incorporated into the network state updates. Long-term RNN models are appealing in that they directly can map variable-length inputs (e.g., video frames) to variable length outputs (e.g., natural language text) and can model complex temporal dynamics; yet they can be optimized with backpropagation. Our recurrent long-term models are directly connected to modern visual convnet models and can be jointly trained to simultaneously learn temporal dynamics and convolutional perceptual representations. Our results show such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined and/or optimized.

* Originally presented at CVPR 2015 (oral). Updated version (accepted as a TPAMI journal article) includes additional results

Generating Visual Explanations

Mar 28, 2016

Lisa Anne Hendricks, Zeynep Akata, Marcus Rohrbach, Jeff Donahue, Bernt Schiele, Trevor Darrell

Abstract:Clearly explaining a rationale for a classification decision to an end-user can be as important as the decision itself. Existing approaches for deep visual recognition are generally opaque and do not output any justification text; contemporary vision-language models can describe image content but fail to take into account class-discriminative image aspects which justify visual predictions. We propose a new model that focuses on the discriminating properties of the visible object, jointly predicts a class label, and explains why the predicted label is appropriate for the image. We propose a novel loss function based on sampling and reinforcement learning that learns to generate sentences that realize a global sentence property, such as class specificity. Our results on a fine-grained bird species classification dataset show that our model is able to generate explanations which are not only consistent with an image but also more discriminative than descriptions produced by existing captioning methods.

Sequence to Sequence -- Video to Text

Oct 19, 2015

Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, Kate Saenko

Abstract:Real-world videos often have complex dynamics, and methods for generating open-domain video descriptions should be sensitive to temporal structure and allow both input (sequence of frames) and output (sequence of words) of variable length. To approach this problem, we propose a novel end-to-end sequence-to-sequence model to generate captions for videos. For this we exploit recurrent neural networks, specifically LSTMs, which have demonstrated state-of-the-art performance in image caption generation. Our LSTM model is trained on video-sentence pairs and learns to associate a sequence of video frames to a sequence of words in order to generate a description of the event in the video clip. Our model naturally is able to learn the temporal structure of the sequence of frames as well as the sequence model of the generated sentences, i.e. a language model. We evaluate several variants of our model that exploit different visual features on a standard set of YouTube videos and two movie description datasets (M-VAD and MPII-MD).

* ICCV 2015 camera-ready. Includes code, project page and LSMDC challenge results

Translating Videos to Natural Language Using Deep Recurrent Neural Networks

Apr 30, 2015

Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, Kate Saenko

Abstract:Solving the visual symbol grounding problem has long been a goal of artificial intelligence. The field appears to be advancing closer to this goal with recent breakthroughs in deep learning for natural language grounding in static images. In this paper, we propose to translate videos directly to sentences using a unified deep neural network with both convolutional and recurrent structure. Described video datasets are scarce, and most existing methods have been applied to toy domains with a small vocabulary of possible words. By transferring knowledge from 1.2M+ images with category labels and 100,000+ images with captions, our method is able to create sentence descriptions of open-domain videos with large vocabularies. We compare our approach with recent work using language generation metrics, subject, verb, and object prediction accuracy, and a human evaluation.

* NAACL-HLT 2015 camera ready

LSDA: Large Scale Detection Through Adaptation

Nov 01, 2014

Judy Hoffman, Sergio Guadarrama, Eric Tzeng, Ronghang Hu, Jeff Donahue, Ross Girshick, Trevor Darrell, Kate Saenko

Abstract:A major challenge in scaling object detection is the difficulty of obtaining labeled images for large numbers of categories. Recently, deep convolutional neural networks (CNNs) have emerged as clear winners on object classification benchmarks, in part due to training with 1.2M+ labeled classification images. Unfortunately, only a small fraction of those labels are available for the detection task. It is much cheaper and easier to collect large quantities of image-level labels from search engines than it is to collect detection data and label it with precise bounding boxes. In this paper, we propose Large Scale Detection through Adaptation (LSDA), an algorithm which learns the difference between the two tasks and transfers this knowledge to classifiers for categories without bounding box annotated data, turning them into detectors. Our method has the potential to enable detection for the tens of thousands of categories that lack bounding box annotations, yet have plenty of classification data. Evaluation on the ImageNet LSVRC-2013 detection challenge demonstrates the efficacy of our approach. This algorithm enables us to produce a >7.6K detector by using available classification data from leaf nodes in the ImageNet tree. We additionally demonstrate how to modify our architecture to produce a fast detector (running at 2fps for the 7.6K detector). Models and software are available at

* Neural Information Processing Systems (NIPS) 2014

Rich feature hierarchies for accurate object detection and semantic segmentation

Oct 22, 2014

Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik

Abstract:Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012, achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also compare R-CNN to OverFeat, a recently proposed sliding-window detector based on a similar CNN architecture. We find that R-CNN outperforms OverFeat by a large margin on the 200-class ILSVRC2013 detection dataset. Source code for the complete system is available at http://www.cs.berkeley.edu/~rbg/rcnn.

* Extended version of our CVPR 2014 paper; latest update (v5) includes results using deeper networks (see Appendix G. Changelog)
