Alexander G. Schwing

Diverse and Coherent Paragraph Generation from Images

Sep 03, 2018

Moitreya Chatterjee, Alexander G. Schwing

Abstract: Paragraph generation from images, which has gained popularity recently, is an important task for video summarization, editing, and support of the disabled. Traditional image captioning methods fall short on this front, since they aren't designed to generate long informative descriptions. Moreover, the vanilla approach of simply concatenating multiple short sentences, possibly synthesized from a classical image captioning system, doesn't embrace the intricacies of paragraphs: coherent sentences, globally consistent structure, and diversity. To address these challenges, we propose to augment paragraph generation techniques with 'coherence vectors', 'global topic vectors', and modeling of the inherent ambiguity of associating paragraphs with images, via a variational auto-encoder formulation. We demonstrate the effectiveness of the developed approach on two datasets, outperforming existing state-of-the-art techniques on both.

* Camera Ready Version of ECCV 2018 paper; Coupled with supplementary

Interpretable and Globally Optimal Prediction for Textual Grounding using Image Concepts

Mar 29, 2018

Raymond A. Yeh, Jinjun Xiong, Wen-mei W. Hwu, Minh N. Do, Alexander G. Schwing

Abstract: Textual grounding is an important but challenging task for human-computer interaction, robotics and knowledge mining. Existing algorithms generally formulate the task as selection from a set of bounding box proposals obtained from deep-net-based systems. In this work, we demonstrate that we can cast the problem of textual grounding into a unified framework that permits efficient search over all possible bounding boxes. Hence, the method is able to consider significantly more proposals and doesn't rely on a successful first stage that hypothesizes bounding box proposals. Beyond this, we demonstrate that the trained parameters of our model can be used as word embeddings which capture spatial-image relationships and provide interpretability. Lastly, at the time of submission, our approach outperformed the current state-of-the-art methods on the Flickr 30k Entities and the ReferItGame datasets by 3.08% and 7.77% respectively.

* Accepted to NIPS 2017

MaskRNN: Instance Level Video Object Segmentation

Mar 29, 2018

Yuan-Ting Hu, Jia-Bin Huang, Alexander G. Schwing

Abstract: Instance level video object segmentation is an important technique for video editing and compression. To capture temporal coherence, we develop MaskRNN, a recurrent neural net approach which fuses in each frame the output of two deep nets for each object instance: a binary segmentation net providing a mask and a localization net providing a bounding box. Due to the recurrent component and the localization component, our method is able to take advantage of long-term temporal structures of the video data and to reject outliers. We validate the proposed algorithm on three challenging benchmark datasets, the DAVIS-2016 dataset, the DAVIS-2017 dataset, and the Segtrack v2 dataset, achieving state-of-the-art performance on all of them.

* Accepted to NIPS 2017
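
The fusion step described in the MaskRNN abstract, combining a binary mask with a bounding box for one object instance, might look roughly like the sketch below. It assumes the simplest possible rule: mask pixels outside the predicted box are treated as outliers and dropped. The abstract does not state the actual fusion rule, so the function name `fuse_mask_with_box` and the box convention are hypothetical.

```python
from typing import List, Tuple

Mask = List[List[int]]            # binary mask, mask[row][col] in {0, 1}
Box = Tuple[int, int, int, int]   # (top, left, bottom, right); bottom/right exclusive


def fuse_mask_with_box(mask: Mask, box: Box) -> Mask:
    """Keep only the mask pixels that fall inside the bounding box.

    Assumption for illustration: pixels outside the box are outliers.
    """
    top, left, bottom, right = box
    return [
        [
            value if (top <= r < bottom and left <= c < right) else 0
            for c, value in enumerate(row)
        ]
        for r, row in enumerate(mask)
    ]


# Toy 4x4 mask with two stray foreground pixels in the corners.
mask = [
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
]
fused = fuse_mask_with_box(mask, (1, 1, 3, 3))
```

Here the two corner pixels lie outside the box and are removed, while the central 2x2 region is kept.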

Unsupervised Textual Grounding: Linking Words to Image Concepts

Mar 29, 2018

Raymond A. Yeh, Minh N. Do, Alexander G. Schwing

Abstract: Textual grounding, i.e., linking words to objects in images, is a challenging but important task for robotics and human-computer interaction. Existing techniques benefit from recent progress in deep learning and generally formulate the task as a supervised learning problem, selecting a bounding box from a set of possible options. To train these deep-net-based approaches, access to large-scale datasets is required; however, constructing such a dataset is time-consuming and expensive. Therefore, we develop a completely unsupervised mechanism for textual grounding, using hypothesis testing to link words to detected image concepts. We demonstrate our approach on the ReferIt Game dataset and the Flickr30k data, outperforming baselines by 7.98% and 6.96% respectively.

* Accepted to CVPR 2018

Diverse and Accurate Image Description Using a Variational Auto-Encoder with an Additive Gaussian Encoding Space

Nov 19, 2017

Liwei Wang, Alexander G. Schwing, Svetlana Lazebnik

Abstract: This paper explores image caption generation using conditional variational auto-encoders (CVAEs). Standard CVAEs with a fixed Gaussian prior yield descriptions with too little variability. Instead, we propose two models that explicitly structure the latent space around $K$ components corresponding to different types of image content, and combine components to create priors for images that contain multiple types of content simultaneously (e.g., several kinds of objects). Our first model uses a Gaussian Mixture model (GMM) prior, while the second one defines a novel Additive Gaussian (AG) prior that linearly combines component means. We show that both models produce captions that are more diverse and more accurate than a strong LSTM baseline or a "vanilla" CVAE with a fixed Gaussian prior, with AG-CVAE showing particular promise.
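
To make "linearly combines component means" concrete, the sketch below computes a prior mean as a weighted sum of $K$ component means. The abstract does not say how the weights are chosen, so the equal weighting over the content types present in the image, along with the function name and the toy numbers, are assumptions for illustration only.

```python
from typing import List


def additive_gaussian_prior_mean(
    component_means: List[List[float]], weights: List[float]
) -> List[float]:
    """Return sum_k weights[k] * component_means[k]."""
    if len(component_means) != len(weights):
        raise ValueError("need one weight per component")
    dim = len(component_means[0])
    mean = [0.0] * dim
    for mu_k, c_k in zip(component_means, weights):
        for d in range(dim):
            mean[d] += c_k * mu_k[d]
    return mean


# K = 3 hypothetical content types, each with a 2-D latent mean.
means = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
# Assumed weighting: the image contains content types 0 and 1, weighted equally.
weights = [0.5, 0.5, 0.0]
prior_mean = additive_gaussian_prior_mean(means, weights)
```

With these toy values the prior mean lands halfway between the means of the two content types that are present.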

High-Order Attention Models for Visual Question Answering

Nov 12, 2017

Idan Schwartz, Alexander G. Schwing, Tamir Hazan

Abstract: The quest for algorithms that enable cognitive abilities is an important part of machine learning. A common trait in many recently investigated cognitive-like tasks is that they take into account different data modalities, such as visual and textual input. In this paper we propose a novel and generally applicable form of attention mechanism that learns high-order correlations between various data modalities. We show that high-order correlations effectively direct the appropriate attention to the relevant elements in the different data modalities that are required to solve the joint task. We demonstrate the effectiveness of our high-order attention mechanism on the task of visual question answering (VQA), where we achieve state-of-the-art performance on the standard VQA dataset.

* 9 pages, 8 figures, NIPS 2017

Semantic Image Inpainting with Deep Generative Models

Jul 13, 2017

Raymond A. Yeh, Chen Chen, Teck Yian Lim, Alexander G. Schwing, Mark Hasegawa-Johnson, Minh N. Do

Abstract: Semantic image inpainting is a challenging task where large missing regions have to be filled based on the available visual data. Existing methods which extract information from only a single image generally produce unsatisfactory results due to the lack of high level context. In this paper, we propose a novel method for semantic image inpainting, which generates the missing content by conditioning on the available data. Given a trained generative model, we search for the closest encoding of the corrupted image in the latent image manifold using our context and prior losses. This encoding is then passed through the generative model to infer the missing content. In our method, inference is possible irrespective of how the missing content is structured, while the state-of-the-art learning based method requires specific information about the holes in the training phase. Experiments on three datasets show that our method successfully predicts information in large missing regions and achieves pixel-level photorealism, significantly outperforming the state-of-the-art methods.
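
The latent search described in the inpainting abstract can be illustrated with a toy stand-in. The sketch below assumes a fixed linear "generator" `G(z) = W z` in place of a trained deep generative model, a masked squared error as the context loss, and an L2 penalty on `z` as a placeholder for the prior loss. None of these choices come from the abstract, and the names `generate` and `inpaint` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def generate(W: Matrix, z: List[float]) -> List[float]:
    """Toy generator: a fixed linear map from latent code to image pixels."""
    return [sum(w * zj for w, zj in zip(row, z)) for row in W]


def inpaint(
    W: Matrix,
    corrupted: List[float],
    known: List[int],
    lam: float = 0.01,
    lr: float = 0.1,
    steps: int = 500,
) -> List[float]:
    """Gradient descent on z over: context loss + lam * prior placeholder.

    context loss = sum over known pixels of (G(z) - corrupted)^2
    prior placeholder = ||z||^2  (assumption, stands in for a learned prior loss)
    """
    z = [0.0] * len(W[0])
    for _ in range(steps):
        output = generate(W, z)
        # Residual only on pixels marked as known; missing pixels contribute 0.
        residual = [k * (g - t) for g, t, k in zip(output, corrupted, known)]
        grad = [
            2.0 * sum(W[i][j] * residual[i] for i in range(len(W)))
            + 2.0 * lam * z[j]
            for j in range(len(z))
        ]
        z = [zj - lr * gj for zj, gj in zip(z, grad)]
    return generate(W, z)


# A 3-pixel "image" generated from latent code (1, 2) is [1, 2, 3].
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
corrupted = [1.0, 2.0, 0.0]   # third pixel is missing (placeholder value 0)
known = [1, 1, 0]             # 1 = observed pixel, 0 = hole
restored = inpaint(W, corrupted, known)
```

Because the loss only looks at observed pixels, the same search works for any hole shape; the missing third pixel is recovered from the latent code that best explains the first two.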

Learning to Play in a Day: Faster Deep Reinforcement Learning by Optimality Tightening

Nov 05, 2016

Frank S. He, Yang Liu, Alexander G. Schwing, Jian Peng

Abstract: We propose a novel training algorithm for reinforcement learning which combines the strength of deep Q-learning with a constrained optimization approach to tighten optimality and encourage faster reward propagation. Our novel technique makes deep reinforcement learning more practical by drastically reducing the training time. We evaluate the performance of our approach on the 49 games of the challenging Arcade Learning Environment, and report significant improvements in both training time and accuracy.

Training Deep Neural Networks via Direct Loss Minimization

Jun 02, 2016

Yang Song, Alexander G. Schwing, Richard S. Zemel, Raquel Urtasun

Abstract: Supervised training of deep neural nets typically relies on minimizing cross-entropy. However, in many domains, we are interested in performing well on metrics specific to the application. In this paper we propose a direct loss minimization approach to train deep neural networks, which provably minimizes the application-specific loss function. This is often non-trivial, since these functions are neither smooth nor decomposable and thus are not amenable to optimization with standard gradient-based methods. We demonstrate the effectiveness of our approach in the context of maximizing average precision for ranking problems. Towards this goal, we develop a novel dynamic programming algorithm that can efficiently compute the weight updates. Our approach proves superior to a variety of baselines in the context of action classification and object detection, especially in the presence of label noise.

* ICML 2016
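
For readers unfamiliar with the metric named in the direct loss minimization abstract, the sketch below computes average precision for a ranked list using the common definition: the mean of precision-at-rank over the positions of positive items. It illustrates only the metric, not the paper's dynamic programming algorithm. The function name and example values are illustrative.

```python
from typing import List


def average_precision(scores: List[float], labels: List[int]) -> float:
    """Mean of precision@rank taken at the rank of each positive item."""
    # Rank items by descending score; only this ordering affects the result.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Positives are ranked 1st and 3rd: AP = (1/1 + 2/3) / 2.
ap = average_precision([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

Since the value depends on the scores only through their ordering, small score changes either leave it unchanged or make it jump, and it does not split into a sum of per-example terms. This matches the abstract's point that such metrics are neither smooth nor decomposable.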

Monocular Object Instance Segmentation and Depth Ordering with CNNs

Dec 18, 2015

Ziyu Zhang, Alexander G. Schwing, Sanja Fidler, Raquel Urtasun

Abstract: In this paper we tackle the problem of instance-level segmentation and depth ordering from a single monocular image. Towards this goal, we take advantage of convolutional neural nets and train them to directly predict instance-level segmentations where the instance ID encodes the depth ordering within image patches. To provide a coherent single explanation of an image we develop a Markov random field which takes as input the predictions of convolutional neural nets applied at overlapping patches of different resolutions, as well as the output of a connected component algorithm. It aims to predict accurate instance-level segmentation and depth ordering. We demonstrate the effectiveness of our approach on the challenging KITTI benchmark and show good performance on both tasks.

* International Conference on Computer Vision (ICCV), 2015