Philip H. S. Torr

University of Oxford

FlipDial: A Generative Model for Two-Way Visual Dialogue

Apr 03, 2018

Daniela Massiceti, N. Siddharth, Puneet K. Dokania, Philip H. S. Torr

Abstract: We present FlipDial, a generative model for visual dialogue that simultaneously plays the role of both participants in a visually-grounded dialogue. Given context in the form of an image and an associated caption summarising the contents of the image, FlipDial learns both to answer questions and to pose them, and can generate entire sequences of dialogue (question-answer pairs) that are diverse and relevant to the image. To do this, FlipDial relies on a simple but surprisingly powerful idea: it uses convolutional neural networks (CNNs) to encode entire dialogues directly, implicitly capturing dialogue context, and conditional VAEs to learn the generative model. FlipDial outperforms the state-of-the-art model in the sequential answering task (one-way visual dialogue) on the VisDial dataset by 5 points in Mean Rank using the generated answers. We are the first to extend this paradigm to full two-way visual dialogue, where our model generates both questions and answers in sequence based on a visual input, and for which we propose a set of novel evaluation measures and metrics.

Learning to Compare: Relation Network for Few-Shot Learning

Mar 27, 2018

Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip H. S. Torr, Timothy M. Hospedales

Abstract: We present a conceptually simple, flexible, and general framework for few-shot learning, where a classifier must learn to recognise new classes given only a few examples from each. Our method, called the Relation Network (RN), is trained end-to-end from scratch. During meta-learning, it learns to learn a deep distance metric to compare a small number of images within episodes, each of which is designed to simulate the few-shot setting. Once trained, an RN is able to classify images of new classes by computing relation scores between query images and the few examples of each new class without further updating the network. Besides providing improved performance on few-shot learning, our framework is easily extended to zero-shot learning. Extensive experiments on five benchmarks demonstrate that our simple approach is unified and effective for both tasks.

* To appear in CVPR2018
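
The classification step described in the abstract above (score a query against the few examples of each new class, then pick the class with the highest relation score) can be sketched roughly as follows. This is only an illustration: `embed` and `relation_score` are hypothetical stand-ins for the learned embedding and relation modules, and pooling a class's examples by their mean is an assumption of this sketch, not a detail taken from the abstract.

```python
import math

def embed(x):
    # Hypothetical stand-in for the learned embedding module (identity here).
    return list(x)

def relation_score(support_feat, query_feat):
    # Hypothetical stand-in for the learned relation module.
    # A trained network would produce this score; here a fixed function
    # of squared distance maps each pair to a value in (0, 1].
    sq_dist = sum((s - q) ** 2 for s, q in zip(support_feat, query_feat))
    return math.exp(-sq_dist)

def classify(query, support_set):
    """support_set maps a class label to its few example vectors."""
    q = embed(query)
    scores = {}
    for label, examples in support_set.items():
        feats = [embed(e) for e in examples]
        # Assumption: pool the shots of a class by their element-wise mean.
        pooled = [sum(col) / len(feats) for col in zip(*feats)]
        scores[label] = relation_score(pooled, q)
    # Predict the class whose examples relate most strongly to the query.
    best = max(scores, key=scores.get)
    return best, scores

support = {
    "cat": [[0.0, 0.0], [0.2, 0.0]],
    "dog": [[1.0, 1.0], [1.0, 0.8]],
}
label, scores = classify([0.1, 0.1], support)
print(label, scores)
```

Note that no parameters are updated when the new classes are introduced; only the support examples change, which matches the "without further updating the network" property claimed in the abstract.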

Three Birds One Stone: A Unified Framework for Salient Object Segmentation, Edge Detection and Skeleton Extraction

Mar 27, 2018

Qibin Hou, Jiangjiang Liu, Ming-Ming Cheng, Ali Borji, Philip H. S. Torr

Abstract: In this paper, we aim to solve pixel-wise binary problems, including salient object segmentation, skeleton extraction, and edge detection, by introducing a unified architecture. Previous works have proposed tailored methods for solving each of the three tasks independently. Here, we show that these tasks share some similarities that can be exploited for developing a unified framework. In particular, we introduce a horizontal cascade, each component of which is densely connected to the outputs of the previous component. Stringing these components together allows us to exploit features across different levels hierarchically and so address the multiple pixel-wise binary regression tasks effectively. To assess the performance of our proposed network on these tasks, we carry out exhaustive evaluations on multiple representative datasets. Although these tasks are inherently very different, we show that our unified approach performs very well on all of them and works far better than current single-purpose state-of-the-art methods. All the code in this paper will be made publicly available.

* Submitted to ECCV2018

WebSeg: Learning Semantic Segmentation from Web Searches

Mar 27, 2018

Qibin Hou, Ming-Ming Cheng, Jiangjiang Liu, Philip H. S. Torr

Abstract: In this paper, we improve semantic segmentation by automatically learning from Flickr images associated with a particular keyword, without relying on any explicit user annotations, thus substantially alleviating the dependence on accurate annotations when compared to previous weakly supervised methods. To solve such a challenging problem, we leverage several low-level cues (such as saliency and edges) to help generate a proxy ground truth. Due to the diversity of web-crawled images, we anticipate a large amount of 'label noise' in which other objects might be present. We design an online noise filtering scheme which is able to deal with this label noise, especially in cluttered images. We use this filtering strategy as an auxiliary module to assist the segmentation network in learning cleaner proxy annotations. Extensive experiments on the popular PASCAL VOC 2012 semantic segmentation benchmark show surprisingly good results in both our WebSeg (mIoU = 57.0%) and weakly supervised (mIoU = 63.3%) settings.

* Submitted to ECCV2018
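
For readers unfamiliar with the mIoU figures quoted in the abstract above, the metric can be sketched as below. This is a minimal illustration, not the benchmark's official evaluation code: the function name `mean_iou` is hypothetical, labels are assumed to be flat sequences of integer class ids, and details such as ignoring "void" pixels are left out.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes.

    pred and target are flat sequences of integer class labels,
    one entry per pixel (an assumption of this sketch).
    """
    ious = []
    for c in range(num_classes):
        # Pixels labelled c in both prediction and ground truth.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels labelled c in either prediction or ground truth.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes that appear in neither
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
print(mean_iou(pred, target, num_classes=2))
```

In this toy case class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.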

Devon: Deformable Volume Network for Learning Optical Flow

Feb 20, 2018

Yao Lu, Jack Valmadre, Heng Wang, Juho Kannala, Mehrtash Harandi, Philip H. S. Torr

Abstract: We propose a lightweight neural network model, Deformable Volume Network (Devon), for learning optical flow. Devon benefits from a multi-stage framework to iteratively refine its prediction. Each stage is by itself a neural network with an identical architecture. The optical flow between two stages is propagated with a newly proposed module, the deformable cost volume. The deformable cost volume does not distort the original images or their feature maps and therefore avoids the artifacts associated with warping, a common drawback in previous models. Devon has only one million parameters. Experiments show that Devon achieves comparable results to previous neural network models, despite its small size.

Real-Time Dense Stereo Matching With ELAS on FPGA Accelerated Embedded Devices

Feb 20, 2018

Oscar Rahnama, Duncan Frost, Ondrej Miksik, Philip H. S. Torr

Abstract: For many applications in low-power real-time robotics, stereo cameras are the sensors of choice for depth perception as they are typically cheaper and more versatile than their active counterparts. Their biggest drawback, however, is that they do not directly sense depth maps; instead, these must be estimated through data-intensive processes. Therefore, appropriate algorithm selection plays an important role in achieving the desired performance characteristics. Motivated by applications in space and mobile robotics, we implement and evaluate an FPGA-accelerated adaptation of the ELAS algorithm. Despite offering one of the best trade-offs between efficiency and accuracy, ELAS has only been shown to run at 1.5-3 fps on a high-end CPU. Our system preserves all the intriguing properties of the original algorithm, such as the slanted plane priors, but can achieve a frame rate of 47 fps whilst consuming under 4 W of power. Unlike previous FPGA-based designs, we take advantage of both components on the CPU/FPGA System-on-Chip to showcase the strategy necessary to accelerate more complex and computationally diverse algorithms for such low-power, real-time systems.

* 8 pages, 7 figures, 2 tables

Learning Disentangled Representations with Semi-Supervised Deep Generative Models

Nov 13, 2017

N. Siddharth, Brooks Paige, Jan-Willem van de Meent, Alban Desmaison, Noah D. Goodman, Pushmeet Kohli, Frank Wood, Philip H. S. Torr

Abstract: Variational autoencoders (VAEs) learn representations of data by jointly training a probabilistic encoder and decoder network. Typically these models encode all features of the data into a single variable. Here we are interested in learning disentangled representations that encode distinct aspects of the data into separate variables. We propose to learn such representations using model architectures that generalise from standard VAEs, employing a general graphical model structure in the encoder and decoder. This allows us to train partially-specified models that make relatively strong assumptions about a subset of interpretable variables and rely on the flexibility of neural networks to learn representations for the remaining variables. We further define a general objective for semi-supervised learning in this model class, which can be approximated using an importance sampling procedure. We evaluate our framework's ability to learn disentangled representations, both by qualitative exploration of its generative capacity, and quantitative evaluation of its discriminative ability on a variety of models and datasets.

* Accepted for publication at NIPS 2017
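
As background for the abstract above, the objective that a standard VAE maximises when jointly training its encoder and decoder is the evidence lower bound, shown here for a single latent variable. This is the textbook form only; the semi-supervised objective and graphical-model structure proposed in the paper are not reproduced here, and the symbols (encoder parameters \phi, decoder parameters \theta, latent z, data x) are notation assumed for this sketch.

```latex
\log p_\theta(x) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x)}\!\left[ \log p_\theta(x \mid z) \right]
\;-\; \mathrm{KL}\!\left( q_\phi(z \mid x) \,\|\, p(z) \right)
```

The first term rewards accurate reconstruction of x from z, and the second keeps the encoder's distribution close to the prior p(z).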

Holistic, Instance-Level Human Parsing

Sep 11, 2017

Qizhu Li, Anurag Arnab, Philip H. S. Torr

Abstract: Object parsing -- the task of decomposing an object into its semantic parts -- has traditionally been formulated as a category-level segmentation problem. Consequently, when there are multiple objects in an image, current methods cannot count the number of objects in the scene, nor can they determine which part belongs to which object. We address this problem by segmenting the parts of objects at the instance level, such that each pixel in the image is assigned a part label, as well as the identity of the object it belongs to. Moreover, we show how this approach also helps in obtaining segmentations at coarser granularities. Our proposed network is trained end-to-end given detections, and begins with a category-level segmentation module. Thereafter, a differentiable Conditional Random Field, defined over a variable number of instances for every input image, reasons about the identity of each part by associating it with a human detection. In contrast to other approaches, our method can handle a varying number of people in each image, and our holistic network produces state-of-the-art results in instance-level part and human segmentation, together with competitive results in category-level part segmentation, all achieved by a single forward pass through our neural network.

* Poster at BMVC 2017

Spatio-temporal Human Action Localisation and Instance Segmentation in Temporally Untrimmed Videos

Aug 06, 2017

Suman Saha, Gurkirt Singh, Michael Sapienza, Philip H. S. Torr, Fabio Cuzzolin

Abstract: Current state-of-the-art human action recognition is focused on the classification of temporally trimmed videos in which only one action occurs per frame. In this work we address the problem of action localisation and instance segmentation in which multiple concurrent actions of the same class may be segmented out of an image sequence. We cast the action tube extraction as an energy maximisation problem in which configurations of region proposals in each frame are assigned a cost and the best action tubes are selected via two passes of dynamic programming. One pass associates region proposals in space and time for each action category, and another pass is used to solve for the tube's temporal extent and to enforce a smooth label sequence through the video. In addition, by taking advantage of recent work on action foreground-background segmentation, we are able to associate each tube with class-specific segmentations. We demonstrate the performance of our algorithm on the challenging LIRIS-HARL dataset and achieve a new state-of-the-art result which is 14.3 times better than previous methods.

* Typos corrected
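
The first dynamic programming pass mentioned in the abstract above, which associates region proposals across frames, can be illustrated with a generic Viterbi-style recursion. This is a simplified sketch rather than the authors' formulation: the function name `link_proposals`, the per-proposal `scores`, and the pairwise `overlap` terms are all hypothetical, and the second pass (temporal extent and label smoothing) is not shown.

```python
def link_proposals(scores, overlap):
    """Pick one proposal per frame so that the total energy is maximal.

    scores[t][i]     : score of proposal i in frame t (hypothetical unary term)
    overlap[t][i][j] : pairwise term linking proposal i in frame t
                       to proposal j in frame t + 1 (hypothetical)
    Returns (best_energy, path), where path[t] is the chosen index in frame t.
    """
    best = list(scores[0])  # best energy of any path ending at each proposal
    back = []               # back-pointers, one list per transition
    for t in range(1, len(scores)):
        new_best, ptr = [], []
        for j, s in enumerate(scores[t]):
            # Energy of reaching proposal j from each proposal i in frame t - 1.
            cands = [best[i] + overlap[t - 1][i][j] for i in range(len(best))]
            i_star = max(range(len(cands)), key=cands.__getitem__)
            new_best.append(cands[i_star] + s)
            ptr.append(i_star)
        best = new_best
        back.append(ptr)
    # Trace the best path backwards from the best final proposal.
    j = max(range(len(best)), key=best.__getitem__)
    energy = best[j]
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return energy, path

scores = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
overlap = [
    [[0.1, 0.9], [0.5, 0.0]],  # frame 0 -> frame 1
    [[0.0, 0.2], [0.8, 0.1]],  # frame 1 -> frame 2
]
energy, path = link_proposals(scores, overlap)
print(energy, path)
```

The recursion costs time proportional to the number of frames times the square of the number of proposals per frame, instead of enumerating every possible path.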

Discovering Class-Specific Pixels for Weakly-Supervised Semantic Segmentation

Jul 18, 2017

Arslan Chaudhry, Puneet K. Dokania, Philip H. S. Torr

Abstract: We propose an approach to discover class-specific pixels for the weakly-supervised semantic segmentation task. We show that properly combining saliency and attention maps allows us to obtain reliable cues capable of significantly boosting the performance. First, we propose a simple yet powerful hierarchical approach to discover the class-agnostic salient regions, obtained using a salient object detector, which otherwise would be ignored. Second, we use fully convolutional attention maps to reliably localize the class-specific regions in a given image. We combine these two cues to discover class-specific pixels which are then used as an approximate ground truth for training a CNN. While solving the weakly supervised semantic segmentation task, we ensure that the image-level classification task is also solved in order to force the CNN to assign at least one pixel to each object present in the image. Experimentally, on the PASCAL VOC12 val and test sets, we obtain mIoU of 60.8% and 61.9%, performance gains of 5.1% and 5.2% over the published state-of-the-art results. The code is made publicly available.

* 28th British Machine Vision Conference (BMVC), 2017
