Chunhua Shen

The University of Adelaide

Deep Learning Features at Scale for Visual Place Recognition

Jan 18, 2017

Zetao Chen, Adam Jacobson, Niko Sunderhauf, Ben Upcroft, Lingqiao Liu, Chunhua Shen, Ian Reid, Michael Milford

Abstract: The success of deep learning techniques in the computer vision domain has triggered a range of initial investigations into their utility for visual place recognition, all using generic features from networks that were trained for other types of recognition tasks. In this paper, we train, at large scale, two CNN architectures for the specific place recognition task and employ a multi-scale feature encoding method to generate condition- and viewpoint-invariant features. To enable this training to occur, we have developed a massive Specific PlacEs Dataset (SPED) with hundreds of examples of place appearance change at thousands of different places, as opposed to the semantic place type datasets currently available. This new dataset enables us to set up a training regime that interprets place recognition as a classification problem. We comprehensively evaluate our trained networks on several challenging benchmark place recognition datasets and demonstrate that they achieve an average 10% increase in performance over other place recognition algorithms and pre-trained CNNs. By analyzing the network responses and their differences from pre-trained networks, we provide insights into what a network learns when training for place recognition, and what these results signify for future research in this area.

* 8 pages, 10 figures. Accepted by International Conference on Robotics and Automation (ICRA) 2017. This is the submitted version. The final published version may be slightly different

Compositional Model based Fisher Vector Coding for Image Classification

Jan 08, 2017

Lingqiao Liu, Peng Wang, Chunhua Shen, Lei Wang, Anton van den Hengel, Chao Wang, Heng Tao Shen

Abstract: Deriving from the gradient vector of a generative model of local features, Fisher vector coding (FVC) has been identified as an effective coding method for image classification. Most, if not all, FVC implementations employ the Gaussian mixture model (GMM) to depict the generation process of local features. However, the representative power of the GMM could be limited because it essentially assumes that local features can be characterized by a fixed number of feature prototypes, and the number of prototypes is usually small in FVC. To handle this limitation, in this paper we break the convention that assumes that a local feature is drawn from one of a few Gaussian distributions. Instead, we adopt a compositional mechanism which assumes that a local feature is drawn from a Gaussian distribution whose mean vector is composed as a linear combination of multiple key components, and the combination weight is a latent random variable. In this way, we can greatly enhance the representative power of the generative model of FVC. To implement our idea, we designed two particular generative models with such a compositional mechanism.

* Fixed typos. 16 pages. Appearing in IEEE T. Pattern Analysis and Machine Intelligence (TPAMI)

Cross-convolutional-layer Pooling for Image Recognition

Dec 22, 2016

Lingqiao Liu, Chunhua Shen, Anton van den Hengel

Abstract: Recent studies have shown that a Deep Convolutional Neural Network (DCNN) pretrained on a large image dataset can be used as a universal image descriptor, and that doing so leads to impressive performance for a variety of image classification tasks. Most of these studies adopt activations from a single DCNN layer, usually the fully-connected layer, as the image representation. In this paper, we propose a novel way to extract image representations from two consecutive convolutional layers: one layer is utilized for local feature extraction and the other serves as guidance to pool the extracted features. By taking different viewpoints of convolutional layers, we further develop two schemes to realize this idea. The first one directly uses convolutional layers from a DCNN. The second one applies the pretrained CNN on densely sampled image regions and treats the fully-connected activations of each image region as convolutional feature activations. We then train another convolutional layer on top of that as the pooling-guidance convolutional layer. By applying our method to three popular visual classification tasks, we find that our first scheme tends to perform better on applications that need strong discrimination on subtle object patterns within small regions, while the latter excels in cases that require discrimination on category-level patterns. Overall, the proposed method achieves superior performance over existing ways of extracting image representations from a DCNN.

* Fixed typos. Journal extension of arXiv:1411.7466. Accepted to IEEE Transactions on Pattern Analysis and Machine Intelligence
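
The pooling idea in the abstract above (one layer supplies local features, the next layer supplies the pooling weights) can be illustrated with a minimal sketch. It assumes both layers have been flattened to the same number of spatial locations and that each channel of the guidance layer weights a sum of the local features; the function name `cross_layer_pool` and the plain-list data layout are hypothetical and are not taken from the paper's code.

```python
def cross_layer_pool(local_feats, guidance):
    """Pool local features using the next layer's activations as weights.

    local_feats: list of N feature vectors (one per spatial location), length D each.
    guidance:    list of N activation vectors from the following layer, length K each.
    Returns a flat list of length K * D: one weighted sum of local features
    per guidance channel, concatenated channel by channel.
    """
    if not local_feats or len(local_feats) != len(guidance):
        raise ValueError("both layers must cover the same, non-empty set of locations")
    d = len(local_feats[0])
    k = len(guidance[0])
    pooled = [[0.0] * d for _ in range(k)]
    for x, a in zip(local_feats, guidance):
        for ch in range(k):
            weight = a[ch]  # activation of guidance channel `ch` at this location
            for j in range(d):
                pooled[ch][j] += weight * x[j]
    # Concatenate the K pooled vectors into one image-level representation.
    return [value for row in pooled for value in row]


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 2.0]]   # two locations, D = 2
    guide = [[1.0, 0.0], [0.5, 1.0]]   # two locations, K = 2
    print(cross_layer_pool(feats, guide))
```

The output dimension grows as K times D, which is why a real pipeline would typically follow this step with some form of dimensionality reduction; that step is left out of the sketch.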

Image Captioning and Visual Question Answering Based on Attributes and External Knowledge

Dec 16, 2016

Qi Wu, Chunhua Shen, Anton van den Hengel, Peng Wang, Anthony Dick

Abstract: Much recent progress in Vision-to-Language problems has been achieved through a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). This approach does not explicitly represent high-level semantic concepts, but rather seeks to progress directly from image features to text. In this paper we first propose a method of incorporating high-level concepts into the successful CNN-RNN approach, and show that it achieves a significant improvement on the state of the art in both image captioning and visual question answering. We further show that the same mechanism can be used to incorporate external knowledge, which is critically important for answering high-level visual questions. Specifically, we design a visual question answering model that combines an internal representation of the content of an image with information extracted from a general knowledge base to answer a broad range of image-based questions. In particular, it allows questions to be asked about the contents of an image, even when the image itself does not contain a complete answer. Our final model achieves the best reported results on both image captioning and visual question answering on several benchmark datasets.

* 14 pages. arXiv admin note: text overlap with arXiv:1511.06973

The VQA-Machine: Learning How to Use Existing Vision Algorithms to Answer New Questions

Dec 16, 2016

Peng Wang, Qi Wu, Chunhua Shen, Anton van den Hengel

Abstract: One of the most intriguing features of the Visual Question Answering (VQA) challenge is the unpredictability of the questions. Extracting the information required to answer them demands a variety of image operations, from detection and counting to segmentation and reconstruction. To train a method to perform even one of these operations accurately from {image,question,answer} tuples would be challenging, but to aim to achieve them all with a limited set of such training data seems ambitious at best. We propose here instead a more general and scalable approach which exploits the fact that very good methods to achieve these operations already exist, and thus do not need to be trained. Our method thus learns how to exploit a set of external off-the-shelf algorithms to achieve its goal, an approach that has something in common with the Neural Turing Machine. The core of our proposed method is a new co-attention model. In addition, the proposed approach generates human-readable reasons for its decision, and can still be trained end-to-end without ground truth reasons being given. We demonstrate its effectiveness on two publicly available datasets, Visual Genome and VQA, and show that it produces state-of-the-art results in both cases.

From Motion Blur to Motion Flow: a Deep Learning Solution for Removing Heterogeneous Motion Blur

Dec 08, 2016

Dong Gong, Jie Yang, Lingqiao Liu, Yanning Zhang, Ian Reid, Chunhua Shen, Anton van den Hengel, Qinfeng Shi

Abstract: Removing pixel-wise heterogeneous motion blur is challenging due to the ill-posed nature of the problem. The predominant solution is to estimate the blur kernel by adding a prior, but the extensive literature on the subject indicates the difficulty in identifying a prior which is suitably informative and general. Rather than imposing a prior based on theory, we propose instead to learn one from the data. Learning a prior over the latent image would require modeling all possible image content. The critical observation underpinning our approach is thus that learning the motion flow instead allows the model to focus on the cause of the blur, irrespective of the image content. This is not only a much easier learning task, but it also avoids the iterative process through which latent image priors are typically applied. Our approach directly estimates the motion flow from the blurred image through a fully-convolutional deep neural network (FCN) and recovers the unblurred image from the estimated motion flow. Our FCN is the first universal end-to-end mapping from the blurred image to the dense motion flow. To train the FCN, we simulate motion flows to generate synthetic blurred-image-motion-flow pairs, thus avoiding the need for human labeling. Extensive experiments on challenging realistic blurred images demonstrate that the proposed method outperforms the state of the art.

Wider or Deeper: Revisiting the ResNet Model for Visual Recognition

Nov 30, 2016

Zifeng Wu, Chunhua Shen, Anton van den Hengel

Abstract: The trend towards increasingly deep neural networks has been driven by a general observation that increasing depth increases the performance of a network. Recently, however, evidence has been amassing that simply increasing depth may not be the best way to increase performance, particularly given other limitations. Investigations into deep residual networks have also suggested that they may not in fact be operating as a single deep network, but rather as an ensemble of many relatively shallow networks. We examine these issues, and in doing so arrive at a new interpretation of the unravelled view of deep residual networks which explains some of the behaviours that have been observed experimentally. As a result, we are able to derive a new, shallower, architecture of residual networks which significantly outperforms much deeper models such as ResNet-200 on the ImageNet classification dataset. We also show that this performance is transferable to other problem domains by developing a semantic segmentation approach which outperforms the state-of-the-art by a remarkable margin on datasets including PASCAL VOC, PASCAL Context, and Cityscapes. The architecture that we propose thus outperforms its comparators, including very deep ResNets, and yet is more efficient in memory use and sometimes also in training time. The code and models are available at https://github.com/itijyou/ademxapp

* Code available at: https://github.com/itijyou/ademxapp

Sequential Person Recognition in Photo Albums with a Recurrent Network

Nov 30, 2016

Yao Li, Guosheng Lin, Bohan Zhuang, Lingqiao Liu, Chunhua Shen, Anton van den Hengel

Abstract: Recognizing the identities of people in everyday photos is still a very challenging problem for machine vision, due to non-frontal faces and changes in clothing, location, lighting, and similar factors. Recent studies have shown that rich relational information between people in the same photo can help in recognizing their identities. In this work, we propose to model the relational information between people as a sequence prediction task. At the core of our work is a novel recurrent network architecture, in which relational information between instances' labels and appearance are modeled jointly. In addition to relational cues, scene context is incorporated in our sequence prediction model with no additional cost. In this sense, our approach is a unified framework for modeling both contextual cues and visual appearance of person instances. Our model is trained end-to-end with a sequence of annotated instances in a photo as inputs, and a sequence of corresponding labels as targets. We demonstrate that this simple but elegant formulation achieves state-of-the-art performance on the newly released People In Photo Albums (PIPA) dataset.

Attend in groups: a weakly-supervised deep learning framework for learning from web data

Nov 30, 2016

Bohan Zhuang, Lingqiao Liu, Yao Li, Chunhua Shen, Ian Reid

Abstract: Large-scale datasets have driven the rapid development of deep neural networks for visual recognition. However, annotating a massive dataset is expensive and time-consuming. Web images and their labels are, in comparison, much easier to obtain, but direct training on such automatically harvested images can lead to unsatisfactory performance, because the noisy labels of Web images adversely affect the learned recognition models. To address this drawback we propose an end-to-end weakly-supervised deep learning framework which is robust to the label noise in Web images. The proposed framework relies on two unified strategies -- random grouping and attention -- to effectively reduce the negative impact of noisy web image annotations. Specifically, random grouping stacks multiple images into a single training instance and thus increases the labeling accuracy at the instance level. Attention, on the other hand, suppresses the noisy signals from both incorrectly labeled images and less discriminative image regions. Intensive experiments on two challenging datasets, including a newly collected fine-grained dataset with Web images of different car models, clearly demonstrate the superior performance of the proposed method over competitive baselines.
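
The claim that random grouping raises labeling accuracy at the instance level can be made concrete with a small sketch. It rests on two assumptions that the abstract does not spell out: a group counts as correctly labeled when at least one of its images truly belongs to the class, and label errors are independent across images. The helper names `group_label_accuracy` and `random_groups` are hypothetical.

```python
import random


def group_label_accuracy(noise_rate, group_size):
    """Probability that a group contains at least one correctly labeled image,
    assuming each image label is wrong independently with rate `noise_rate`."""
    return 1.0 - noise_rate ** group_size


def random_groups(image_ids, group_size, seed=0):
    """Shuffle the images of one class and cut them into disjoint groups.
    Leftover images that do not fill a whole group are dropped."""
    rng = random.Random(seed)
    ids = list(image_ids)
    rng.shuffle(ids)
    return [ids[i:i + group_size]
            for i in range(0, len(ids) - group_size + 1, group_size)]


if __name__ == "__main__":
    for size in (1, 2, 3):
        print(size, group_label_accuracy(0.5, size))
    print(random_groups(range(7), 3))
```

Under these assumptions, a per-image noise rate of 0.5 gives a group-level accuracy of 0.875 for groups of three, which is the effect the abstract describes; real web label noise need not be independent, so the figure is illustrative only.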

RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation

Nov 25, 2016

Guosheng Lin, Anton Milan, Chunhua Shen, Ian Reid

Abstract: Recently, very deep convolutional neural networks (CNNs) have shown outstanding performance in object recognition and have also been the first choice for dense classification problems such as semantic segmentation. However, repeated subsampling operations like pooling or convolution striding in deep CNNs lead to a significant decrease in the initial image resolution. Here, we present RefineNet, a generic multi-path refinement network that explicitly exploits all the information available along the down-sampling process to enable high-resolution prediction using long-range residual connections. In this way, the deeper layers that capture high-level semantic features can be directly refined using fine-grained features from earlier convolutions. The individual components of RefineNet employ residual connections following the identity mapping mindset, which allows for effective end-to-end training. Further, we introduce chained residual pooling, which captures rich background context in an efficient manner. We carry out comprehensive experiments and set new state-of-the-art results on seven public datasets. In particular, we achieve an intersection-over-union score of 83.4 on the challenging PASCAL VOC 2012 dataset, which is the best reported result to date.
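
For readers unfamiliar with the intersection-over-union score quoted above, the following is a minimal sketch of how a mean IoU can be computed from per-pixel class labels. It works on flat label lists for a single image and skips classes absent from both prediction and ground truth; benchmark evaluations typically accumulate counts over a whole dataset and report the result as a percentage, so this is an illustration rather than the official scoring procedure. The function name `mean_iou` is hypothetical.

```python
def mean_iou(pred, truth, num_classes):
    """Mean intersection-over-union across classes.

    pred, truth: equal-length sequences of integer class labels, one per pixel.
    Classes that appear in neither sequence are left out of the average.
    """
    if len(pred) != len(truth):
        raise ValueError("prediction and ground truth must have the same length")
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, truth) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, truth) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    prediction = [0, 0, 1, 1]
    ground_truth = [0, 1, 1, 1]
    print(mean_iou(prediction, ground_truth, num_classes=2))
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.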
