Ling Shao

Terminus Group, Beijing, China

Dynamically Visual Disambiguation of Keyword-based Image Search

May 27, 2019

Yazhou Yao, Zeren Sun, Fumin Shen, Li Liu, Limin Wang, Fan Zhu, Lizhong Ding, Gangshan Wu, Ling Shao

Abstract: Due to the high cost of manual annotation, learning directly from the web has attracted broad attention. One issue that limits the performance of methods that learn from the web is visual polysemy. To address this issue, we present an adaptive multi-model framework that resolves polysemy by visual disambiguation. Compared to existing methods, the primary advantage of our approach is that it can adapt to dynamic changes in the search results. Our proposed framework consists of two major steps: we first discover and dynamically select the text queries according to the image search results, then we employ the proposed saliency-guided deep multi-instance learning network to remove outliers and learn classification models for visual disambiguation. Extensive experiments demonstrate the superiority of our proposed approach.

* Accepted by International Joint Conference on Artificial Intelligence (IJCAI), 2019

Out-of-Distribution Detection for Generalized Zero-Shot Action Recognition

May 06, 2019

Devraj Mandal, Sanath Narayan, Saikumar Dwivedi, Vikram Gupta, Shuaib Ahmed, Fahad Shahbaz Khan, Ling Shao

Abstract: Generalized zero-shot action recognition is a challenging problem, where the task is to recognize new action categories that are unavailable during the training stage, in addition to the seen action categories. Existing approaches suffer from the inherent bias of the learned classifier towards the seen action categories. As a consequence, unseen category samples are incorrectly classified as belonging to one of the seen action categories. In this paper, we set out to tackle this issue by arguing for a separate treatment of seen and unseen action categories in generalized zero-shot action recognition. We introduce an out-of-distribution detector that determines whether the video features belong to a seen or unseen action category. To train our out-of-distribution detector, video features for unseen action categories are synthesized using generative adversarial networks trained on seen action category features. To the best of our knowledge, we are the first to propose a generalized zero-shot learning (GZSL) framework based on an out-of-distribution detector for action recognition in videos. Experiments are performed on three action recognition datasets: Olympic Sports, HMDB51 and UCF101. For generalized zero-shot action recognition, our proposed approach outperforms the baseline (f-CLSWGAN) with absolute gains in classification accuracy of 7.0%, 3.4%, and 4.9%, respectively, on these datasets.

* 10 pages, 3 figures, 6 tables. To appear in the proceedings of CVPR 2019
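
The separate treatment of seen and unseen categories described in this abstract can be pictured as a routing step. The following is a minimal sketch, not the authors' implementation: the function names, the threshold of 0.5, and the toy scoring rule are all hypothetical, and the GAN-based feature synthesis used to train the detector is not covered.

```python
from typing import Callable, Sequence

Features = Sequence[float]


def route_prediction(
    features: Features,
    ood_score: Callable[[Features], float],
    classify_seen: Callable[[Features], str],
    classify_unseen: Callable[[Features], str],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> str:
    """Send a sample to the seen or unseen classifier based on an OOD score."""
    if ood_score(features) > threshold:
        # Scored as out-of-distribution: treat as an unseen category.
        return classify_unseen(features)
    return classify_seen(features)


# Hypothetical stand-ins so the sketch runs; real components would be trained models.
def toy_ood_score(features: Features) -> float:
    magnitude = abs(features[0])
    return magnitude / (1.0 + magnitude)  # squashes the magnitude into [0, 1)


def toy_seen_classifier(features: Features) -> str:
    return "seen-class"


def toy_unseen_classifier(features: Features) -> str:
    return "unseen-class"


def predict(features: Features) -> str:
    return route_prediction(
        features, toy_ood_score, toy_seen_classifier, toy_unseen_classifier
    )
```

With these toy components, a feature vector close to zero is routed to the seen classifier and a large one to the unseen classifier, so the seen-class classifier never gets the chance to absorb samples the detector flags as unseen.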

Interpretable and Generalizable Deep Image Matching with Adaptive Convolutions

Apr 23, 2019

Shengcai Liao, Ling Shao

Abstract: For image matching tasks, like face recognition and person re-identification, existing deep networks often focus on representation learning. However, without domain adaptation or transfer learning, the learned model is fixed and cannot adapt to various unseen scenarios. In this paper, beyond representation learning, we consider how to formulate image matching directly in deep feature maps. We treat image matching as finding local correspondences in feature maps, and construct adaptive convolution kernels on the fly to achieve local matching. In this way, the matching process and result are interpretable, and this explicit matching is more generalizable than representation features to unseen scenarios, such as unknown misalignments, pose or viewpoint changes. To facilitate end-to-end training of such an image matching architecture, we further build a class memory module to cache feature maps of the most recent samples of each class, so as to compute image matching losses for metric learning. The proposed method is preliminarily validated on the person re-identification task. Through direct cross-dataset evaluation without further transfer learning, it achieves better results than many transfer learning methods. In addition, a model-free score weighting method based on temporal co-occurrence is proposed, which further improves performance, resulting in state-of-the-art results in cross-dataset evaluation.
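
One way to read "finding local correspondences in feature maps" is sketched below. This is an illustrative assumption rather than the paper's architecture: each local feature of the query map is treated as a 1x1 kernel, applied at every location of the gallery map, and the best response is kept. The names `local_match_score`, `FeatureMap`, and the use of cosine similarity are hypothetical choices.

```python
import math
from typing import List

Vector = List[float]
FeatureMap = List[Vector]  # one feature vector per spatial location (flattened)


def normalize(v: Vector) -> Vector:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def local_match_score(query: FeatureMap, gallery: FeatureMap) -> float:
    """Average, over query locations, of the best cosine response in the gallery."""
    gallery_n = [normalize(g) for g in gallery]
    best_responses = []
    for q in query:
        kernel = normalize(q)  # query location used as a 1x1 kernel
        responses = [sum(a * b for a, b in zip(kernel, g)) for g in gallery_n]
        best_responses.append(max(responses))  # best local correspondence
    return sum(best_responses) / len(best_responses)


query_map = [[1.0, 0.0], [0.0, 1.0]]
shifted_map = [[0.0, 2.0], [3.0, 0.0]]    # same content at swapped locations
different_map = [[1.0, 1.0], [1.0, 1.0]]  # no location matches exactly

score_shifted = local_match_score(query_map, shifted_map)
score_different = local_match_score(query_map, different_map)
```

Because the maximum is taken over all gallery locations, the shifted map still scores as a full match, which hints at why explicit local matching can tolerate misalignment, and the location of each maximum shows which correspondence produced the score.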

Learning Digital Camera Pipeline for Extreme Low-Light Imaging

Apr 11, 2019

Syed Waqas Zamir, Aditya Arora, Salman Khan, Fahad Shahbaz Khan, Ling Shao

Abstract: In low-light conditions, a conventional camera imaging pipeline produces sub-optimal images that are usually dark and noisy due to a low photon count and low signal-to-noise ratio (SNR). We present a data-driven approach that learns the desired properties of well-exposed images and reflects them in images that are captured in extremely low ambient light environments, thereby significantly improving the visual quality of these low-light images. We propose a new loss function that exploits the characteristics of both pixel-wise and perceptual metrics, enabling our deep neural network to learn the camera processing pipeline to transform the short-exposure, low-light RAW sensor data to well-exposed sRGB images. The results show that our method outperforms the state of the art according to psychophysical tests as well as pixel-wise standard metrics and recent learning-based perceptual image quality measures.

Adversarial Defense by Restricting the Hidden Space of Deep Neural Networks

Apr 07, 2019

Aamir Mustafa, Salman Khan, Munawar Hayat, Roland Goecke, Jianbing Shen, Ling Shao

Abstract: Deep neural networks are vulnerable to adversarial attacks, which can fool them by adding minuscule perturbations to the input images. The robustness of existing defenses suffers greatly under white-box attack settings, where an adversary has full knowledge about the network and can iterate several times to find strong perturbations. We observe that the main reason for the existence of such perturbations is the close proximity of different class samples in the learned feature space. This allows model decisions to be completely changed by adding an imperceptible perturbation to the inputs. To counter this, we propose to class-wise disentangle the intermediate feature representations of deep networks. Specifically, we force the features for each class to lie inside a convex polytope that is maximally separated from the polytopes of other classes. In this manner, the network is forced to learn distinct and distant decision regions for each class. We observe that this simple constraint on the features greatly enhances the robustness of learned models, even against the strongest white-box attacks, without degrading the classification performance on clean images. We report extensive evaluations in both black-box and white-box attack scenarios and show significant gains in comparison to state-of-the-art defenses.

Iterative Normalization: Beyond Standardization towards Efficient Whitening

Apr 06, 2019

Lei Huang, Yi Zhou, Fan Zhu, Li Liu, Ling Shao

Abstract: Batch Normalization (BN) is ubiquitously employed for accelerating neural network training and improving the generalization capability by performing standardization within mini-batches. Decorrelated Batch Normalization (DBN) further boosts the above effectiveness by whitening. However, DBN relies heavily on either a large batch size, or eigen-decomposition that suffers from poor efficiency on GPUs. We propose Iterative Normalization (IterNorm), which employs Newton's iterations for much more efficient whitening, while simultaneously avoiding the eigen-decomposition. Furthermore, we develop a comprehensive study to show that IterNorm achieves a better trade-off between optimization and generalization, with theoretical and experimental support. To this end, we introduce Stochastic Normalization Disturbance (SND), which measures the inherent stochastic uncertainty of samples when applied to normalization operations. With the support of SND, we provide natural explanations for several phenomena from the perspective of optimization, e.g., why group-wise whitening of DBN generally outperforms full-whitening and why the accuracy of BN degenerates with reduced batch sizes. We demonstrate the consistently improved performance of IterNorm with extensive experiments on CIFAR-10 and ImageNet over BN and DBN.

* Accepted to CVPR 2019. The code is available at https://github.com/huangleiBuaa/IterNorm
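
The abstract states only that Newton's iterations replace the eigen-decomposition. As a rough illustration, the sketch below uses a standard Newton-style iteration for the inverse square root of a covariance matrix, applied after normalizing the matrix by its trace so the iteration converges. The exact update, the iteration count, and the helper names are assumptions for illustration; the linked repository is the reference for the actual method.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product for small dense matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def identity(n: int) -> Matrix:
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def newton_inverse_sqrt(sigma: Matrix, iterations: int = 10) -> Matrix:
    """Approximate sigma^(-1/2) for a symmetric positive definite matrix."""
    n = len(sigma)
    trace = sum(sigma[i][i] for i in range(n))
    # Trace normalization keeps all eigenvalues in (0, 1], which the iteration needs.
    sigma_n = [[v / trace for v in row] for row in sigma]
    p = identity(n)
    for _ in range(iterations):
        p_cubed_sigma = matmul(matmul(matmul(p, p), p), sigma_n)
        # Update: P <- (3 P - P^3 Sigma_N) / 2
        p = [
            [0.5 * (3.0 * p[i][j] - p_cubed_sigma[i][j]) for j in range(n)]
            for i in range(n)
        ]
    scale = trace ** -0.5  # undo the trace normalization
    return [[v * scale for v in row] for row in p]


covariance = [[2.0, 0.5], [0.5, 1.0]]
whitening = newton_inverse_sqrt(covariance, iterations=20)
# If the iteration has converged, whitening * covariance * whitening is the identity.
check = matmul(matmul(whitening, covariance), whitening)
```

The loop uses only matrix multiplications, which is the kind of operation that runs efficiently on GPUs, in contrast to an eigen-decomposition.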

Few-Shot Deep Adversarial Learning for Video-based Person Re-identification

Mar 29, 2019

Lin Wu, Yang Wang, Hongzhi Yin, Meng Wang, Ling Shao, B. C. Lovell

Abstract: Video-based person re-identification (re-ID) refers to matching people across camera views from arbitrary unaligned video footage. Existing methods rely on supervision signals to optimise a projected space under which inter-video distances are maximised and intra-video distances are minimised. However, this demands exhaustively labelling people across camera views, which prevents these methods from scaling to large camera networks. Moreover, learning effective view-invariant video representations is not explicitly addressed; without view invariance, features exhibit different distributions. Thus, matching videos for person re-ID demands flexible models that capture the dynamics in time-series observations and learn view-invariant representations with access to limited labeled training samples. In this paper, we propose a novel few-shot deep learning approach to video-based person re-ID, to learn comparable representations that are discriminative and view-invariant. The proposed method is built on variational recurrent neural networks (VRNNs) and trained adversarially to produce latent variables with temporal dependencies that are highly discriminative yet view-invariant in matching persons. Through extensive experiments conducted on three benchmark datasets, we empirically show that our method creates view-invariant temporal features and achieves state-of-the-art performance.

* Minor Revision for IEEE Trans. Image Processing

Pixel-aware Deep Function-mixture Network for Spectral Super-Resolution

Mar 24, 2019

Lei Zhang, Zhiqiang Lang, Peng Wang, Wei Wei, Shengcai Liao, Ling Shao, Yanning Zhang

Abstract: Spectral super-resolution (SSR) aims at generating a hyperspectral image (HSI) from a given RGB image. Recently, a promising direction for SSR is to learn a complicated mapping function from the RGB image to the HSI counterpart using a deep convolutional neural network. This essentially involves mapping the RGB context within a size-specific receptive field centered at each pixel to its spectrum in the HSI. The focus is therefore on appropriately determining the receptive field size and establishing the mapping function from RGB context to the corresponding spectrum. Due to their differences in category or spatial position, pixels in HSIs often require different-sized receptive fields and distinct mapping functions. However, little effort has been made to explicitly exploit this prior. To address this problem, we propose a pixel-aware deep function-mixture network for SSR, which is composed of a new class of modules, termed function-mixture (FM) blocks. Each FM block is equipped with some basis functions, i.e., parallel subnets of different-sized receptive fields. In addition, it incorporates an extra subnet as a mixing function to generate pixel-wise weights, and then linearly mixes the outputs of all basis functions with those generated weights. This enables us to determine the receptive field size and the mapping function pixel-wise. Moreover, we stack several such FM blocks to further increase the flexibility of the network in learning the pixel-wise mapping. To encourage feature reuse, intermediate features generated by the FM blocks are fused at a late stage, which proves to be effective for boosting the SSR performance. Experimental results on three benchmark HSI datasets demonstrate the superiority of the proposed method.
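
The mixing step of an FM block, as described above, can be illustrated for a single pixel. This is a minimal sketch under stated assumptions: the basis outputs and mixing scores are given as plain lists rather than produced by subnets, and the softmax normalization of the weights is a hypothetical choice that the abstract does not specify.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax (assumed normalization of the mixing weights)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_pixel(basis_outputs: List[List[float]], mixing_scores: List[float]) -> List[float]:
    """Linearly mix the per-pixel outputs of all basis functions.

    basis_outputs[k] is the spectrum predicted for this pixel by basis function k;
    mixing_scores[k] is the mixing function's raw score for that basis function.
    """
    weights = softmax(mixing_scores)
    n_bands = len(basis_outputs[0])
    return [
        sum(w * out[band] for w, out in zip(weights, basis_outputs))
        for band in range(n_bands)
    ]


# Two hypothetical basis functions (e.g. small and large receptive field), two bands.
outputs = [[1.0, 2.0], [3.0, 4.0]]
balanced = mix_pixel(outputs, [0.0, 0.0])       # equal weights: plain average
small_field = mix_pixel(outputs, [50.0, 0.0])   # weight concentrated on the first
```

Since the scores differ from pixel to pixel, one pixel can rely almost entirely on a single basis function while its neighbour blends several, which is how the mapping function and effective receptive field become pixel-dependent.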

Object Counting and Instance Segmentation with Image-level Supervision

Mar 06, 2019

Hisham Cholakkal, Guolei Sun, Fahad Shahbaz Khan, Ling Shao

Abstract: Common object counting in a natural scene is a challenging problem in computer vision with numerous real-world applications. Existing image-level supervised common object counting approaches only predict the global object count and rely on additional instance-level supervision to also determine object locations. We propose an image-level supervised approach that provides both the global object count and the spatial distribution of object instances by constructing an object category density map. Motivated by psychological studies, we further reduce image-level supervision using limited object count information (up to four). To the best of our knowledge, we are the first to propose image-level supervised density map estimation for common object counting and demonstrate its effectiveness in image-level supervised instance segmentation. Comprehensive experiments are performed on the PASCAL VOC and COCO datasets. Our approach outperforms existing methods, including those using instance-level supervision, on both datasets for common object counting. Moreover, our approach improves state-of-the-art image-level supervised instance segmentation with a relative gain of 17.8% in terms of average best overlap on the PASCAL VOC 2012 dataset.

* To appear in CVPR 2019

Crowd Counting and Density Estimation by Trellis Encoder-Decoder Network

Mar 03, 2019

Xiaolong Jiang, Zehao Xiao, Baochang Zhang, Xiantong Zhen, Xianbin Cao, David Doermann, Ling Shao

Abstract: Crowd counting has recently attracted increasing interest in computer vision but remains a challenging problem. In this paper, we propose a trellis encoder-decoder network (TEDnet) for crowd counting, which focuses on generating high-quality density estimation maps. The major contributions are four-fold. First, we develop a new trellis architecture that incorporates multiple decoding paths to hierarchically aggregate features at different encoding stages, which can handle large variations of objects. Second, we design dense skip connections interleaved across paths to facilitate sufficient multi-scale feature fusion and to absorb the supervision information. Third, we propose a new combinatorial loss to enforce local coherence and spatial correlation in density maps. By imposing this combinatorial loss on intermediate outputs in a distributed manner, vanishing gradients can be largely alleviated for better back-propagation and faster convergence. Finally, our TEDnet achieves new state-of-the-art performance on four benchmarks, with an improvement of up to 14% in terms of MAE.
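
For readers unfamiliar with density-based counting, the sketch below shows how a count is usually read off a density map and how mean absolute error (MAE) is computed over a set of images. It assumes the common convention that a density map sums to the number of objects, which the abstract does not state explicitly, and the function names and values are made up for illustration.

```python
from typing import List, Sequence


def count_from_density(density_map: List[List[float]]) -> float:
    """Estimated count: the sum of all density values (assumed convention)."""
    return sum(sum(row) for row in density_map)


def mean_absolute_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Average absolute difference between predicted and true counts."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Toy 2x2 density map whose mass adds up to two objects.
toy_density = [[0.5, 0.5], [1.0, 0.0]]
toy_count = count_from_density(toy_density)

# Toy evaluation over two images with made-up counts.
toy_mae = mean_absolute_error([toy_count, 5.0], [3.0, 5.0])
```

A lower MAE means predicted counts are closer to the ground-truth counts on average, so a relative improvement in MAE corresponds to a proportional reduction of this value.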
