Liang Lin

FRAME Revisited: An Interpretation View Based on Particle Evolution

Dec 04, 2018
Xu Cai, Yang Wu, Guanbin Li, Ziliang Chen, Liang Lin

FRAME (Filters, Random fields, And Maximum Entropy) is an energy-based descriptive model that synthesizes visually realistic images by capturing mutual patterns from structured input signals. Maximum likelihood estimation (MLE) is applied by default, yet it conventionally causes unstable training energy that wrecks the generated structures, a phenomenon that has remained unexplained. In this paper, we provide a new theoretical insight for analyzing FRAME from a perspective of particle physics, ascribing this phenomenon to the KL-vanishing issue. To stabilize the energy dissipation, we propose an alternative Wasserstein distance in discrete time, based on the conclusion that the Jordan-Kinderlehrer-Otto (JKO) discrete flow approximates the KL discrete flow when the time step size tends to 0. Moreover, this metric still maintains the model's statistical consistency. Quantitative and qualitative experiments on several widely used datasets demonstrate the effectiveness and superiority of our method.
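
For background (this is a standard formulation, not an excerpt from the paper), the JKO scheme mentioned in the abstract discretizes a gradient flow of an energy functional F in Wasserstein space; one common statement of the discrete step is

$$
\rho_{k+1} = \arg\min_{\rho} \Big\{ \tfrac{1}{2\tau} W_2^2(\rho, \rho_k) + F(\rho) \Big\},
$$

where $\tau > 0$ is the step size and $W_2$ is the 2-Wasserstein distance. Letting $\tau \to 0$ recovers the continuous flow, which is the regime in which the abstract states the JKO discrete flow approximates the KL discrete flow.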

FANet: Quality-Aware Feature Aggregation Network for RGB-T Tracking

Nov 24, 2018
Yabin Zhu, Chenglong Li, Yijuan Lu, Liang Lin, Bin Luo, Jin Tang

This paper investigates how to perform robust visual tracking in adverse and challenging conditions using complementary visual and thermal infrared data (RGB-T tracking). We propose a novel deep network architecture, the quality-aware Feature Aggregation Network (FANet), that performs quality-aware aggregation of both hierarchical features and multimodal information for robust online RGB-T tracking. Unlike existing works that directly concatenate hierarchical deep features, our FANet learns layer weights to adaptively aggregate them within each modality, handling the significant appearance changes caused by deformation, abrupt motion, background clutter and occlusion. Moreover, we employ max pooling, interpolation upsampling and convolution to transform these hierarchical, multi-resolution features into a uniform space at the same resolution, enabling more effective feature aggregation. Across modalities, we design a multimodal aggregation sub-network that integrates all modalities collaboratively based on their predicted reliability degrees. Extensive experiments on large-scale benchmark datasets demonstrate that our FANet significantly outperforms other state-of-the-art RGB-T tracking methods.
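
A minimal sketch of the hierarchical aggregation idea described above, assuming a PyTorch setting; the module, tensor shapes and default sizes are our own illustration, not the authors' released FANet code:

```python
# Hypothetical sketch of quality-aware hierarchical feature aggregation,
# illustrative only -- not the released FANet implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalAggregation(nn.Module):
    def __init__(self, num_layers=3, channels=64):
        super().__init__()
        # One learnable logit per hierarchical layer; softmax gives layer weights.
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))
        self.fuse = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, feats, out_size=(32, 32)):
        # feats: list of feature maps [B, C, Hi, Wi] at different resolutions.
        resized = []
        for f in feats:
            if f.shape[-1] > out_size[-1]:
                # Higher-resolution map: downsample with max pooling.
                f = F.adaptive_max_pool2d(f, out_size)
            elif f.shape[-1] < out_size[-1]:
                # Lower-resolution map: upsample by interpolation.
                f = F.interpolate(f, size=out_size, mode='bilinear',
                                  align_corners=False)
            resized.append(f)
        weights = torch.softmax(self.layer_logits, dim=0)
        agg = sum(w * f for w, f in zip(weights, resized))
        return self.fuse(agg)

# Usage with random features from three hypothetical backbone layers.
feats = [torch.randn(1, 64, s, s) for s in (64, 32, 16)]
out = HierarchicalAggregation()(feats)   # [1, 64, 32, 32]
```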

Unsupervised Domain Adaptation: An Adaptive Feature Norm Approach

Nov 19, 2018
Ruijia Xu, Guanbin Li, Jihan Yang, Liang Lin

Unsupervised domain adaptation aims to mitigate the domain shift when transferring knowledge from a supervised source domain to an unsupervised target domain. Adversarial feature alignment has been successfully explored to minimize the domain discrepancy. However, existing methods usually struggle to optimize mixed learning objectives and are vulnerable to negative transfer when the two domains do not share an identical label space. In this paper, we empirically reveal that the erratic discrimination on the target domain is mainly reflected in its much lower feature norms compared with those of the source domain. We present a non-parametric Adaptive Feature Norm (AFN) approach, which is independent of the association between the label spaces of the two domains. We demonstrate that adapting the feature norms of the source and target domains to achieve equilibrium over a large range of values can result in significant domain-transfer gains. Without bells and whistles, and with only a few lines of code, our method largely lifts the discrimination on the target domain (23.7% over the Source Only baseline on VisDA2017) and achieves a new state of the art under the vanilla setting. Furthermore, as our approach does not require deliberately aligning the feature distributions, it is robust to negative transfer and outperforms existing approaches under the partial setting by an extremely large margin (9.8% on Office-Home and 14.1% on VisDA2017). Code is available at https://github.com/jihanyang/AFN. We are responsible for the reproducibility of our method.
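
A rough sketch of a feature-norm adaptation term in the spirit of the abstract, assuming PyTorch; the function name and the step size delta_r are illustrative, and the official implementation at https://github.com/jihanyang/AFN should be consulted for the authors' exact formulation:

```python
# Rough sketch of a stepwise feature-norm enlargement term (illustrative only).
import torch

def stepwise_feature_norm_loss(features, delta_r=1.0):
    """Encourage each sample's feature norm to grow by a small step delta_r,
    pushing both domains toward large, comparable norms."""
    norms = features.norm(p=2, dim=1)        # per-sample L2 norm
    target = (norms + delta_r).detach()      # enlarged target, no gradient
    return torch.mean((norms - target) ** 2)

# Applied to a batch of (source + target) features from the backbone.
feats = torch.randn(8, 256, requires_grad=True)
loss = stepwise_feature_norm_loss(feats)
loss.backward()
```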

Cross-Modal Attentional Context Learning for RGB-D Object Detection

Oct 30, 2018
Guanbin Li, Yukang Gan, Hejun Wu, Nong Xiao, Liang Lin

Recognizing objects from simultaneously sensed photometric (RGB) and depth channels is a fundamental yet practical problem in many machine vision applications such as robot grasping and autonomous driving. In this paper, we address this problem by developing a Cross-Modal Attentional Context (CMAC) learning framework, which enables full exploitation of the context information from both RGB and depth data. Compared to existing RGB-D object detection frameworks, our approach has several appealing properties. First, it consists of an attention-based global context model that exploits adaptive contextual information and incorporates it into a region-based CNN framework (e.g., Fast RCNN) to improve object detection performance. Second, our CMAC framework further contains a fine-grained object part attention module that harnesses multiple discriminative object parts inside each candidate object region for superior local feature representation. In addition to greatly improving the accuracy of RGB-D object detection, the effective cross-modal information fusion and attentional context modeling in our model also provide an interpretable visualization scheme. Experimental results demonstrate that the proposed method significantly improves upon the state of the art on all public benchmarks.
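
A schematic illustration, in PyTorch, of how an attention-pooled global context vector could be appended to per-region features in a region-based detector; the module and shapes below are hypothetical and are not taken from the CMAC implementation:

```python
# Schematic illustration of attention-pooled global context appended to
# per-region features (not the authors' implementation).
import torch
import torch.nn as nn

class AttentionContext(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)   # spatial attention map

    def forward(self, feature_map, region_feats):
        # feature_map: [B, C, H, W]; region_feats: [B, R, C] pooled ROI features.
        attn = torch.softmax(self.score(feature_map).flatten(2), dim=-1)   # [B,1,HW]
        context = torch.bmm(attn, feature_map.flatten(2).transpose(1, 2))  # [B,1,C]
        context = context.expand(-1, region_feats.size(1), -1)
        # Concatenate the shared global context onto every region feature.
        return torch.cat([region_feats, context], dim=-1)                  # [B,R,2C]

ctx = AttentionContext()
fused = ctx(torch.randn(2, 256, 20, 20), torch.randn(2, 10, 256))   # [2, 10, 512]
```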

* Accept as a regular paper to IEEE Transactions on Image Processing 

Hybrid Knowledge Routed Modules for Large-scale Object Detection

Oct 30, 2018
Chenhan Jiang, Hang Xu, Xiaodan Liang, Liang Lin

Dominant object detection approaches treat the recognition of each region separately and overlook crucial semantic correlations between objects in one scene. This paradigm leads to a substantial performance drop when facing heavy long-tail problems, where very few samples are available for rare classes and plenty of confusing categories exist. We exploit diverse human commonsense knowledge for reasoning over large-scale object categories and for reaching semantic coherency within one image. In particular, we present Hybrid Knowledge Routed Modules (HKRM) that incorporate reasoning routed by two kinds of knowledge forms: an explicit knowledge module for structured constraints that are summarized with linguistic knowledge (e.g., shared attributes, relationships) about concepts, and an implicit knowledge module that depicts implicit constraints (e.g., common spatial layouts). By functioning over a region-to-region graph, both modules can be individualized and adapted to coordinate with the visual patterns in each image, guided by specific knowledge forms. HKRM is lightweight, general-purpose and extensible: it can easily incorporate multiple kinds of knowledge to endow any detection network with the ability of global semantic reasoning. Experiments on large-scale object detection benchmarks show HKRM obtains around a 34.5% improvement on VisualGenome (1000 categories) and 30.4% on ADE in terms of mAP. Code and trained models can be found at https://github.com/chanyn/HKRM.
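
A toy sketch of knowledge-routed message passing over a region-to-region graph, assuming PyTorch; the adjacency matrix stands in for a graph built from one knowledge form, and none of the names below come from the HKRM repository:

```python
# Toy sketch of message passing over a region-to-region graph (our illustration
# of the general idea, not the HKRM code in the linked repo).
import torch
import torch.nn as nn

class KnowledgeRoutedModule(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.transform = nn.Linear(dim, dim)

    def forward(self, region_feats, adjacency):
        # region_feats: [R, D] features of R proposals in one image.
        # adjacency:    [R, R] graph built from some knowledge form
        #               (e.g. attribute similarity or spatial layout).
        adjacency = adjacency / adjacency.sum(dim=1, keepdim=True).clamp(min=1e-6)
        messages = adjacency @ self.transform(region_feats)   # aggregate neighbours
        return region_feats + torch.relu(messages)            # residual update

module = KnowledgeRoutedModule()
feats = torch.randn(5, 256)
adj = torch.rand(5, 5)          # placeholder knowledge graph
enhanced = module(feats, adj)   # [5, 256]
```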

* 9 pages, 5 figures 

Learning Deep Representations for Semantic Image Parsing: a Comprehensive Overview

Oct 10, 2018
Lili Huang, Jiefeng Peng, Ruimao Zhang, Guanbin Li, Liang Lin

Semantic image parsing, which refers to the process of decomposing images into semantic regions and constructing the structure representation of the input, has recently aroused widespread interest in the field of computer vision. The recent application of deep representation learning has driven this field into a new stage of development. In this paper, we summarize three aspects of the progress of research on semantic image parsing, i.e., category-level semantic segmentation, instance-level semantic segmentation, and beyond segmentation. Specifically, we first review the general frameworks for each task and introduce the relevant variants. The advantages and limitations of each method are also discussed. Moreover, we present a comprehensive comparison of different benchmark datasets and evaluation metrics. Finally, we explore the future trends and challenges of semantic image parsing.

PIRM Challenge on Perceptual Image Enhancement on Smartphones: Report

Oct 03, 2018
Andrey Ignatov, Radu Timofte, Thang Van Vu, Tung Minh Luu, Trung X Pham, Cao Van Nguyen, Yongwoo Kim, Jae-Seok Choi, Munchurl Kim, Jie Huang, Jiewen Ran, Chen Xing, Xingguang Zhou, Pengfei Zhu, Mingrui Geng, Yawei Li, Eirikur Agustsson, Shuhang Gu, Luc Van Gool, Etienne de Stoutz, Nikolay Kobyshev, Kehui Nie, Yan Zhao, Gen Li, Tong Tong, Qinquan Gao, Liu Hanwen, Pablo Navarrete Michelini, Zhu Dan, Hu Fengshuo, Zheng Hui, Xiumei Wang, Lirui Deng, Rang Meng, Jinghui Qin, Yukai Shi, Wushao Wen, Liang Lin, Ruicheng Feng, Shixiang Wu, Chao Dong, Yu Qiao, Subeesh Vasu, Nimisha Thekke Madam, Praveen Kandula, A. N. Rajagopalan, Jie Liu, Cheolkon Jung

This paper reviews the first challenge on efficient perceptual image enhancement with the focus on deploying deep learning models on smartphones. The challenge consisted of two tracks. In the first one, participants were solving the classical image super-resolution problem with a bicubic downscaling factor of 4. The second track was aimed at real-world photo enhancement, and the goal was to map low-quality photos from the iPhone 3GS device to the same photos captured with a DSLR camera. The target metric used in this challenge combined the runtime, PSNR scores and solutions' perceptual results measured in the user study. To ensure the efficiency of the submitted models, we additionally measured their runtime and memory requirements on Android smartphones. The proposed solutions significantly improved baseline results defining the state-of-the-art for image enhancement on smartphones.
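
For reference, the PSNR component of such a combined metric follows the standard definition; the snippet below is a generic implementation, not the challenge's official scoring script:

```python
# Generic PSNR computation between a reference image and an enhanced estimate.
import numpy as np

def psnr(reference, estimate, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two images of the same shape."""
    mse = np.mean((reference.astype(np.float64) - estimate.astype(np.float64)) ** 2)
    if mse == 0:
        return float('inf')
    return 10.0 * np.log10(max_val ** 2 / mse)

a = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
b = np.clip(a + np.random.randint(-5, 6, a.shape), 0, 255).astype(np.uint8)
print(round(psnr(a, b), 2))
```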

Cost-effective Object Detection: Active Sample Mining with Switchable Selection Criteria

Sep 16, 2018
Keze Wang, Liang Lin, Xiaopeng Yan, Ziliang Chen, Dongyu Zhang, Lei Zhang

Though quite challenging, the training of object detectors using large-scale unlabeled or partially labeled datasets has attracted increasing interest from researchers due to its fundamental importance for applications of neural networks and learning systems. To address this problem, many active learning (AL) methods have been proposed that employ up-to-date detectors to retrieve representative minority samples according to predefined confidence or uncertainty thresholds. However, these AL methods cause the detectors to ignore the remaining majority samples (i.e., those with low uncertainty or high prediction confidence). In this work, by developing a principled active sample mining (ASM) framework, we demonstrate that cost-effectively mining samples from these unlabeled majority data is key to training more powerful object detectors while minimizing user effort. Specifically, our ASM framework involves a selectively switchable sample selection mechanism for determining whether an unlabeled sample should be manually annotated via AL or automatically pseudo-labeled via a novel self-learning process. The proposed process is compatible with mini-batch-based training (i.e., using a batch of unlabeled or partially labeled data as a one-time input) for object detection. Extensive experiments on two public benchmarks clearly demonstrate that our ASM framework can achieve performance comparable to that of alternative methods but with significantly fewer annotations.
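
A simplified sketch of a switchable selection criterion in the spirit of ASM, assuming PyTorch; the thresholds and the confidence-based rule are illustrative stand-ins for the paper's actual criteria:

```python
# Simplified routing of unlabeled proposals to human annotation vs. pseudo-labeling
# (illustrative thresholds; not the exact ASM criteria).
import torch

def route_samples(class_probs, low_conf=0.3, high_conf=0.9):
    """class_probs: [N, K] softmax outputs of the current detector for N proposals.
    Returns indices to send to human annotation (AL), indices to pseudo-label,
    and the corresponding pseudo-labels."""
    confidence, labels = class_probs.max(dim=1)
    to_annotate = (confidence < low_conf).nonzero(as_tuple=True)[0]    # uncertain
    to_pseudo = (confidence > high_conf).nonzero(as_tuple=True)[0]     # confident
    pseudo_labels = labels[to_pseudo]
    return to_annotate, to_pseudo, pseudo_labels

probs = torch.softmax(torch.randn(16, 21), dim=1)
al_idx, pl_idx, pl_labels = route_samples(probs)
```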

* Automatically determining whether an unlabeled sample should be manually annotated or pseudo-labeled via a novel self-learning process (accepted by TNNLS 2018). The source code is available at http://kezewang.com/codes/ASM_ver1.zip 

Toward Characteristic-Preserving Image-based Virtual Try-On Network

Sep 12, 2018
Bochao Wang, Huabin Zheng, Xiaodan Liang, Yimin Chen, Liang Lin, Meng Yang

Image-based virtual try-on systems, which fit new in-shop clothes onto a person image, have attracted increasing research attention, yet the task remains challenging. A desirable pipeline should not only transform the target clothes into the most fitting shape seamlessly but also preserve the clothing identity in the generated image, that is, the key characteristics (e.g., texture, logo, embroidery) that depict the original clothes. However, previous image-conditioned generation works fail to meet these critical requirements for plausible virtual try-on performance, since they cannot handle the large spatial misalignment between the input image and the target clothes. Prior work explicitly tackled spatial deformation using shape context matching, but failed to preserve clothing details due to its coarse-to-fine strategy. In this work, we propose a new fully learnable Characteristic-Preserving Virtual Try-On Network (CP-VTON) for addressing all real-world challenges in this task. First, CP-VTON learns a thin-plate spline transformation that warps the in-shop clothes to fit the body shape of the target person via a new Geometric Matching Module (GMM), rather than computing correspondences of interest points as prior works did. Second, to alleviate boundary artifacts of the warped clothes and make the results more realistic, we employ a Try-On Module that learns a composition mask to integrate the warped clothes and the rendered image, ensuring smoothness. Extensive experiments on a fashion dataset demonstrate that our CP-VTON achieves state-of-the-art virtual try-on performance both qualitatively and quantitatively.
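
A minimal sketch of the mask-based composition step described above, assuming PyTorch; the tensor names and sizes are ours, and in CP-VTON the mask and the rendered person are themselves predicted by the Try-On Module:

```python
# Minimal sketch of mask-based composition of warped clothes and a rendered person.
import torch

def compose_tryon(rendered_person, warped_clothes, mask):
    """Blend a warped clothing image into a rendered person image.
    mask is in [0, 1]: 1 keeps the warped clothes, 0 keeps the rendering."""
    return mask * warped_clothes + (1.0 - mask) * rendered_person

rendered = torch.rand(1, 3, 256, 192)
warped = torch.rand(1, 3, 256, 192)
mask = torch.rand(1, 1, 256, 192)                # broadcast over RGB channels
result = compose_tryon(rendered, warped, mask)   # [1, 3, 256, 192]
```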

* Accepted by ECCV 2018 

Interpretable Visual Question Answering by Reasoning on Dependency Trees

Sep 06, 2018
Qingxing Cao, Xiaodan Liang, Bailin Li, Liang Lin

Collaborative reasoning for understanding each image-question pair is critical yet underexplored for an interpretable visual question answering (VQA) system. Although very recent works have also attempted to use explicit compositional processes to assemble the multiple subtasks embedded in a question, their models heavily rely on annotations or handcrafted rules to obtain valid reasoning processes, leading to either heavy workloads or poor performance on compositional reasoning. In this paper, to better align the image and language domains in diverse and unrestricted cases, we propose a novel neural network model that performs global reasoning on a dependency tree parsed from the question; we thus term our model the parse-tree-guided reasoning network (PTGRN). This network consists of three collaborative modules: i) an attention module to exploit the local visual evidence for each word parsed from the question, ii) a gated residual composition module to compose the previously mined evidence, and iii) a parse-tree-guided propagation module to pass the mined evidence along the parse tree. Our PTGRN is thus capable of building an interpretable VQA system that gradually derives image cues by following a question-driven, parse-tree-guided reasoning route. Experiments on relational datasets demonstrate the superiority of our PTGRN over current state-of-the-art VQA methods, and visualization results highlight the explainable capability of our reasoning system.
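
A loose sketch of a gated residual composition step, assuming PyTorch: a learned gate controls how much newly mined evidence updates the running representation at a parse-tree node. The module and dimensions are our illustration, not the released PTGRN code:

```python
# Loose sketch of a gated residual update of node features with new evidence.
import torch
import torch.nn as nn

class GatedResidualComposition(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, current, evidence):
        # current, evidence: [B, D] features for one node of the parse tree.
        pair = torch.cat([current, evidence], dim=-1)
        g = torch.sigmoid(self.gate(pair))                    # per-dimension gate
        return current + g * torch.tanh(self.update(pair))    # gated residual update

comp = GatedResidualComposition()
out = comp(torch.randn(2, 512), torch.randn(2, 512))   # [2, 512]
```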

* 14 pages, 10 figures. arXiv admin note: text overlap with arXiv:1804.00105 