Liang Lin

Learning Warped Guidance for Blind Face Restoration

Apr 16, 2018

Xiaoming Li, Ming Liu, Yuting Ye, Wangmeng Zuo, Liang Lin, Ruigang Yang

Abstract: This paper studies the problem of blind face restoration from an unconstrained blurry, noisy, low-resolution, or compressed image (i.e., degraded observation). For better recovery of fine facial details, we modify the problem setting by taking both the degraded observation and a high-quality guided image of the same identity as input to our guided face restoration network (GFRNet). However, the degraded observation and guided image generally differ in pose, illumination, and expression, which makes plain CNNs (e.g., U-Net) fail to recover fine and identity-aware facial details. To tackle this issue, our GFRNet model includes both a warping subnetwork (WarpNet) and a reconstruction subnetwork (RecNet). The WarpNet is introduced to predict a flow field for warping the guided image to correct pose and expression (i.e., warped guidance), while the RecNet takes the degraded observation and warped guidance as input to produce the restoration result. Because the ground-truth flow field is unavailable, a landmark loss together with total variation regularization is incorporated to guide the learning of WarpNet. Furthermore, to make the model applicable to blind restoration, our GFRNet is trained on synthetic data with versatile settings of blur kernel, noise level, downsampling scale factor, and JPEG quality factor. Experiments show that our GFRNet not only performs favorably against state-of-the-art image and face restoration methods, but also generates visually photo-realistic results on real degraded facial images.

* 25 pages, 14 figures and 1 table
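
The abstract states that WarpNet is trained with a landmark loss plus total variation regularization because no ground-truth flow is available. The sketch below illustrates how two such terms could be combined. It is not the authors' implementation: the flow layout `flow[y][x] = (dx, dy)`, the form of `landmark_loss`, and the weight `tv_weight` are all assumptions made for illustration.

```python
def total_variation(flow):
    """Anisotropic total variation of a flow field stored as flow[y][x] = (dx, dy)."""
    height, width = len(flow), len(flow[0])
    tv = 0.0
    for y in range(height):
        for x in range(width):
            for c in range(2):  # horizontal and vertical flow components
                if x + 1 < width:
                    tv += abs(flow[y][x + 1][c] - flow[y][x][c])
                if y + 1 < height:
                    tv += abs(flow[y + 1][x][c] - flow[y][x][c])
    return tv


def landmark_loss(flow, target_landmarks, guided_landmarks):
    """Assumed form: the flow at each target landmark (x, y) should point at
    the matching landmark in the guided image."""
    loss = 0.0
    for (tx, ty), (gx, gy) in zip(target_landmarks, guided_landmarks):
        dx, dy = flow[ty][tx]
        loss += (tx + dx - gx) ** 2 + (ty + dy - gy) ** 2
    return loss


def warpnet_objective(flow, target_landmarks, guided_landmarks, tv_weight=0.1):
    """Landmark term plus weighted smoothness term (tv_weight is hypothetical)."""
    return (landmark_loss(flow, target_landmarks, guided_landmarks)
            + tv_weight * total_variation(flow))
```

A constant flow field has zero total variation, so in this sketch the regularizer only penalizes flows that change abruptly between neighbouring pixels.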

Crafting a Toolchain for Image Restoration by Deep Reinforcement Learning

Apr 10, 2018

Ke Yu, Chao Dong, Liang Lin, Chen Change Loy

Abstract: We investigate a novel approach for image restoration by reinforcement learning. Unlike existing studies that mostly train a single large network for a specialized task, we prepare a toolbox consisting of small-scale convolutional networks of different complexities and specialized in different tasks. Our method, RL-Restore, then learns a policy to select appropriate tools from the toolbox to progressively restore the quality of a corrupted image. We formulate a step-wise reward function proportional to how well the image is restored at each step to learn the action policy. We also devise a joint learning scheme to train the agent and tools for better performance in handling uncertainty. In comparison to conventional human-designed networks, RL-Restore is capable of restoring images corrupted with complex and unknown distortions in a more parameter-efficient manner using the dynamically formed toolchain.

* To appear at CVPR 2018 (Spotlight). Project page: http://mmlab.ie.cuhk.edu.hk/projects/RL-Restore/
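
The step-wise reward described in the abstract can be illustrated with a small sketch. It assumes, for illustration only, that restoration quality is measured by PSNR against a reference image and that the reward is the PSNR gain of one step; the function names and the flat-list image representation are hypothetical.

```python
import math


def psnr(image, reference, peak=255.0):
    """Peak signal-to-noise ratio between two equally sized pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(image, reference)) / len(reference)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


def step_reward(before, after, reference):
    """Reward for one tool application: how much PSNR improved in this step."""
    quality_before = psnr(before, reference)
    quality_after = psnr(after, reference)
    # Both images already match the reference: nothing gained, nothing lost.
    if math.isinf(quality_before) and math.isinf(quality_after):
        return 0.0
    return quality_after - quality_before
```

With this definition a tool that makes the image worse receives a negative reward, which gives a policy a reason to stop or to pick a different tool.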

Look into Person: Joint Body Parsing & Pose Estimation Network and A New Benchmark

Apr 05, 2018

Xiaodan Liang, Ke Gong, Xiaohui Shen, Liang Lin

Abstract: Human parsing and pose estimation have recently received considerable interest due to their substantial application potential. However, the existing datasets have limited numbers of images and annotations and lack a variety of human appearances and coverage of challenging cases in unconstrained environments. In this paper, we introduce a new benchmark named "Look into Person (LIP)" that provides a significant advancement in terms of scalability, diversity, and difficulty, which are crucial for future developments in human-centric analysis. This comprehensive dataset contains over 50,000 elaborately annotated images with 19 semantic part labels and 16 body joints, which are captured from a broad range of viewpoints, occlusions, and background complexities. Using these rich annotations, we perform detailed analyses of the leading human parsing and pose estimation approaches, thereby obtaining insights into the successes and failures of these methods. To further explore and take advantage of the semantic correlation of these two tasks, we propose a novel joint human parsing and pose estimation network to explore efficient context modeling, which can simultaneously predict parsing and pose with extremely high quality. Furthermore, we simplify the network to solve human parsing by exploring a novel self-supervised structure-sensitive learning approach, which imposes human pose structures on the parsing results without resorting to extra supervision. The dataset, code and models are available at http://www.sysu-hcp.net/lip/.

* We proposed the most comprehensive dataset around the world for human-centric analysis (accepted by T-PAMI 2018). The dataset, code and models are available at http://www.sysu-hcp.net/lip/. arXiv admin note: substantial text overlap with arXiv:1703.05446

Visual Question Reasoning on General Dependency Tree

Mar 31, 2018

Qingxing Cao, Xiaodan Liang, Bailing Li, Guanbin Li, Liang Lin

Abstract: Collaborative reasoning for understanding each image-question pair is critical but under-explored for an interpretable Visual Question Answering (VQA) system. Although very recent works have also tried explicit compositional processes to assemble multiple sub-tasks embedded in the questions, their models rely heavily on annotations or hand-crafted rules to obtain a valid reasoning layout, leading to either heavy labor or poor performance on compositional reasoning. In this paper, to enable global context reasoning for better aligning image and language domains in diverse and unrestricted cases, we propose a novel reasoning network called Adversarial Composition Modular Network (ACMN). This network comprises two collaborative modules: i) an adversarial attention module to exploit the local visual evidence for each word parsed from the question; ii) a residual composition module to compose the previously mined evidence. Given a dependency parse tree for each question, the adversarial attention module progressively discovers salient regions of one word by densely combining regions of child word nodes in an adversarial manner. Then the residual composition module merges the hidden representations of an arbitrary number of children through sum pooling and a residual connection. Our ACMN is thus capable of building an interpretable VQA system that gradually dives into the image cues following a question-driven reasoning route and performs global reasoning by incorporating the learned knowledge of all attention modules in a principled manner. Experiments on relational datasets demonstrate the superiority of our ACMN, and visualization results show the explainable capability of our reasoning system.

* Accepted as spotlight at CVPR 2018
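
The abstract says the residual composition module merges the hidden representations of an arbitrary number of children through sum pooling and a residual connection. A toy sketch of that idea follows. The function name, the plain-list vectors, and the `transform` callable are assumptions for illustration; the abstract does not specify the actual layer.

```python
def compose_children(child_states, transform):
    """Merge any number of child hidden vectors into one parent vector.

    child_states: non-empty list of equally sized lists of floats.
    transform: hypothetical learned mapping from a vector to a vector.
    """
    # Sum pooling: element-wise sum, independent of the number of children.
    pooled = [sum(values) for values in zip(*child_states)]
    # Residual connection: add the transformed vector back onto the pooled one.
    update = transform(pooled)
    return [p + u for p, u in zip(pooled, update)]
```

Because sum pooling is order-independent and accepts any number of inputs, the same module can be reused at every node of a dependency tree regardless of how many children the node has.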

Zoom and Learn: Generalizing Deep Stereo Matching to Novel Domains

Mar 18, 2018

Jiahao Pang, Wenxiu Sun, Chengxi Yang, Jimmy Ren, Ruichao Xiao, Jin Zeng, Liang Lin

Abstract: Despite the recent success of stereo matching with convolutional neural networks (CNNs), it remains arduous to generalize a pre-trained deep stereo model to a novel domain. A major difficulty is collecting accurate ground-truth disparities for stereo pairs in the target domain. In this work, we propose a self-adaptation approach for CNN training, utilizing both synthetic training data (with ground-truth disparities) and stereo pairs in the new domain (without ground truths). Our method is driven by two empirical observations. By feeding real stereo pairs of different domains to stereo models pre-trained with synthetic data, we see that: i) a pre-trained model does not generalize well to the new domain, producing artifacts at boundaries and ill-posed regions; however, ii) feeding an up-sampled stereo pair leads to a disparity map with extra details. To avoid i) while exploiting ii), we formulate an iterative optimization problem with graph Laplacian regularization. At each iteration, the CNN adapts itself better to the new domain: we let the CNN learn its own higher-resolution output; meanwhile, a graph Laplacian regularization is imposed to discriminatively keep the desired edges while smoothing out the artifacts. We demonstrate the effectiveness of our method in two domains: daily scenes collected by smartphone cameras, and street views captured from a moving car.

* Accepted at CVPR 2018
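
Graph Laplacian regularization, as mentioned in the abstract, penalizes disparity differences between connected pixels in proportion to an edge weight. The sketch below uses the standard identity that the quadratic form of a graph Laplacian equals the weighted sum of squared differences over undirected edges. Deriving the weights from a guide image with a Gaussian kernel, and the parameter `sigma`, are assumptions for illustration rather than the paper's construction.

```python
import math


def edge_weight(guide_i, guide_j, sigma=10.0):
    """Assumed weight: near 1 for similar guide intensities, near 0 across edges."""
    return math.exp(-((guide_i - guide_j) ** 2) / (2.0 * sigma ** 2))


def laplacian_regularizer(disparity, edges, weights):
    """Sum of w_ij * (d_i - d_j)^2 over undirected edges (i, j)."""
    return sum(w * (disparity[i] - disparity[j]) ** 2
               for (i, j), w in zip(edges, weights))
```

A small weight across a strong intensity edge means a disparity jump there is cheap, while the same jump inside a uniform region is expensive. That is the sense in which such a regularizer can keep edges while smoothing artifacts.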

Weakly Supervised Salient Object Detection Using Image Labels

Mar 17, 2018

Guanbin Li, Yuan Xie, Liang Lin

Abstract: Deep learning based salient object detection has recently achieved great success, with performance that greatly exceeds that of unsupervised methods. However, annotating per-pixel saliency masks is a tedious and inefficient procedure. In this paper, we note that superior salient object detection can be obtained by iteratively mining and correcting the labeling ambiguity on saliency maps from traditional unsupervised methods. We propose to use the combination of a coarse salient object activation map from the classification network and saliency maps generated from unsupervised methods as pixel-level annotation, and develop a simple yet very effective algorithm to train fully convolutional networks for salient object detection supervised by these noisy annotations. Our algorithm is based on alternately exploiting a graphical model and training a fully convolutional network for model updating. The graphical model corrects the internal labeling ambiguity through spatial consistency and structure preservation, while the fully convolutional network helps to correct the cross-image semantic ambiguity and simultaneously updates the coarse activation map for the next iteration. Experimental results demonstrate that our proposed method greatly outperforms all state-of-the-art unsupervised saliency detection methods and is comparable to the current best strongly supervised methods trained with thousands of pixel-level saliency map annotations on all public benchmarks.

* Accepted by AAAI 2018

Single View Stereo Matching

Mar 09, 2018

Yue Luo, Jimmy Ren, Mude Lin, Jiahao Pang, Wenxiu Sun, Hongsheng Li, Liang Lin

Abstract: Previous monocular depth estimation methods take a single view and directly regress the expected results. Though recent advances have been made by applying geometrically inspired loss functions during training, the inference procedure does not explicitly impose any geometrical constraint. Therefore, these models rely purely on the quality of data and the effectiveness of learning to generalize. This leads either to suboptimal results or to a demand for huge amounts of expensive ground-truth labelled data to generate reasonable results. In this paper, we show for the first time that the monocular depth estimation problem can be reformulated as two sub-problems, a view synthesis procedure followed by stereo matching, with two intriguing properties, namely i) geometrical constraints can be explicitly imposed during inference; ii) demand on labelled depth data can be greatly alleviated. We show that the whole pipeline can still be trained in an end-to-end fashion and that this new formulation plays a critical role in advancing the performance. The resulting model outperforms all the previous monocular depth estimation methods as well as the stereo block matching method on the challenging KITTI dataset using only a small amount of real training data. The model also generalizes well to other monocular depth estimation benchmarks. We also discuss the implications and the advantages of solving monocular depth estimation using stereo methods.

* Spotlight in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018
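
Once a second view has been synthesized and stereo matching has produced a disparity, depth follows from standard rectified-stereo geometry: depth equals focal length times baseline divided by disparity. The sketch below shows only that final conversion, not the authors' networks. The function name and the numeric values in the usage note are illustrative assumptions.

```python
def disparity_to_depth(disparity_px, focal_length_px, baseline_m):
    """Depth in metres from disparity in pixels for a rectified stereo pair."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive to yield a finite depth")
    return focal_length_px * baseline_m / disparity_px
```

For example, with an assumed focal length of 700 px and a baseline of 0.5 m, a disparity of 35 px corresponds to a depth of 10 m, and halving the disparity doubles the depth.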

LSTM Pose Machines

Mar 09, 2018

Yue Luo, Jimmy Ren, Zhouxia Wang, Wenxiu Sun, Jinshan Pan, Jianbo Liu, Jiahao Pang, Liang Lin

Abstract: We observed that recent state-of-the-art results on single image human pose estimation were achieved by multi-stage Convolutional Neural Networks (CNNs). Notwithstanding their superior performance on static images, the application of these models to videos is not only computationally intensive but also suffers from performance degeneration and flickering. Such suboptimal results are mainly attributed to the inability to impose sequential geometric consistency, to handle severe image quality degradation (e.g., motion blur and occlusion), and to capture the temporal correlation among video frames. In this paper, we propose a novel recurrent network to tackle these problems. We show that if we impose a weight sharing scheme on the multi-stage CNN, it can be rewritten as a Recurrent Neural Network (RNN). This property decouples the relationship among multiple network stages and results in significantly faster speed in invoking the network for videos. It also enables the adoption of Long Short-Term Memory (LSTM) units between video frames. We found that such a memory-augmented RNN is very effective in imposing geometric consistency among frames. It also handles input quality degradation in videos well while successfully stabilizing the sequential outputs. The experiments show that our approach significantly outperforms current state-of-the-art methods on two large-scale video pose estimation benchmarks. We also explore the memory cells inside the LSTM and provide insights into why such a mechanism benefits prediction in video-based pose estimation.

* Poster in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018
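
The claim that a weight-shared multi-stage CNN can be rewritten as an RNN can be illustrated with a toy sketch: when every stage uses the same function, running the stages in sequence is the same computation as unrolling one recurrent step. The stage functions here are hypothetical scalar stand-ins for network stages, not the paper's architecture.

```python
def run_stages(features, stage_fns, initial_belief):
    """Multi-stage refinement: each stage has its own function (own weights)."""
    belief = initial_belief
    for stage_fn in stage_fns:
        belief = stage_fn(features, belief)
    return belief


def run_recurrent(features, shared_fn, initial_belief, steps):
    """Recurrent form: one shared function applied repeatedly."""
    belief = initial_belief
    for _ in range(steps):
        belief = shared_fn(features, belief)
    return belief
```

Passing the same function for every stage of `run_stages` gives exactly the result of `run_recurrent` with the same number of steps, which is the sense in which weight sharing turns the stage chain into a recurrence.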

Deep Cocktail Network: Multi-source Unsupervised Domain Adaptation with Category Shift

Mar 02, 2018

Ruijia Xu, Ziliang Chen, Wangmeng Zuo, Junjie Yan, Liang Lin

Abstract: Unsupervised domain adaptation (UDA) conventionally assumes that labeled source samples come from a single underlying source distribution. In practical scenarios, however, labeled data are typically collected from diverse sources. The multiple sources differ not only from the target but also from each other; thus, the domain adapters should not be modeled in the same way. Moreover, those sources may not completely share their categories, which brings a further transfer challenge called category shift. In this paper, we propose a deep cocktail network (DCTN) to battle the domain and category shifts among multiple sources. Motivated by the theoretical results of Mansour et al. (2009), the target distribution can be represented as a weighted combination of source distributions, and multi-source unsupervised domain adaptation via DCTN is then performed as two alternating steps: i) It deploys multi-way adversarial learning to minimize the discrepancy between the target and each of the multiple source domains, which also yields source-specific perplexity scores that denote the possibilities that a target sample belongs to different source domains. ii) The multi-source category classifiers are integrated with the perplexity scores to classify target samples, and the pseudo-labeled target samples together with source samples are used to update the multi-source category classifier and the feature extractor. We evaluate DCTN on three domain adaptation benchmarks, and the results clearly demonstrate the superiority of our framework.

* Accepted for publication in Conference on Computer Vision and Pattern Recognition (CVPR), 2018
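
Step ii) of the abstract integrates the source classifiers with perplexity scores to classify a target sample. The sketch below shows one simple way such a weighted combination could look. It assumes every source classifier outputs probabilities over the same categories, which ignores the category shift the paper addresses. The function name and the normalization of scores into weights are illustrative assumptions.

```python
def combine_predictions(source_probs, perplexity_scores):
    """Weighted average of per-source class probabilities.

    source_probs: one probability list per source classifier (same length each).
    perplexity_scores: one non-negative score per source, not all zero.
    """
    total = sum(perplexity_scores)
    weights = [score / total for score in perplexity_scores]
    num_classes = len(source_probs[0])
    return [sum(w * probs[c] for w, probs in zip(weights, source_probs))
            for c in range(num_classes)]
```

With scores of 3 and 1, the first source contributes three quarters of the final prediction, so a target sample that looks more like one source leans on that source's classifier.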

Deep Structured Scene Parsing by Learning with Image Descriptions

Feb 28, 2018

Liang Lin, Guangrun Wang, Rui Zhang, Ruimao Zhang, Xiaodan Liang, Wangmeng Zuo

Abstract: This paper addresses a fundamental problem of scene understanding: how to parse a scene image into a structured configuration (i.e., a semantic object hierarchy with object interaction relations) that closely accords with human perception. We propose a deep architecture consisting of two networks: i) a convolutional neural network (CNN) extracting the image representation for pixelwise object labeling and ii) a recursive neural network (RNN) discovering the hierarchical object structure and the inter-object relations. Rather than relying on elaborate user annotations (e.g., manually labeling semantic maps and relations), we train our deep model in a weakly supervised manner by leveraging the descriptive sentences of the training images. Specifically, we decompose each sentence into a semantic tree consisting of nouns and verb phrases, and use these trees to discover the configurations of the training images. Once these scene configurations are determined, the parameters of both the CNN and RNN are updated accordingly by back propagation. The entire model training is accomplished through an Expectation-Maximization method. Extensive experiments suggest that our model is capable of producing meaningful and structured scene configurations and achieving more favorable scene labeling performance on PASCAL VOC 2012 than other state-of-the-art weakly supervised methods.

* Discovering a semantic object hierarchy with object interaction relations (published in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, oral)
