Liang Lin

A Deep Structured Model with Radius-Margin Bound for 3D Human Activity Recognition

Dec 05, 2015
Liang Lin, Keze Wang, Wangmeng Zuo, Meng Wang, Jiebo Luo, Lei Zhang

Understanding human activity is very challenging even with recently developed 3D/depth sensors. To address this problem, this work investigates a novel deep structured model that adaptively decomposes an activity instance into temporal parts using convolutional neural networks (CNNs). Our model advances traditional deep learning approaches in two aspects. First, we incorporate latent temporal structure into the deep model, accounting for the large temporal variations of diverse human activities. In particular, we utilize latent variables to decompose the input activity into a number of temporally segmented sub-activities and accordingly feed them into the parts (i.e., sub-networks) of the deep architecture. Second, we incorporate a radius-margin bound as a regularization term into our deep model, which effectively improves the generalization performance for classification. For model training, we propose a principled learning algorithm that iteratively (i) discovers the optimal latent variables (i.e., the ways of activity decomposition) for all training instances, (ii) updates the classifiers based on the generated features, and (iii) updates the parameters of the multi-layer neural networks. In the experiments, our approach is validated on several complex scenarios for human activity recognition and demonstrates superior performance over other state-of-the-art approaches.
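
As a concrete illustration of how a radius-margin bound can act as a regularizer on top of deep features, the following PyTorch sketch combines a multi-class hinge loss with an R^2*||w||^2-style term, where the radius is approximated by the largest squared distance of the batch features to their mean. This is a minimal reading of the abstract, not the authors' formulation: the weights `C` and `lam`, the choice of hinge loss, and the batch-wise radius surrogate are all assumptions.

```python
# Minimal sketch of a radius-margin regularized objective on deep features.
# `scores` are classifier outputs, `features` the penultimate-layer activations,
# `weight` the classifier weight matrix; all hyper-parameters are illustrative.
import torch
import torch.nn as nn

class RadiusMarginLoss(nn.Module):
    def __init__(self, C=1.0, lam=0.1):
        super().__init__()
        self.C, self.lam = C, lam

    def forward(self, scores, labels, features, weight):
        # Multi-class hinge loss encourages a large functional margin.
        hinge = nn.functional.multi_margin_loss(scores, labels)
        # ||w||^2 is the margin term; the radius of the feature distribution is
        # approximated by the largest squared distance to the feature mean
        # (a common surrogate for the minimum-enclosing-ball radius).
        w_norm = weight.pow(2).sum()
        radius = (features - features.mean(dim=0)).pow(2).sum(dim=1).max()
        return self.C * hinge + 0.5 * w_norm + self.lam * radius * w_norm
```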

* International Journal of Computer Vision, Volume 118, Issue 2, pp 256-273 (June 2016)  
* 16 pages, 9 figures, to appear in International Journal of Computer Vision 2015 

Reversible Recursive Instance-level Object Segmentation

Nov 18, 2015
Xiaodan Liang, Yunchao Wei, Xiaohui Shen, Zequn Jie, Jiashi Feng, Liang Lin, Shuicheng Yan

In this work, we propose a novel Reversible Recursive Instance-level Object Segmentation (R2-IOS) framework to address the challenging instance-level object segmentation task. R2-IOS consists of a reversible proposal refinement sub-network that predicts bounding box offsets for refining the object proposal locations, and an instance-level segmentation sub-network that generates the foreground mask of the dominant object instance in each proposal. By being recursive, R2-IOS iteratively optimizes the two sub-networks during joint training, in which the refined object proposals and improved segmentation predictions are alternately fed into each other to progressively increase the network capabilities. By being reversible, the proposal refinement sub-network adaptively determines an optimal number of refinement iterations required for each proposal during both training and testing. Furthermore, to handle multiple overlapped instances within a proposal, an instance-aware denoising autoencoder is introduced into the segmentation sub-network to distinguish the dominant object from other distracting instances. Extensive experiments on the challenging PASCAL VOC 2012 benchmark well demonstrate the superiority of R2-IOS over other state-of-the-art methods. In particular, the $\text{AP}^r$ over $20$ classes at $0.5$ IoU achieves $66.7\%$, which significantly outperforms the results of $58.7\%$ by PFN [PFN] and $46.3\%$ by Liu et al. [liu2015multi].
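
The recursive/reversible interaction between the two sub-networks can be pictured with a short control loop like the sketch below; `refine_net`, `seg_net`, `apply_offsets`, the iteration cap and the stopping threshold are hypothetical placeholders, not the released R2-IOS code.

```python
# Schematic refinement loop around two black-box sub-networks.
def recursive_refine(image_feats, proposal, refine_net, seg_net,
                     apply_offsets, max_iters=4, stop_thresh=0.05):
    mask = None
    for _ in range(max_iters):
        # Proposal refinement sub-network predicts box offsets (dx, dy, dw, dh).
        offsets, confidence = refine_net(image_feats, proposal, mask)
        proposal = apply_offsets(proposal, offsets)
        # Segmentation sub-network predicts the dominant-instance mask
        # inside the (refined) proposal.
        mask = seg_net(image_feats, proposal)
        # "Reversible": stop early once the predicted offsets become small,
        # so each proposal gets its own number of refinement iterations.
        if max(abs(o) for o in offsets) < stop_thresh:
            break
    return proposal, mask
```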

* 9 pages 

Semantic Object Parsing with Local-Global Long Short-Term Memory

Nov 14, 2015
Xiaodan Liang, Xiaohui Shen, Donglai Xiang, Jiashi Feng, Liang Lin, Shuicheng Yan

Semantic object parsing is a fundamental task for understanding objects in detail in the computer vision community, where incorporating multi-level contextual information is critical for achieving such fine-grained pixel-level recognition. Prior methods often leverage the contextual information through post-processing of the predicted confidence maps. In this work, we propose a novel deep Local-Global Long Short-Term Memory (LG-LSTM) architecture to seamlessly incorporate short-distance and long-distance spatial dependencies into the feature learning over all pixel positions. In each LG-LSTM layer, local guidance from neighboring positions and global guidance from the whole image are imposed on each position to better exploit complex local and global contextual information. Individual LSTMs for distinct spatial dimensions are also utilized to intrinsically capture the various spatial layouts of semantic parts in the images, yielding distinct hidden and memory cells at each position for each dimension. In our parsing approach, several LG-LSTM layers are stacked and appended to the intermediate convolutional layers to directly enhance visual features, allowing the network parameters to be learned in an end-to-end way. The long chains of sequential computation by the stacked LG-LSTM layers also enable each pixel to sense a much larger region for inference, benefiting from the memorization of previous dependencies at all positions along all dimensions. Comprehensive evaluations on three public datasets well demonstrate the significant superiority of our LG-LSTM over other state-of-the-art methods.
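
To make the local-plus-global guidance concrete, here is a deliberately simplified single LG-LSTM-style step in PyTorch: every position receives its own feature, the hidden states of its 3x3 neighbourhood (local guidance) and an image-level pooled hidden state (global guidance). The real layer runs separate LSTMs along several spatial dimensions; this sketch collapses them into one `LSTMCell`, and all tensor shapes and channel sizes are assumptions.

```python
# Simplified local-global LSTM step; not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleLGLSTMStep(nn.Module):
    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        # Input = pixel feature + 3x3 local hidden context + global hidden context.
        self.cell = nn.LSTMCell(feat_dim + 9 * hidden_dim + hidden_dim, hidden_dim)
        self.hidden_dim = hidden_dim

    def forward(self, feats, h, c):
        # feats: (B, F, H, W); h, c: (B, D, H, W)
        B, _, H, W = feats.shape
        local = F.unfold(h, kernel_size=3, padding=1)            # (B, 9*D, H*W)
        global_h = h.mean(dim=(2, 3), keepdim=True).expand(-1, -1, H, W)
        x = torch.cat([feats, global_h], dim=1)                  # (B, F+D, H, W)
        x = torch.cat([x.flatten(2), local], dim=1)              # (B, F+10D, H*W)
        x = x.permute(0, 2, 1).reshape(B * H * W, -1)
        h_flat = h.flatten(2).permute(0, 2, 1).reshape(B * H * W, -1)
        c_flat = c.flatten(2).permute(0, 2, 1).reshape(B * H * W, -1)
        h_new, c_new = self.cell(x, (h_flat, c_flat))
        shape = (B, H, W, self.hidden_dim)
        return (h_new.view(shape).permute(0, 3, 1, 2),
                c_new.view(shape).permute(0, 3, 1, 2))
```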

* 10 pages 

Proposal-free Network for Instance-level Object Segmentation

Sep 10, 2015
Xiaodan Liang, Yunchao Wei, Xiaohui Shen, Jianchao Yang, Liang Lin, Shuicheng Yan

Instance-level object segmentation is an important yet under-explored task. The few existing studies are almost all based on region proposal methods: they extract candidate segments and then rely on object classification to produce the final results. Nonetheless, generating accurate region proposals is itself quite challenging. In this work, we propose a Proposal-Free Network (PFN) to address the instance-level object segmentation problem. Based on a pixel-to-pixel deep convolutional neural network, PFN outputs the instance numbers of different categories together with pixel-level information on 1) the coordinates of the instance bounding box each pixel belongs to, and 2) the confidences of different categories for each pixel. Taken together, these outputs can naturally generate the ultimate instance-level object segmentation results by using any off-the-shelf clustering method for simple post-processing. The whole PFN can be easily trained in an end-to-end way without requiring a proposal generation stage. Extensive evaluations on the challenging PASCAL VOC 2012 semantic segmentation benchmark demonstrate that the proposed PFN solution clearly outperforms the state-of-the-art methods for instance-level object segmentation. In particular, the $\text{AP}^r$ over 20 classes at 0.5 IoU reaches 58.7% with PFN, significantly higher than the 43.8% and 46.3% obtained by the state-of-the-art algorithms SDS [9] and [16], respectively.
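
The clustering-based post-processing can be sketched as follows: pixels of one category are grouped by their predicted instance-box coordinates, with the predicted instance count fixing the number of clusters. The array names and the choice of k-means are assumptions; the abstract only calls for "any off-the-shelf clustering method".

```python
# Proposal-free post-processing sketch: cluster pixels by predicted instance boxes.
import numpy as np
from sklearn.cluster import KMeans

def pixels_to_instances(category_map, box_coord_map, instance_count):
    """
    category_map:   (H, W) int array of per-pixel predicted categories.
    box_coord_map:  (H, W, 4) array of per-pixel predicted box (x1, y1, x2, y2).
    instance_count: dict {category_id: predicted number of instances}.
    Returns an (H, W) instance-id map (0 = background).
    """
    instance_map = np.zeros(category_map.shape, dtype=np.int32)
    next_id = 1
    for cat, k in instance_count.items():
        ys, xs = np.where(category_map == cat)
        if len(ys) == 0 or k == 0:
            continue
        coords = box_coord_map[ys, xs]                 # (N, 4) clustering features
        labels = KMeans(n_clusters=min(k, len(ys)), n_init=10).fit_predict(coords)
        instance_map[ys, xs] = labels + next_id
        next_id += labels.max() + 1
    return instance_map
```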

* 14 pages 

Bit-Scalable Deep Hashing with Regularized Similarity Learning for Image Retrieval and Person Re-identification

Aug 21, 2015
Ruimao Zhang, Liang Lin, Rui Zhang, Wangmeng Zuo, Lei Zhang

Extracting informative image features and learning effective approximate hashing functions are two crucial steps in image retrieval. Conventional methods often study these two steps separately, e.g., learning hash functions from a predefined hand-crafted feature space. Meanwhile, the bit lengths of the output hashing codes are preset in most previous methods, neglecting the significance level of different bits and restricting their practical flexibility. To address these issues, we propose a supervised learning framework to generate compact and bit-scalable hashing codes directly from raw images. We pose hashing learning as a problem of regularized similarity learning. Specifically, we organize the training images into a batch of triplet samples, each sample containing two images with the same label and one with a different label. With these triplet samples, we maximize the margin between matched pairs and mismatched pairs in the Hamming space. In addition, a regularization term is introduced to enforce adjacency consistency, i.e., images of similar appearance should have similar codes. A deep convolutional neural network is utilized to train the model in an end-to-end fashion, where discriminative image features and hash functions are simultaneously optimized. Furthermore, each bit of our hashing codes is unequally weighted, so that we can manipulate the code lengths by truncating the insignificant bits. Our framework outperforms state-of-the-art methods on public benchmarks of similar image search and also achieves promising results in the application of person re-identification in surveillance. It is also shown that the generated bit-scalable hashing codes well preserve their discriminative power at shorter code lengths.
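
A minimal PyTorch sketch of the triplet-based, bit-weighted hashing objective is given below: codes come from a tanh relaxation, each bit carries a learnable weight, and matched pairs are pushed closer than mismatched pairs in the weighted Hamming space, with a small adjacency-consistency term. The margin, the regularization weight and the exact form of the soft Hamming distance are assumptions, not the paper's released objective.

```python
# Sketch of a bit-weighted triplet ranking loss in a relaxed Hamming space.
import torch
import torch.nn as nn

class BitScalableTripletLoss(nn.Module):
    def __init__(self, num_bits, margin=2.0, reg=0.01):
        super().__init__()
        # One learnable non-negative weight per bit ("significance level").
        self.bit_weights = nn.Parameter(torch.ones(num_bits))
        self.margin, self.reg = margin, reg

    def weighted_hamming(self, a, b):
        # a, b in (-1, 1) after tanh; (1 - a*b)/2 is a soft per-bit disagreement.
        return (self.bit_weights.abs() * (1.0 - a * b) / 2.0).sum(dim=1)

    def forward(self, anchor, positive, negative):
        a, p, n = torch.tanh(anchor), torch.tanh(positive), torch.tanh(negative)
        d_pos = self.weighted_hamming(a, p)
        d_neg = self.weighted_hamming(a, n)
        triplet = torch.clamp(self.margin + d_pos - d_neg, min=0).mean()
        # Adjacency consistency: similar images should have similar codes.
        adjacency = d_pos.mean()
        return triplet + self.reg * adjacency
```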

* 14 pages, 5 figures. IEEE Transactions on Image Processing 2015 

Deep Boosting: Joint Feature Selection and Analysis Dictionary Learning in Hierarchy

Aug 11, 2015
Zhanglin Peng, Ya Li, Zhaoquan Cai, Liang Lin

This work investigates how traditional image classification pipelines can be extended into a deep architecture, inspired by the recent successes of deep neural networks. We propose a deep boosting framework based on layer-by-layer joint feature boosting and dictionary learning. In each layer, we construct a dictionary of filters by combining the filters from the lower layer, and iteratively optimize the image representation with a joint discriminative-generative formulation, i.e., minimization of the empirical classification error plus a regularization of analysis image generation over the training images. For optimization, we perform two iterative steps: i) to minimize the classification error, we select the most discriminative features using the gentle AdaBoost algorithm; ii) according to the feature selection, we update the filters to minimize the regularization on the analysis image representation using gradient descent. Once the optimization converges, we learn the higher-layer representation in the same way. Our model delivers several distinct advantages. First, our layer-wise optimization provides the potential to build very deep architectures. Second, the generated image representation is compact and meaningful. In several visual recognition tasks, our framework outperforms existing state-of-the-art approaches.
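
The layer-wise alternation between feature boosting and dictionary updating can be summarized in a short driver loop; the helper callables (`build_dictionary`, `gentle_adaboost_select`, `update_filters`, `apply_filters`), the layer count and the inner iteration count are placeholders standing in for the paper's components, not its implementation.

```python
# Schematic layer-wise driver for joint feature boosting and dictionary learning.
def train_deep_boosting(images, labels, build_dictionary, gentle_adaboost_select,
                        update_filters, apply_filters, num_layers=3, inner_iters=10):
    responses = images                        # layer-0 representation: raw pixels
    layers = []
    for _ in range(num_layers):
        # Build this layer's filter dictionary by composing lower-layer filters.
        filters = build_dictionary(layers, responses)
        for _ in range(inner_iters):
            # (i) discriminative step: select the features that minimize
            #     classification error with gentle AdaBoost.
            selected, classifier = gentle_adaboost_select(responses, labels, filters)
            # (ii) generative step: gradient descent on the filters so the
            #      analysis representation regularizes the selected features.
            filters = update_filters(filters, selected, responses)
        responses = apply_filters(filters, responses)   # feed into the next layer
        layers.append((filters, classifier))
    return layers
```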

PISA: Pixelwise Image Saliency by Aggregating Complementary Appearance Contrast Measures with Edge-Preserving Coherence

May 13, 2015
Keze Wang, Liang Lin, Jiangbo Lu, Chenglong Li, Keyang Shi

Driven by recent vision and graphics applications such as image segmentation and object recognition, computing pixel-accurate saliency values to uniformly highlight foreground objects has become increasingly important. In this paper, we propose a unified framework called PISA, which stands for Pixelwise Image Saliency Aggregating various bottom-up cues and priors. It generates spatially coherent yet detail-preserving, pixel-accurate and fine-grained saliency, and overcomes the limitations of previous methods that rely on homogeneous superpixel-based processing and color-only treatment. PISA aggregates multiple saliency cues in a global context, namely complementary color and structure contrast measures, with their spatial priors in the image domain. The saliency confidence is further jointly modeled with a neighborhood consistency constraint in an energy minimization formulation, in which each pixel is evaluated with multiple hypothetical saliency levels. Instead of using global discrete optimization methods, we employ the cost-volume filtering technique to solve our formulation, assigning the saliency levels smoothly while preserving the edge-aware structure details. In addition, a faster version of PISA is developed using a gradient-driven image sub-sampling strategy that greatly improves the runtime efficiency while keeping comparable detection accuracy. Extensive experiments on a number of public datasets suggest that PISA convincingly outperforms other state-of-the-art approaches. With this work we also create a new dataset containing $800$ commodity images for evaluating saliency detection. The dataset and source code of PISA can be downloaded at http://vision.sysu.edu.cn/project/PISA/
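
The cost-volume filtering step can be sketched as follows: each pixel is scored against a few hypothetical saliency levels, every cost slice is smoothed by an edge-preserving filter guided by the color image, and the cheapest level wins. The guided filter (from opencv-contrib-python), the number of levels and the filter parameters are assumptions used for illustration, not the paper's exact choices.

```python
# Cost-volume filtering sketch for edge-aware saliency level assignment.
import cv2
import numpy as np

def assign_saliency_levels(guide_bgr, raw_saliency, num_levels=8, radius=8, eps=1e-3):
    # raw_saliency: (H, W) float32 in [0, 1], aggregated from the contrast cues.
    guide = guide_bgr.astype(np.float32) / 255.0
    levels = np.linspace(0.0, 1.0, num_levels, dtype=np.float32)
    slices = []
    for level in levels:
        # Data cost: how far the aggregated cue is from this hypothesis.
        cost = np.abs(raw_saliency - level).astype(np.float32)
        # Smooth each cost slice with the color image as guide, so the
        # assignment follows image edges (requires opencv-contrib-python).
        slices.append(cv2.ximgproc.guidedFilter(guide, cost, radius, eps))
    best = np.argmin(np.stack(slices, axis=0), axis=0)
    return levels[best]          # (H, W) smoothed, quantized saliency map
```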

* IEEE Transactions on Image Processing (TIP), Volume 24, Issue 10, pp 3019-3033 (Oct 2015)  
* 14 pages, 14 figures, 1 table, to appear in IEEE Transactions on Image Processing 

Computational Baby Learning

May 04, 2015
Xiaodan Liang, Si Liu, Yunchao Wei, Luoqi Liu, Liang Lin, Shuicheng Yan

Intuitive observations show that a baby may inherently possess the capability of recognizing a new visual concept (e.g., chair, dog) by learning from only very few positive instances taught by parent(s) or others, and that this recognition capability can be gradually improved by exploring and/or interacting with real instances in the physical world. Inspired by these observations, we propose a computational model for slightly-supervised object detection, based on prior knowledge modelling, exemplar learning and learning with video contexts. The prior knowledge is modeled with a pre-trained Convolutional Neural Network (CNN). When very few instances of a new concept are given, an initial concept detector is built by exemplar learning over the deep features from the pre-trained CNN. Simulating the baby's interaction with the physical world, a well-designed tracking solution is then used to discover more diverse instances from massive online unlabeled videos. Once a positive instance is detected/identified with a high score in a video, more variable instances, possibly from different view angles and/or different distances, are tracked and accumulated. The concept detector is then fine-tuned on these new instances. This process is repeated until we obtain a mature concept detector. Extensive experiments on the Pascal VOC-07/10/12 object detection datasets well demonstrate the effectiveness of our framework. It can beat the performance of state-of-the-art fully-trained detectors by learning from very few samples for each object category, together with about 20,000 unlabeled videos.
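
The explore-track-finetune cycle reads naturally as a self-training loop; the sketch below passes the detector construction, detection, tracking and fine-tuning steps in as black boxes, and the round count and confidence threshold are illustrative assumptions.

```python
# Schematic self-training loop for slightly-supervised concept learning.
def baby_learning(seed_examples, unlabeled_videos, build_exemplar_detector,
                  detect, track, finetune, rounds=5, score_thresh=0.9):
    detector = build_exemplar_detector(seed_examples)    # from very few positives
    for _ in range(rounds):
        new_instances = []
        for video in unlabeled_videos:
            hit = detect(detector, video, score_thresh)  # one confident frame, or None
            if hit is not None:
                # Track forwards/backwards to harvest more varied instances
                # (different view angles, scales) of the same object.
                new_instances.extend(track(video, hit))
        if not new_instances:
            break
        detector = finetune(detector, new_instances)
    return detector
```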

* 9 pages 

F-SVM: Combination of Feature Transformation and SVM Learning via Convex Relaxation

Apr 20, 2015
Xiaohe Wu, Wangmeng Zuo, Yuanyuan Zhu, Liang Lin

The generalization error bound of the support vector machine (SVM) depends on the ratio of the radius and the margin, while the standard SVM only considers maximization of the margin and ignores minimization of the radius. Several approaches have been proposed to integrate radius and margin for joint learning of a feature transformation and an SVM classifier. However, most of them either require the transformation matrix to be diagonal, or are non-convex and computationally expensive. In this paper, we suggest a novel approximation for the radius of the minimum enclosing ball (MEB) in feature space, and then propose a convex radius-margin based SVM model for joint learning of the feature transformation and the SVM classifier, i.e., F-SVM. An alternating minimization method is adopted to solve the F-SVM model, where the feature transformation is updated via gradient descent and the classifier is updated by employing an existing SVM solver. By incorporating kernel principal component analysis, F-SVM is further extended to joint learning of a nonlinear transformation and the classifier. Experimental results on the UCI machine learning datasets and the LFW face dataset show that F-SVM outperforms the standard SVM and existing radius-margin based SVMs, e.g., RMM, R-SVM+ and R-SVM+μ.
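
A toy version of the alternating minimization might look like the sketch below: the classifier is refit with an off-the-shelf linear SVM on the transformed features, and the transformation A then takes a gradient step on a crude radius surrogate (the average squared distance to the feature mean), with a norm renormalization to avoid the trivial A -> 0 solution. The surrogate, step size and renormalization are assumptions and do not reproduce the paper's convex formulation.

```python
# Toy alternating minimization between a feature transformation and an SVM.
import numpy as np
from sklearn.svm import LinearSVC

def f_svm_sketch(X, y, dim_out=None, iters=10, lr=1e-2):
    n, d = X.shape
    A = np.eye(dim_out or d, d)                  # initial feature transformation
    norm0 = np.linalg.norm(A)
    clf = None
    for _ in range(iters):
        Z = X @ A.T
        clf = LinearSVC(C=1.0).fit(Z, y)         # classifier update (SVM solver)
        # Transformation update: gradient of the radius surrogate tr(A S A^T),
        # where S is the covariance of the training features.
        Xc = X - X.mean(axis=0, keepdims=True)
        S = Xc.T @ Xc / n
        A = A - lr * 2.0 * (A @ S)
        A *= norm0 / np.linalg.norm(A)           # keep ||A||_F fixed
    return A, clf
```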

* 11 pages, 5 figures 

End-to-End Photo-Sketch Generation via Fully Convolutional Representation Learning

Apr 11, 2015
Liliang Zhang, Liang Lin, Xian Wu, Shengyong Ding, Lei Zhang

Sketch-based face recognition is an interesting task in vision and multimedia research, yet it is quite challenging due to the great difference between face photos and sketches. In this paper, we propose a novel approach for photo-sketch generation, aiming to automatically transform face photos into detail-preserving personal sketches. Unlike traditional models that synthesize sketches based on a dictionary of exemplars, we develop a fully convolutional network to learn the end-to-end photo-sketch mapping. Our approach takes whole face photos as inputs and directly generates the corresponding sketch images with efficient inference and learning, in which the architecture is stacked with only convolutional kernels of very small size. To well capture person identity during the photo-sketch transformation, we define our optimization objective in the form of a joint generative-discriminative minimization. In particular, a discriminative regularization term is incorporated into the photo-sketch generation, enhancing the discriminability of the generated person sketches against other individuals. Extensive experiments on several standard benchmarks suggest that our approach outperforms other state-of-the-art methods in both photo-sketch generation and face sketch verification.
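
To make the end-to-end mapping concrete, the sketch below stacks small 3x3 convolutions into a photo-to-sketch generator and trains it with a pixel-wise generative loss plus a simple embedding-based discriminative term. The depth, channel widths, and the contrastive surrogate for the discriminative regularization (including the hypothetical `embed` network) are assumptions, not the authors' architecture.

```python
# Fully convolutional photo-to-sketch sketch with a joint generative-discriminative loss.
import torch
import torch.nn as nn

class PhotoSketchFCN(nn.Module):
    def __init__(self, width=64, depth=5):
        super().__init__()
        layers, in_ch = [], 3
        for _ in range(depth):
            layers += [nn.Conv2d(in_ch, width, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
            in_ch = width
        layers += [nn.Conv2d(width, 1, kernel_size=3, padding=1)]
        self.net = nn.Sequential(*layers)

    def forward(self, photo):                      # (B, 3, H, W) -> (B, 1, H, W)
        return self.net(photo)

def joint_loss(pred_sketch, gt_sketch, embed, labels, alpha=0.1):
    # Generative term: reconstruct the target sketch pixel-wise.
    generative = nn.functional.mse_loss(pred_sketch, gt_sketch)
    # Discriminative term: sketches of the same person should embed closer
    # than sketches of different people (a contrastive surrogate for the
    # paper's discriminative regularization; `embed` is a hypothetical
    # identity-embedding network).
    emb = embed(pred_sketch)                       # (B, D) identity embeddings
    dist = torch.cdist(emb, emb)                   # pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    discriminative = dist[same].mean() - dist[~same].mean()
    return generative + alpha * discriminative
```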

* 8 pages, 6 figures. Proceeding in ACM International Conference on Multimedia Retrieval (ICMR), 2015 