Liang Lin

Matching-CNN Meets KNN: Quasi-Parametric Human Parsing

Apr 06, 2015
Si Liu, Xiaodan Liang, Luoqi Liu, Xiaohui Shen, Jianchao Yang, Changsheng Xu, Liang Lin, Xiaochun Cao, Shuicheng Yan

Both parametric and non-parametric approaches have demonstrated encouraging performance on the human parsing task, namely segmenting a human image into several semantic regions (e.g., hat, bag, left arm, face). In this work, we aim to develop a new solution with the advantages of both methodologies, namely supervision from annotated data and the flexibility to use newly annotated (possibly uncommon) images, and present a quasi-parametric human parsing model. Under the classic K Nearest Neighbor (KNN)-based non-parametric framework, a parametric Matching Convolutional Neural Network (M-CNN) is proposed to predict the matching confidence and displacement of the best-matched region in the testing image for a particular semantic region in one KNN image. Given a testing image, we first retrieve its KNN images from the annotated/manually-parsed human image corpus. Then each semantic region in each KNN image is matched with confidence to the testing image using the M-CNN, and the matched regions from all KNN images are further fused, followed by a superpixel smoothing procedure to obtain the final human parsing result. The M-CNN differs from the classic CNN in that tailored cross-image matching filters are introduced to characterize the matching between the testing image and the semantic region of a KNN image. The cross-image matching filters are defined at different convolutional layers, each aiming to capture a particular range of displacements. Comprehensive evaluations on a large dataset with 7,700 annotated human images demonstrate a significant performance gain of the quasi-parametric model over state-of-the-art methods for the human parsing task.
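
The transfer-and-fuse step can be pictured with a minimal sketch: a stub `match_region` stands in for the M-CNN (which in the paper predicts a matching confidence and displacement per semantic region), and the parsing result is obtained by confidence-weighted voting over the shifted KNN region masks. All function and variable names here are hypothetical, and superpixel smoothing is omitted.

```python
import numpy as np

def match_region(test_img, knn_img, region_mask):
    """Stand-in for the M-CNN: returns a matching confidence in [0, 1] and a
    (dy, dx) displacement of the region in the test image. Hypothetical stub."""
    return 0.5, (0, 0)

def parse_with_knn(test_img, knn_images, knn_label_maps, num_labels):
    """Fuse semantic regions transferred from the K nearest annotated images."""
    h, w = test_img.shape[:2]
    votes = np.zeros((num_labels, h, w), dtype=np.float32)
    for knn_img, label_map in zip(knn_images, knn_label_maps):
        for label in range(1, num_labels):            # 0 = background
            region_mask = (label_map == label)
            if not region_mask.any():
                continue
            conf, (dy, dx) = match_region(test_img, knn_img, region_mask)
            shifted = np.roll(np.roll(region_mask, dy, axis=0), dx, axis=1)
            votes[label] += conf * shifted             # confidence-weighted vote
    return votes.argmax(axis=0)                        # superpixel smoothing omitted
```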

* This manuscript is the accepted version for CVPR 2015 

Deep Human Parsing with Active Template Regression

Mar 09, 2015
Xiaodan Liang, Si Liu, Xiaohui Shen, Jianchao Yang, Luoqi Liu, Jian Dong, Liang Lin, Shuicheng Yan

In this work, the human parsing task, namely decomposing a human image into semantic fashion/body regions, is formulated as an Active Template Regression (ATR) problem, where the normalized mask of each fashion/body item is expressed as a linear combination of learned mask templates and then morphed into a more precise mask with active shape parameters, including the position, scale and visibility of each semantic region. The mask template coefficients and the active shape parameters together can generate the human parsing results, and are thus called the structure outputs for human parsing. A deep Convolutional Neural Network (CNN) is utilized to build the end-to-end relation between the input human image and the structure outputs. More specifically, the structure outputs are predicted by two separate networks: the first network uses max-pooling and is designed to predict the template coefficients for each label mask, while the second network omits max-pooling to preserve sensitivity to label mask position and accurately predict the active shape parameters. For a new image, the structure outputs of the two networks are fused to generate the probability of each label for each pixel, and superpixel smoothing is finally used to refine the human parsing result. Comprehensive evaluations on a large dataset demonstrate the significant superiority of the ATR framework over other state-of-the-art methods for human parsing. In particular, the F1-score reaches $64.38\%$ with our ATR framework, significantly higher than the $44.76\%$ of the state-of-the-art algorithm.
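
A minimal numeric sketch of how the structure outputs could be turned into a label mask, assuming the template coefficients, position, scale and visibility for one item are already predicted; the array shapes and function name are hypothetical, and the paper's CNN predictors and superpixel smoothing are not modeled.

```python
import numpy as np

def render_item_mask(templates, coeffs, position, scale, visible, image_shape):
    """Reconstruct one fashion/body item mask from ATR-style structure outputs.
    templates: (K, h, w) learned mask templates; coeffs: (K,) combination weights;
    position: (top, left); scale: (sh, sw); visible: bool."""
    if not visible:
        return np.zeros(image_shape, dtype=np.float32)
    # Normalized mask as a linear combination of the learned templates.
    norm_mask = np.tensordot(coeffs, templates, axes=1)          # (h, w)
    # Morph: nearest-neighbour resize to the active scale ...
    th, tw = norm_mask.shape
    oh, ow = int(round(th * scale[0])), int(round(tw * scale[1]))
    rows = (np.arange(oh) * th / oh).astype(int)
    cols = (np.arange(ow) * tw / ow).astype(int)
    resized = norm_mask[rows][:, cols]
    # ... then paste at the predicted position inside the image.
    out = np.zeros(image_shape, dtype=np.float32)
    top, left = position
    oh = min(oh, image_shape[0] - top)            # keep the paste inside the image
    ow = min(ow, image_shape[1] - left)
    out[top:top + oh, left:left + ow] = resized[:oh, :ow]
    return out
```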

* This manuscript is the accepted version for IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 2015 

Recognizing Focal Liver Lesions in Contrast-Enhanced Ultrasound with Discriminatively Trained Spatio-Temporal Model

Feb 03, 2015
Xiaodan Liang, Qingxing Cao, Rui Huang, Liang Lin

The aim of this study is to provide an automatic computational framework to assist clinicians in diagnosing Focal Liver Lesions (FLLs) in Contrast-Enhanced Ultrasound (CEUS). We represent the FLLs in a CEUS video clip as an ensemble of Regions-of-Interest (ROIs), whose locations are modeled as latent variables in a discriminative model. Different types of FLLs are characterized by both the spatial and the temporal enhancement patterns of the ROIs. The model is learned by iteratively inferring the optimal ROI locations and optimizing the model parameters. To efficiently search for the optimal spatial and temporal locations of the ROIs, we propose a data-driven inference algorithm that combines effective spatial and temporal pruning. The experiments show that our method achieves promising results on the largest dataset in the literature (to the best of our knowledge), which we have made publicly available.
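
As an illustration of the alternating learning loop (infer latent ROI locations, then optimize model parameters), here is a heavily simplified sketch that uses scikit-learn's logistic regression in place of the paper's discriminative spatio-temporal model, and precomputed candidate feature vectors in place of the pruned spatio-temporal search; all names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_latent_roi_model(candidate_feats, labels, n_iters=5):
    """candidate_feats: list of (n_candidates_i, d) arrays, one per CEUS clip,
    holding features of candidate spatio-temporal ROI placements;
    labels: one lesion type per clip.
    Alternates latent ROI selection with parameter fitting."""
    chosen = [feats[0] for feats in candidate_feats]        # initial ROI guesses
    clf = LogisticRegression(max_iter=1000)
    for _ in range(n_iters):
        clf.fit(np.stack(chosen), labels)                   # optimize parameters
        for i, feats in enumerate(candidate_feats):         # infer latent ROIs
            cls = list(clf.classes_).index(labels[i])
            probs = clf.predict_proba(feats)[:, cls]
            chosen[i] = feats[int(probs.argmax())]          # best-supporting ROI
    return clf, chosen
```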

* Biomedical Imaging (ISBI), 2014 IEEE 11th International Symposium on, pp. 1184-1187, April 2014  
* 5 pages, 1 figure 

Data-Driven Scene Understanding with Adaptively Retrieved Exemplars

Feb 03, 2015
Xionghao Liu, Wei Yang, Liang Lin, Qing Wang, Zhaoquan Cai, Jianhuang Lai

This article investigates a data-driven approach to semantic scene understanding without pixelwise annotation or classifier training. Our framework parses a target image in two steps: (i) retrieving its exemplars (i.e., references) from an image database in which all images are unsegmented but annotated with tags; (ii) recovering its pixel labels by propagating semantics from the references. We present a novel framework making the two steps mutually conditional and bootstrapped under a probabilistic Expectation-Maximization (EM) formulation. In the first step, the references are selected by jointly matching their appearances with the target as well as the semantics (i.e., the assigned labels of the target and the references). We perform the second step via a combinatorial graphical representation, in which the vertices are superpixels extracted from the target and its selected references. We then derive the potentials of assigning labels to a vertex of the target, which depend upon the graph edges that connect the vertex to its spatial neighbors in the target and to its similar vertices in the references. Moreover, the proposed framework can be naturally applied to image annotation on new test images. In the experiments, we validate our approach on two public databases and demonstrate superior performance over state-of-the-art methods in both the semantic segmentation and image annotation tasks.
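
A rough sketch of the retrieve-then-propagate loop, with the paper's graphical-model potentials replaced by simple similarity-weighted voting between superpixel features; the feature representations, scoring rule, and names are all hypothetical.

```python
import numpy as np

def parse_scene(target_feats, db_feats, db_tags, n_refs=5, n_iters=3):
    """EM-style alternation: references are chosen by appearance plus agreement
    with the current label beliefs, and beliefs are re-estimated by propagating
    tags from similar reference superpixels.
    target_feats: (n, d) superpixel features of the target image;
    db_feats: list of (m_j, d) superpixel features per database image;
    db_tags: list of tag lists per database image."""
    tags = sorted({t for ts in db_tags for t in ts})
    beliefs = np.full((len(target_feats), len(tags)), 1.0 / len(tags))
    for _ in range(n_iters):
        # "Retrieval" step: score database images by appearance + semantics.
        scores = []
        for feats, ts in zip(db_feats, db_tags):
            appearance = -np.linalg.norm(feats.mean(0) - target_feats.mean(0))
            semantics = beliefs[:, [tags.index(t) for t in ts]].mean()
            scores.append(appearance + semantics)
        refs = np.argsort(scores)[-n_refs:]
        # "Propagation" step: soft-vote tags from similar reference superpixels.
        beliefs = np.full_like(beliefs, 1e-6)
        for j in refs:
            for sp in db_feats[j]:
                sim = np.exp(-np.linalg.norm(target_feats - sp, axis=1))
                for t in db_tags[j]:
                    beliefs[:, tags.index(t)] += sim
        beliefs /= beliefs.sum(axis=1, keepdims=True)
    return [tags[k] for k in beliefs.argmax(axis=1)]
```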

* IEEE MultiMedia, vol. PP, no. 99, pp. 1-1, 2015  
* 8 pages, 5 figures 

Incorporating Structural Alternatives and Sharing into Hierarchy for Multiclass Object Recognition and Detection

Feb 03, 2015
Xiaolong Wang, Liang Lin, Lichao Huang, Shuicheng Yan

This paper proposes a reconfigurable model to recognize and detect multiclass (or multiview) objects with large variation in appearance. Compared with well-acknowledged hierarchical models, we study two advanced capabilities of hierarchies for object modeling: (i) "switch" variables (i.e., or-nodes) for specifying alternative compositions, and (ii) local classifiers (i.e., leaf-nodes) shared among different classes. These capabilities enable us to account well for structural variabilities while keeping the model compact. Our model, in the form of an And-Or Graph, comprises four layers: a batch of leaf-nodes with collaborative edges at the bottom for localizing object parts; the or-nodes above them to activate their child leaf-nodes; the and-nodes to classify objects as a whole; and one root-node on top, itself an or-node, for switching among the multiple classes. For model training, we present an EM-type algorithm, namely dynamical structural optimization (DSO), to iteratively determine the structural configuration (e.g., leaf-node generation associated with their parent or-nodes and sharing across other classes) along with optimizing the multi-layer parameters. The proposed method is validated on challenging databases, e.g., PASCAL VOC 2007 and UIUC-People, and achieves state-of-the-art performance.

* Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pp. 3334-3341, 23-28 June 2013  
* 8 pages, 6 figures, CVPR 2013 

Deep Joint Task Learning for Generic Object Extraction

Feb 03, 2015
Xiaolong Wang, Liliang Zhang, Liang Lin, Zhujin Liang, Wangmeng Zuo

This paper investigates how to extract objects of interest without relying on hand-crafted features or sliding-window approaches, aiming to jointly solve two sub-tasks: (i) rapidly localizing salient objects in images, and (ii) accurately segmenting the objects based on the localizations. We present a general joint task learning framework, in which each task (either object localization or object segmentation) is tackled via a multi-layer convolutional neural network, and the two networks work collaboratively to boost performance. In particular, we propose to incorporate latent variables bridging the two networks in a joint optimization manner. The first network directly predicts the positions and scales of salient objects from raw images, and the latent variables adjust the object localizations to feed the second network, which produces pixelwise object masks. An EM-type method is presented for the optimization, iterating over two steps: (i) using the two networks, it estimates the latent variables with an MCMC-based sampling method; (ii) it optimizes the parameters of the two networks jointly via back-propagation, with the latent variables fixed. Extensive experiments suggest that our framework significantly outperforms other state-of-the-art approaches in both accuracy and efficiency (e.g., 1000 times faster than competing approaches).
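
An inference-time sketch of the pipeline described above, with both CNNs replaced by stubs: the first network's output is adjusted by the latent variables and the resulting crop is passed to the second network, whose mask is pasted back into image coordinates. All names are hypothetical; the EM/MCMC learning loop is not shown.

```python
import numpy as np

def localization_net(image):
    """Stand-in for the first CNN: predicts (cx, cy, scale) of the salient object."""
    h, w = image.shape[:2]
    return w // 2, h // 2, 0.5                        # hypothetical stub

def segmentation_net(crop):
    """Stand-in for the second CNN: predicts a pixelwise mask for a crop."""
    return np.ones(crop.shape[:2], dtype=np.float32)  # hypothetical stub

def extract_object(image, latent_offset=(0, 0, 0.0)):
    """Localize, adjust with the latent variables, segment, and paste back."""
    cx, cy, scale = localization_net(image)
    dx, dy, ds = latent_offset                        # latent adjustment (learned via EM/MCMC)
    cx, cy, scale = cx + dx, cy + dy, scale + ds
    h, w = image.shape[:2]
    half = int(0.5 * scale * min(h, w))
    top, left = max(cy - half, 0), max(cx - half, 0)
    bottom, right = min(cy + half, h), min(cx + half, w)
    mask = np.zeros((h, w), dtype=np.float32)
    mask[top:bottom, left:right] = segmentation_net(image[top:bottom, left:right])
    return mask
```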

* Advances in Neural Information Processing Systems (pp. 523-531), 2014  
* 9 pages, 4 figures, NIPS 2014 

Dynamical And-Or Graph Learning for Object Shape Modeling and Detection

Feb 03, 2015
Xiaolong Wang, Liang Lin

This paper studies a novel discriminative part-based model to represent and recognize object shapes with an "And-Or graph". The model consists of three layers: leaf-nodes with collaborative edges for localizing local parts, or-nodes specifying the switch among leaf-nodes, and a root-node encoding the global verification. A discriminative learning algorithm, extended from the CCCP [23], is proposed to train the model in a dynamical manner: the model structure (e.g., the configuration of the leaf-nodes associated with the or-nodes) is automatically determined along with optimizing the multi-layer parameters during the iterations. The advantages of our method are two-fold. (i) The And-Or graph model enables us to handle large intra-class variance and background clutter well for object shape detection in images. (ii) The proposed learning algorithm is able to obtain the And-Or graph representation without requiring elaborate supervision and initialization. We validate the proposed method on several challenging databases (e.g., INRIA-Horse, ETHZ-Shape, and UIUC-People), and it outperforms state-of-the-art approaches.

* Advances in Neural Information Processing Systems (pp. 242-250), 2012  
* 9 pages, 4 figures, NIPS 2012 

Clothing Co-Parsing by Joint Image Segmentation and Labeling

Feb 03, 2015
Wei Yang, Ping Luo, Liang Lin

This paper aims at developing an integrated system for clothing co-parsing, in order to jointly parse a set of clothing images (unsegmented but annotated with tags) into semantic configurations. We propose a data-driven framework consisting of two phases of inference. The first phase, referred to as "image co-segmentation", iterates to extract consistent regions on images and jointly refines the regions over all images by employing the exemplar-SVM (E-SVM) technique [23]. In the second phase (i.e., "region co-labeling"), we construct a multi-image graphical model by taking the segmented regions as vertices, and incorporate several contexts of clothing configuration (e.g., item locations and mutual interactions). The joint label assignment can be solved using the efficient Graph Cuts algorithm. In addition to evaluating our framework on the Fashionista dataset [30], we construct a dataset called CCP consisting of 2098 high-resolution street fashion photos to demonstrate the performance of our system. We achieve 90.29% / 88.23% segmentation accuracy and 65.52% / 63.89% recognition rate on the Fashionista and CCP datasets, respectively, which is superior to state-of-the-art methods.
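
As a small illustration of the E-SVM component used in the co-segmentation phase, here is a sketch of training one exemplar-SVM on a single positive region descriptor against many negatives with scikit-learn; the class weighting and feature choices are hypothetical.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_exemplar_svm(positive_feat, negative_feats, C=1.0):
    """Fit a linear classifier to one positive region descriptor vs. many
    negatives; the lone positive is heavily weighted so it is not swamped."""
    X = np.vstack([positive_feat[None, :], negative_feats])
    y = np.array([1] + [0] * len(negative_feats))
    clf = LinearSVC(C=C, class_weight={1: len(negative_feats), 0: 1})
    clf.fit(X, y)
    return clf

# A trained E-SVM then scores candidate regions in other images; high-scoring
# regions are treated as instances of the same consistent region and refined jointly.
```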

* Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pp. 3182-3189, 23-28 June 2014  
* 8 pages, 5 figures, CVPR 2014 

Learning Contour-Fragment-based Shape Model with And-Or Tree Representation

Feb 03, 2015
Liang Lin, Xiaolong Wang, Wei Yang, Jianhuang Lai

This paper proposes a simple yet effective method to learn a hierarchical object shape model consisting of local contour fragments, which represents a category of shapes in the form of an And-Or tree. This model extends traditional hierarchical tree structures by introducing "switch" variables (i.e., the or-nodes) that explicitly specify production rules to capture shape variations. We thus define the model with three layers: leaf-nodes for detecting local contour fragments, or-nodes specifying the selection of leaf-nodes, and a root-node encoding the holistic distortion. In the training stage, we extend the concave-convex procedure (CCCP) to optimize the And-Or tree by embedding structural clustering in the iterative learning steps. The inference for shape detection is consistent with the model optimization, integrating local tests via the leaf-nodes and or-nodes with global verification via the root-node. The advantages of our approach are validated on challenging shape databases (i.e., ETHZ and INRIA Horse) and summarized as follows. (1) The proposed method is able to accurately localize shape contours against unreliable edge detection and edge tracing. (2) The And-Or tree model enables us to capture the intra-class variance well.
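
A toy sketch of the three-layer scoring scheme: each or-node selects its best-scoring child leaf (a local contour-fragment test) and the root adds a global term; the dictionary-based tree structure and the scores below are hypothetical.

```python
def score_and_or_tree(tree, leaf_scores):
    """Score an And-Or tree: or-nodes take the max over their child leaves,
    the root sums the selected children plus a global verification term."""
    selected = {}
    total = tree.get("root_score", 0.0)
    for or_node, leaves in tree["or_nodes"].items():
        best_leaf = max(leaves, key=lambda leaf: leaf_scores[leaf])
        selected[or_node] = best_leaf
        total += leaf_scores[best_leaf]
    return total, selected

# Example: two or-nodes, each choosing between two candidate contour fragments.
tree = {"root_score": 0.3, "or_nodes": {"head": ["h1", "h2"], "legs": ["l1", "l2"]}}
print(score_and_or_tree(tree, {"h1": 0.9, "h2": 0.2, "l1": 0.1, "l2": 0.7}))
```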

* Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 135-142, 16-21 June 2012  
* 8 pages, 7 figures, CVPR 2012 

Deep Boosting: Layered Feature Mining for General Image Classification

Feb 03, 2015
Zhanglin Peng, Liang Lin, Ruimao Zhang, Jing Xu

Constructing effective representations is a critical but challenging problem in multimedia understanding. Traditional handcrafted features often rely on domain knowledge, limiting the performance of existing methods. This paper discusses a novel computational architecture for general image feature mining, which assembles primitive filters (i.e., Gabor wavelets) into compositional features in a layer-wise manner. In each layer, we produce a number of base classifiers (i.e., regression stumps) associated with the generated features, and discover informative compositions using the boosting algorithm. The output compositional features of each layer are treated as the base components to build up the next layer. Our framework is able to generate expressive image representations while inducing highly discriminative functions for image classification. Experiments are conducted on several public datasets, and we demonstrate superior performance over state-of-the-art approaches.
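
A rough sketch of the layer-wise mining loop, using scikit-learn's AdaBoost with depth-1 trees as stand-ins for the regression-stump base classifiers; the pairwise-product composition rule is a hypothetical substitute for the paper's feature compositions, and the primitive (e.g., Gabor) responses are assumed to be precomputed in X.

```python
import numpy as np
from itertools import combinations
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

def deep_boosting_sketch(X, y, n_layers=2, n_stumps=20):
    """Boost stumps over the current features, keep the features the stumps
    actually split on, and compose them to form the next layer's feature pool."""
    feats, boosters = X, []
    for _ in range(n_layers):
        booster = AdaBoostClassifier(
            estimator=DecisionTreeClassifier(max_depth=1), n_estimators=n_stumps)
        booster.fit(feats, y)
        boosters.append(booster)
        # Indices of features selected by the boosted stumps in this layer.
        used = sorted({int(t.tree_.feature[0]) for t in booster.estimators_
                       if t.tree_.feature[0] >= 0})
        # Hypothetical composition rule: pairwise products of selected features.
        feats = np.stack([feats[:, i] * feats[:, j]
                          for i, j in combinations(used, 2)], axis=1)
    return boosters
```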

* Multimedia and Expo (ICME), 2014 IEEE International Conference on, pp. 1-6, 14-18 July 2014  
* 6 pages, 4 figures, ICME 2014 