Conditional Random Fields (CRF) have been widely used in a variety of computer vision tasks. Conventional CRFs typically define edges on neighboring image pixels, resulting in a sparse graph such that efficient inference can be performed. However, these CRFs fail to model long-range contextual relationships. Fully-connected CRFs have thus been proposed. While there are efficient approximate inference methods for such CRFs, usually they are sensitive to initialization and make strong assumptions. In this work, we develop an efficient, yet general algorithm for inference on fully-connected CRFs. The algorithm is based on a scalable SDP algorithm and the low- rank approximation of the similarity/kernel matrix. The core of the proposed algorithm is a tailored quasi-Newton method that takes advantage of the low-rank matrix approximation when solving the specialized SDP dual problem. Experiments demonstrate that our method can be applied on fully-connected CRFs that cannot be solved previously, such as pixel-level image co-segmentation.
Conditional Random Rields (CRF) have been widely applied in image segmentations. While most studies rely on hand-crafted features, we here propose to exploit a pre-trained large convolutional neural network (CNN) to generate deep features for CRF learning. The deep CNN is trained on the ImageNet dataset and transferred to image segmentations here for constructing potentials of superpixels. Then the CRF parameters are learnt using a structured support vector machine (SSVM). To fully exploit context information in inference, we construct spatially related co-occurrence pairwise potentials and incorporate them into the energy function. This prefers labelling of object pairs that frequently co-occur in a certain spatial layout and at the same time avoids implausible labellings during the inference. Extensive experiments on binary and multi-class segmentation benchmarks demonstrate the promise of the proposed method. We thus provide new baselines for the segmentation performance on the Weizmann horse, Graz-02, MSRC-21, Stanford Background and PASCAL VOC 2011 datasets.
We propose an effective structured learning based approach to the problem of person re-identification which outperforms the current state-of-the-art on most benchmark data sets evaluated. Our framework is built on the basis of multiple low-level hand-crafted and high-level visual features. We then formulate two optimization algorithms, which directly optimize evaluation measures commonly used in person re-identification, also known as the Cumulative Matching Characteristic (CMC) curve. Our new approach is practical to many real-world surveillance applications as the re-identification performance can be concentrated in the range of most practical importance. The combination of these factors leads to a person re-identification system which outperforms most existing algorithms. More importantly, we advance state-of-the-art results on person re-identification by improving the rank-$1$ recognition rates from $40\%$ to $50\%$ on the iLIDS benchmark, $16\%$ to $18\%$ on the PRID2011 benchmark, $43\%$ to $46\%$ on the VIPeR benchmark, $34\%$ to $53\%$ on the CUHK01 benchmark and $21\%$ to $62\%$ on the CUHK03 benchmark.
Embedding image features into a binary Hamming space can improve both the speed and accuracy of large-scale query-by-example image retrieval systems. Supervised hashing aims to map the original features to compact binary codes in a manner which preserves the label-based similarities of the original data. Most existing approaches apply a single form of hash function, and an optimization process which is typically deeply coupled to this specific form. This tight coupling restricts the flexibility of those methods, and can result in complex optimization problems that are difficult to solve. In this work we proffer a flexible yet simple framework that is able to accommodate different types of loss functions and hash functions. The proposed framework allows a number of existing approaches to hashing to be placed in context, and simplifies the development of new problem-specific hashing methods. Our framework decomposes the into two steps: binary code (hash bits) learning, and hash function learning. The first step can typically be formulated as a binary quadratic problem, and the second step can be accomplished by training standard binary classifiers. For solving large-scale binary code inference, we show how to ensure that the binary quadratic problems are submodular such that an efficient graph cut approach can be used. To achieve efficiency as well as efficacy on large-scale high-dimensional data, we propose to use boosted decision trees as the hash functions, which are nonlinear, highly descriptive, and very fast to train and evaluate. Experiments demonstrate that our proposed method significantly outperforms most state-of-the-art methods, especially on high-dimensional data.
We consider the problem of depth estimation from a single monocular image in this work. It is a challenging task as no reliable depth cues are available, e.g., stereo correspondences, motions, etc. Previous efforts have been focusing on exploiting geometric priors or additional sources of information, with all using hand-crafted features. Recently, there is mounting evidence that features from deep convolutional neural networks (CNN) are setting new records for various vision applications. On the other hand, considering the continuous characteristic of the depth values, depth estimations can be naturally formulated into a continuous conditional random field (CRF) learning problem. Therefore, we in this paper present a deep convolutional neural field model for estimating depths from a single image, aiming to jointly explore the capacity of deep CNN and continuous CRF. Specifically, we propose a deep structured learning scheme which learns the unary and pairwise potentials of continuous CRF in a unified deep CNN framework. The proposed method can be used for depth estimations of general scenes with no geometric priors nor any extra information injected. In our case, the integral of the partition function can be analytically calculated, thus we can exactly solve the log-likelihood optimization. Moreover, solving the MAP problem for predicting depths of a new image is highly efficient as closed-form solutions exist. We experimentally demonstrate that the proposed method outperforms state-of-the-art depth estimation methods on both indoor and outdoor scene datasets.
Learning based hashing methods have attracted considerable attention due to their ability to greatly increase the scale at which existing algorithms may operate. Most of these methods are designed to generate binary codes preserving the Euclidean similarity in the original space. Manifold learning techniques, in contrast, are better able to model the intrinsic structure embedded in the original high-dimensional data. The complexities of these models, and the problems with out-of-sample data, have previously rendered them unsuitable for application to large-scale embedding, however. In this work, how to learn compact binary embeddings on their intrinsic manifolds is considered. In order to address the above-mentioned difficulties, an efficient, inductive solution to the out-of-sample data problem, and a process by which non-parametric manifold learning may be used as the basis of a hashing method is proposed. The proposed approach thus allows the development of a range of new hashing techniques exploiting the flexibility of the wide variety of manifold learning approaches available. It is particularly shown that hashing on the basis of t-SNE outperforms state-of-the-art hashing methods on large-scale benchmark datasets, and is very effective for image classification with very short code lengths. The proposed hashing framework is shown to be easily improved, for example, by minimizing the quantization error with learned orthogonal rotations. In addition, a supervised inductive manifold hashing framework is developed by incorporating the label information, which is shown to greatly advance the semantic retrieval performance.
A number of recent studies have shown that a Deep Convolutional Neural Network (DCNN) pretrained on a large dataset can be adopted as a universal image description which leads to astounding performance in many visual classification tasks. Most of these studies, if not all, adopt activations of the fully-connected layer of a DCNN as the image or region representation and it is believed that convolutional layer activations are less discriminative. This paper, however, advocates that if used appropriately convolutional layer activations can be turned into a powerful image representation which enjoys many advantages over fully-connected layer activations. This is achieved by adopting a new technique proposed in this paper called cross-convolutional-layer pooling. More specifically, it extracts subarrays of feature maps of one convolutional layer as local features and pools the extracted features with the guidance of feature maps of the successive convolutional layer. Compared with exising methods that apply DCNNs in the local feature setting, the proposed method is significantly faster since it requires much fewer times of DCNN forward computation. Moreover, it avoids the domain mismatch issue which is usually encountered when applying fully connected layer activations to describe local regions. By applying our method to four popular visual classification tasks, it is demonstrated that the proposed method can achieve comparable or in some cases significantly better performance than existing fully-connected layer based image representations while incurring much lower computational cost.
In this paper, we propose an efficient semidefinite programming (SDP) approach to worst-case linear discriminant analysis (WLDA). Compared with the traditional LDA, WLDA considers the dimensionality reduction problem from the worst-case viewpoint, which is in general more robust for classification. However, the original problem of WLDA is non-convex and difficult to optimize. In this paper, we reformulate the optimization problem of WLDA into a sequence of semidefinite feasibility problems. To efficiently solve the semidefinite feasibility problems, we design a new scalable optimization method with quasi-Newton methods and eigen-decomposition being the core components. The proposed method is orders of magnitude faster than standard interior-point based SDP solvers. Experiments on a variety of classification problems demonstrate that our approach achieves better performance than standard LDA. Our method is also much faster and more scalable than standard interior-point SDP solvers based WLDA. The computational complexity for an SDP with $m$ constraints and matrices of size $d$ by $d$ is roughly reduced from $\mathcal{O}(m^3+md^3+m^2d^2)$ to $\mathcal{O}(d^3)$ ($m>d$ in our case).
Deriving from the gradient vector of a generative model of local features, Fisher vector coding (FVC) has been identified as an effective coding method for image classification. Most, if not all, % FVC implementations employ the Gaussian mixture model (GMM) to characterize the generation process of local features. This choice has shown to be sufficient for traditional low dimensional local features, e.g., SIFT; and typically, good performance can be achieved with only a few hundred Gaussian distributions. However, the same number of Gaussians is insufficient to model the feature space spanned by higher dimensional local features, which have become popular recently. In order to improve the modeling capacity for high dimensional features, it turns out to be inefficient and computationally impractical to simply increase the number of Gaussians. In this paper, we propose a model in which each local feature is drawn from a Gaussian distribution whose mean vector is sampled from a subspace. With certain approximation, this model can be converted to a sparse coding procedure and the learning/inference problems can be readily solved by standard sparse coding methods. By calculating the gradient vector of the proposed model, we derive a new fisher vector encoding strategy, termed Sparse Coding based Fisher Vector Coding (SCFVC). Moreover, we adopt the recently developed Deep Convolutional Neural Network (CNN) descriptor as a high dimensional local feature and implement image classification with the proposed SCFVC. Our experimental evaluations demonstrate that our method not only significantly outperforms the traditional GMM based Fisher vector encoding but also achieves the state-of-the-art performance in generic object recognition, indoor scene, and fine-grained image classification problems.
Automatic face recognition has received significant performance improvement by developing specialised facial image representations. On the other hand, generic object recognition has rarely been applied to the face recognition. Spatial pyramid pooling of features encoded by an over-complete dictionary has been the key component of many state-of-the-art image classification systems. Inspired by its success, in this work we develop a new face image representation method inspired by the second-order pooling in Carreira et al. [1], which was originally proposed for image segmentation. The proposed method differs from the previous methods in that, we encode the densely extracted local patches by a small-size dictionary; and the facial image signatures are obtained by pooling the second-order statistics of the encoded features. We show the importance of pooling on encoded features, which is bypassed by the original second-order pooling method to avoid the high computational cost. Equipped with a simple linear classifier, the proposed method outperforms the state-of-the-art face identification performance by large margins. For example, on the LFW databases, the proposed method performs better than the previous best by around 13% accuracy.