We propose maximum likelihood estimation for learning Gaussian graphical models with a Gaussian (ℓ₂²) prior on the parameters. This is in contrast to the commonly used Laplace (ℓ₁) prior for encouraging sparseness. We show that our optimization problem leads to a Riccati matrix equation, which has a closed-form solution. We propose an efficient algorithm that performs a singular value decomposition of the training data. Our algorithm is O(NT^2)-time and O(NT)-space for N variables and T samples. Our method is tailored to high-dimensional problems (N ≫ T), in which sparseness-promoting methods become intractable. Furthermore, instead of obtaining a single solution for a specific regularization parameter, our algorithm finds the whole solution path. We show that the method has logarithmic sample complexity under the spiked covariance model. We also propose sparsification of the dense solution with provable performance guarantees. We provide techniques for using the learnt models, such as removing unimportant variables and computing likelihoods and conditional distributions. Finally, we show promising results on several gene expression datasets.
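A rough sketch of the closed-form step described above. It assumes the regularized objective is the negative Gaussian log-likelihood plus a ridge penalty (λ/2)·||Θ||_F² on the precision matrix Θ, whose stationary condition λΘ² + SΘ − I = 0 is the Riccati equation mentioned; the function name and interface are hypothetical.

```python
import numpy as np

def ridge_precision(X, lam):
    """Closed-form precision matrix under a Gaussian (ridge) prior.

    X   : (N, T) centered data matrix (N variables, T samples).
    lam : weight of the (lam/2) * ||Theta||_F^2 penalty.

    Uses the thin SVD of X, so the cost is O(N T^2) time rather than
    eigendecomposing the N x N sample covariance directly.
    """
    N, T = X.shape
    # Sample covariance S = X X^T / T has at most T nonzero eigenvalues
    # d = s**2 / T, with eigenvectors given by the left singular vectors U.
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    d = s ** 2 / T
    # Per-eigenvalue stationary condition: lam * theta^2 + d * theta - 1 = 0;
    # keep the positive root so Theta stays positive definite.
    theta = (-d + np.sqrt(d ** 2 + 4.0 * lam)) / (2.0 * lam)
    theta0 = 1.0 / np.sqrt(lam)  # eigenvalue of Theta on the null space of S
    # Theta = U diag(theta - theta0) U^T + theta0 * I; formed densely here only
    # for illustration, since the factored form needs just O(NT) space.
    return U @ np.diag(theta - theta0) @ U.T + theta0 * np.eye(N)
```

Because only the singular values of the data enter the per-eigenvalue quadratic, re-solving for a different λ reuses the same SVD, which is what makes tracing the whole regularization path inexpensive.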
We present deterministic techniques for computing upper and lower bounds on marginal probabilities in sigmoid and noisy-OR networks. These techniques become useful when the size of the network (or clique size) precludes exact computations. We illustrate the tightness of the bounds by numerical experiments.
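The network-level bounds are built from pointwise variational bounds on the sigmoid. As an assumed illustration of such node-level bounds (not the full propagation scheme), the sketch below shows the standard quadratic lower bound, tight at x = ±ξ, and the conjugate-dual upper bound, tight at η = σ(−x).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_lower(x, xi):
    # Exponential-of-quadratic lower bound on sigmoid(x), tight at x = +/- xi.
    lam = np.tanh(xi / 2.0) / (4.0 * xi)
    return sigmoid(xi) * np.exp((x - xi) / 2.0 - lam * (x ** 2 - xi ** 2))

def sigmoid_upper(x, eta):
    # Upper bound from the conjugate dual of log(1 + exp(-x));
    # eta in (0, 1), tight when eta = sigmoid(-x).
    H = -eta * np.log(eta) - (1.0 - eta) * np.log(1.0 - eta)  # binary entropy
    return np.exp(eta * x - H)

x = 1.3
print(sigmoid_lower(x, xi=1.0), sigmoid(x), sigmoid_upper(x, eta=0.25))
# The three printed values should come out in increasing order.
```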
In this paper we present decomposable priors, a family of priors over the structure and parameters of tree belief nets for which Bayesian learning with complete observations is tractable, in the sense that the posterior is also decomposable and can be determined analytically in polynomial time. This follows from two main results. First, we show that factored distributions over spanning trees in a graph can be integrated in closed form. Second, we examine priors over tree parameters and show that a set of assumptions similar to those of (Heckerman et al., 1995) constrains the tree parameter priors to be a compactly parameterized product of Dirichlet distributions. Besides allowing for exact Bayesian learning, these results permit us to formulate a new class of tractable latent variable models in which the likelihood of a data point is computed through an ensemble average over tree structures.
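The first result, closed-form integration of factored distributions over spanning trees, rests on the weighted Matrix-Tree theorem: the sum over all spanning trees of the product of edge weights equals any cofactor of the weighted graph Laplacian. A minimal sketch (helper name hypothetical):

```python
import numpy as np

def sum_over_spanning_trees(W):
    """Return sum_T prod_{(u,v) in T} W[u, v] over all spanning trees,
    where W is a symmetric nonnegative edge-weight matrix (diagonal ignored),
    via the weighted Matrix-Tree theorem."""
    W = np.array(W, dtype=float)
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W     # weighted graph Laplacian
    return np.linalg.det(L[1:, 1:])    # any cofactor of L gives the sum

# Sanity check: with unit weights on the complete graph K_4 the sum counts
# spanning trees, and Cayley's formula gives 4^(4-2) = 16.
print(sum_over_spanning_trees(np.ones((4, 4))))   # ~16.0
```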
Incorporating feature selection into a classification or regression method often brings a number of advantages. In this paper we formalize feature selection specifically from a discriminative perspective of improving classification/regression accuracy. The feature selection method is developed as an extension to the recently proposed maximum entropy discrimination (MED) framework. We describe MED as a flexible (Bayesian) regularization approach that subsumes, e.g., support vector classification, regression, and exponential family models. For brevity, we restrict ourselves primarily to feature selection in the context of linear classification/regression methods and demonstrate that the proposed approach indeed yields substantial improvements in practice. Moreover, we discuss and develop various extensions of feature selection, including the problem of dealing with example-specific but unobserved degrees of freedom, such as alignments or invariants.
Bounds on the log partition function are important in a variety of contexts, including approximate inference, model fitting, decision theory, and large deviations analysis. We introduce a new class of upper bounds on the log partition function, based on convex combinations of distributions in the exponential domain, that is applicable to an arbitrary undirected graphical model. In the special case of convex combinations of tree-structured distributions, we obtain a family of variational problems, similar to the Bethe free energy, but distinguished by the following desirable properties: (i) they are convex and have a unique global minimum; and (ii) the global minimum gives an upper bound on the log partition function. The global minimum is defined by stationary conditions very similar to those defining fixed points of belief propagation or tree-based reparameterization (Wainwright et al., 2001). As with BP fixed points, the elements of the minimizing argument can be used as approximations to the marginals of the original model. The analysis described here can be extended to structures of higher treewidth (e.g., hypertrees), thereby making connections with more advanced approximations (e.g., Kikuchi and variants; Yedidia et al., 2001; Minka, 2001).
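A toy check of the underlying idea on a binary pairwise model over a 3-cycle (an assumed setup, not taken from the paper): writing the exponential parameters as a convex combination of tree-structured parameter vectors and applying Jensen's inequality to the convex log partition function yields an upper bound.

```python
import itertools
import numpy as np

def log_Z(theta_edges, n=3):
    """Brute-force log partition function of a binary pairwise model with
    x_i in {-1, +1} and score sum of theta_e * x_i * x_j over edges."""
    vals = [sum(th * x[i] * x[j] for (i, j), th in theta_edges.items())
            for x in itertools.product([-1, 1], repeat=n)]
    m = max(vals)
    return m + np.log(sum(np.exp(v - m) for v in vals))

theta = {(0, 1): 0.5, (1, 2): 0.5, (0, 2): 0.5}   # edge parameters on the 3-cycle

# Convex combination of the three spanning trees (each drops one edge) with
# uniform weights rho = 1/3; every edge appears with probability 2/3, so its
# parameter is rescaled by 3/2 on the trees containing it, which makes the
# combination average back to theta.
rho = 1.0 / 3.0
bound = sum(rho * log_Z({e: th / (2.0 / 3.0)
                         for e, th in theta.items() if e != dropped})
            for dropped in theta)

print("exact log Z:", log_Z(theta))   # smaller
print("tree bound :", bound)          # upper bound by convexity of log Z
```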
Active learning is a powerful approach to analyzing data effectively. We show that the feasibility of active learning depends crucially on the choice of measure with respect to which the query is being optimized. The standard information gain, for example, does not permit an accurate evaluation with a small committee, a representative subset of the model space. We propose a surrogate measure requiring only a small committee and discuss the properties of this new measure. We devise, in addition, a bootstrap approach for committee selection. The advantages of this approach are illustrated in the context of recovering (regulatory) network models.
A number of modern learning tasks involve estimation from heterogeneous information sources. This includes classification with labeled and unlabeled data as well as other problems with analogous structure, such as competitive (game-theoretic) problems. The associated estimation problems can typically be reduced to solving a set of fixed point equations (consistency conditions). We introduce a general method for combining a preferred information source with another in this setting by evolving continuous paths of fixed points at intermediate allocations. We explicitly identify critical points along the unique paths, either to increase the stability of estimation or to ensure a significant departure from the initial source. The homotopy continuation approach is guaranteed to terminate at the second source and involves no combinatorial effort. We illustrate the power of these ideas both in classification tasks with labeled and unlabeled data and in the context of a competitive (min-max) formulation of DNA sequence motif discovery.
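A generic (not paper-specific) illustration of evolving fixed points along a continuation path: the sketch below blends two one-dimensional consistency maps, sweeps the allocation λ from the preferred source (λ = 0) to the second source (λ = 1), and warm-starts each fixed-point solve from the previous solution. The maps and step schedule are hypothetical, and detecting critical points where the path loses stability is omitted.

```python
import numpy as np

def follow_fixed_point_path(f0, f1, x0, steps=100, iters=500, tol=1e-10):
    """Track the fixed point of f_lam(x) = (1 - lam) * f0(x) + lam * f1(x)
    as lam sweeps from 0 to 1, warm-starting from the previous solution."""
    x, path = x0, []
    for lam in np.linspace(0.0, 1.0, steps + 1):
        for _ in range(iters):   # plain fixed-point iteration
            x_new = (1.0 - lam) * f0(x) + lam * f1(x)
            if abs(x_new - x) < tol:
                break
            x = x_new
        path.append((lam, x))
    return path

# Toy consistency conditions for the two sources (both contracting maps).
f0 = lambda x: 0.5 * np.tanh(x) + 0.2
f1 = lambda x: 0.5 * np.tanh(x - 1.0) - 0.3
path = follow_fixed_point_path(f0, f1, x0=0.0)
print(path[0], path[-1])   # endpoints: fixed points of f0 and of f1
```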
We formulate a principle for classification with knowledge of the marginal distribution over the data points (unlabeled data). The principle is cast in terms of Tikhonov-style regularization, where the regularization penalty articulates the way in which the marginal density should constrain otherwise unrestricted conditional distributions. Specifically, the penalty discourages any information introduced between the examples and labels beyond what is provided by the available labeled examples. The work extends Szummer and Jaakkola's information regularization (NIPS 2002) to multiple dimensions, providing a regularizer that is independent of the covering of the space used in the derivation. We also show how the information regularizer can be used as a measure of complexity of the classification task with unlabeled data and prove a relevant sample-complexity bound. We illustrate the regularization principle in practice by restricting the class of conditional distributions to logistic regression models and constructing the regularization penalty from a finite set of unlabeled examples.
Inference problems in graphical models are often approximated by casting them as constrained optimization problems. Message passing algorithms, such as belief propagation, have previously been suggested as methods for solving these optimization problems. However, such algorithms have few convergence guarantees and are therefore not guaranteed to solve the corresponding optimization problem. Here we present an oriented tree decomposition algorithm that is guaranteed to converge to the global optimum of the Tree-Reweighted (TRW) variational problem. Our algorithm performs local updates in the convex dual of the TRW problem, an unconstrained generalized geometric program. Primal updates, also local, correspond to oriented reparametrization operations that leave the distribution intact.
Linear Programming (LP) relaxations have become powerful tools for finding the most probable (MAP) configuration in graphical models. These relaxations can be solved efficiently using message-passing algorithms such as belief propagation and, when the relaxation is tight, provably find the MAP configuration. The standard LP relaxation is not tight enough in many real-world problems, however, and this has led to the use of higher-order cluster-based LP relaxations. The computational cost increases exponentially with the size of the clusters, which limits the number and type of clusters we can use. We propose to solve the cluster selection problem monotonically in the dual LP, iteratively selecting clusters with guaranteed improvement and quickly re-solving with the added clusters by reusing the existing solution. Our dual message-passing algorithm finds the MAP configuration in protein side-chain placement, protein design, and stereo problems in cases where the standard LP relaxation fails.