Abstract: This note provides a family of classification problems, indexed by a positive integer $k$, where all shallow networks with fewer than exponentially (in $k$) many nodes exhibit error at least $1/6$, whereas a deep network with 2 nodes in each of $2k$ layers achieves zero error, as does a recurrent network with 3 distinct nodes iterated $k$ times. The proof is elementary, and the networks are standard feedforward networks with ReLU (Rectified Linear Unit) nonlinearities.
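As a point of reference, here is a minimal NumPy sketch of the kind of construction at play (the exact weights below are illustrative, not necessarily those of the note): two ReLU nodes per layer implement a "tent" map on $[0,1]$, and composing that layer $k$ times produces a function oscillating $2^{k-1}$ times, the sort of rapid oscillation a small shallow network cannot reproduce.

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def tent_layer(x):
    # Two ReLU nodes plus a fixed linear combination: on [0,1] this is
    # the tent map, 2x for x <= 1/2 and 2(1-x) for x >= 1/2.
    return relu(2.0 * x) - relu(4.0 * x - 2.0)

def deep_net(x, k):
    # k-fold composition: a sawtooth with 2^(k-1) peaks; thresholding it
    # at 1/2 yields rapidly alternating labels.
    for _ in range(k):
        x = tent_layer(x)
    return x

print(deep_net(np.linspace(0.0, 1.0, 9), 3))  # alternates between 0 and 1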
Abstract: This paper proves, in very general settings, that convex risk minimization is a procedure to select a unique conditional probability model determined by the classification problem. Unlike most previous work, we give results that are general enough to include cases in which no minimum exists, as typically occurs, for instance, with standard boosting algorithms. Concretely, we first show that any sequence of predictors minimizing convex risk over the source distribution will converge to this unique model when the class of predictors is linear (but potentially of infinite dimension). Secondly, we show the same result holds for \emph{empirical} risk minimization whenever this class of predictors is finite dimensional, where the essential technical contribution is a norm-free generalization bound.
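As a concrete instance of the model being selected (taking the logistic loss, one example from the class of losses such results cover), minimizing the conditional risk pointwise recovers the conditional probability $\eta(x) = \Pr[Y = +1 \mid X = x]$ through its log-odds:
\[
\operatorname*{arg\,min}_{z \in \mathbb{R}} \;\Big[ \eta(x)\,\ln\!\big(1 + e^{-z}\big) + \big(1-\eta(x)\big)\,\ln\!\big(1 + e^{z}\big) \Big] \;=\; \ln\frac{\eta(x)}{1-\eta(x)},
\]
and when $\eta(x) \in \{0,1\}$ the infimum is approached only as $z \to \pm\infty$, which is exactly the sense in which no minimum need exist.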
Abstract: This work considers a computationally and statistically efficient parameter estimation method for a wide class of latent variable models---including Gaussian mixture models, hidden Markov models, and latent Dirichlet allocation---which exploits a certain tensor structure in their low-order observable moments (typically, of second- and third-order). Specifically, parameter estimation is reduced to the problem of extracting a certain (orthogonal) decomposition of a symmetric tensor derived from the moments; this decomposition can be viewed as a natural generalization of the singular value decomposition for matrices. Although tensor decompositions are generally intractable to compute, the decomposition of these specially structured tensors can be efficiently obtained by a variety of approaches, including power iterations and maximization approaches (similar to the case of matrices). A detailed analysis of a robust tensor power method is provided, establishing an analogue of Wedin's perturbation theorem for the singular vectors of matrices. This implies a robust and computationally tractable estimation approach for several popular latent variable models.
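For orientation, a minimal NumPy sketch of the plain (non-robust) tensor power iteration on a symmetric third-order tensor: repeatedly apply $v \mapsto T(I, v, v)$ and renormalize. In practice one uses several random restarts, keeps the best, and deflates to recover the remaining components; none of the robustness machinery analyzed in the paper is reproduced here.

import numpy as np

def tensor_power_iteration(T, iters=100, seed=0):
    # T: symmetric d x d x d tensor with an (approximately) orthogonal decomposition.
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(T.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = np.einsum('ijk,j,k->i', T, v, v)    # the map T(I, v, v)
        v = w / np.linalg.norm(w)
    lam = np.einsum('ijk,i,j,k->', T, v, v, v)  # T(v, v, v), the recovered eigenvalue
    return lam, v

# Toy example: T is a sum of lam_i * (e_i tensor e_i tensor e_i) over the coordinate basis.
d, lams = 3, np.array([5.0, 2.0, 1.0])
T = np.zeros((d, d, d))
for i in range(d):
    e = np.eye(d)[i]
    T += lams[i] * np.einsum('i,j,k->ijk', e, e, e)
print(tensor_power_iteration(T))  # converges to one (lam_i, e_i) pair, depending on the start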
Abstract: Can we effectively learn a nonlinear representation in time comparable to linear learning? We describe a new algorithm that explicitly and adaptively expands higher-order interaction features over base linear representations. The algorithm is designed for extreme computational efficiency, and an extensive experimental study shows that its computation/prediction tradeoff compares very favorably against strong baselines.
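The snippet below is only a schematic guess at the flavor of such an expansion, not the algorithm from the paper: after a linear pass, the features with the largest learned weights (`top_s` is a made-up parameter) are crossed with all base features, and learning continues on the augmented representation.

import numpy as np

def expand_interactions(X, w, top_s=3):
    # Hypothetical expansion step: pick the top_s base features by |weight|
    # and append their elementwise products with every base feature.
    idx = np.argsort(-np.abs(w))[:top_s]
    crosses = [X * X[:, [j]] for j in idx]
    return np.hstack([X] + crosses)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
w = rng.standard_normal(10)             # stand-in for weights from a cheap linear fit
print(expand_interactions(X, w).shape)  # (100, 10 + 3 * 10)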
Abstract: Suppose $k$ centers are fit to $m$ points by heuristically minimizing the $k$-means cost; what is the corresponding fit over the source distribution? This question is resolved here for distributions with $p\geq 4$ bounded moments; in particular, the difference between the sample cost and distribution cost decays with $m$ and $p$ as $m^{\min\{-1/4, -1/2+2/p\}}$. The essential technical contribution is a mechanism to uniformly control deviations in the face of unbounded parameter sets, cost functions, and source distributions. To further demonstrate this mechanism, a soft clustering variant of $k$-means cost is also considered, namely the log likelihood of a Gaussian mixture, subject to the constraint that all covariance matrices have bounded spectrum. Lastly, a rate with refined constants is provided for $k$-means instances possessing some cluster structure.
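Reading off the stated exponent,
\[
\min\{-1/4,\; -1/2 + 2/p\} \;=\;
\begin{cases}
-1/4 & \text{if } 4 \le p \le 8,\\
-1/2 + 2/p & \text{if } p \ge 8,
\end{cases}
\]
so four bounded moments already give an $m^{-1/4}$ rate, which improves toward $m^{-1/2}$ as more moments become available.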
Abstract: This manuscript provides optimization guarantees, generalization bounds, and statistical consistency results for AdaBoost variants which replace the exponential loss with the logistic and similar losses (specifically, twice differentiable convex losses which are Lipschitz and tend to zero on one side). The heart of the analysis is to show that, in lieu of explicit regularization and constraints, the structure of the problem is fairly rigidly controlled by the source distribution itself. The first control of this type is in the separable case, where a distribution-dependent relaxed weak learning rate induces speedy convergence with high probability over any sample. Otherwise, in the nonseparable case, the convex surrogate risk itself exhibits distribution-dependent levels of curvature, and consequently the algorithm's output has small norm with high probability.
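For concreteness, a minimal sketch of the algorithm family in question, namely boosting viewed as coordinate descent on the empirical logistic risk over a fixed matrix of weak-learner predictions; the constant step `eta` below is a placeholder for the line searches actually analyzed, and all names are illustrative.

import numpy as np

def logistic_boost(A, rounds=200, eta=0.5):
    # A[i, j] = y_i * h_j(x_i): label-scaled prediction of weak learner j on example i.
    m, n = A.shape
    lam = np.zeros(n)                          # combination weights over weak learners
    for _ in range(rounds):
        w = 1.0 / (1.0 + np.exp(A @ lam))      # per-example weights from the logistic loss
        g = -(A.T @ w) / m                     # gradient of the empirical logistic risk
        j = np.argmax(np.abs(g))               # greedily choose the best weak learner (coordinate)
        lam[j] -= eta * g[j]                   # fixed small step in place of a line search
    return lam

rng = np.random.default_rng(0)
A = rng.choice([-1.0, 1.0], size=(50, 8))
lam = logistic_boost(A)
print(np.mean(np.log1p(np.exp(-(A @ lam)))))   # empirical logistic risk after boosting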
Abstract: This manuscript shows that AdaBoost and its immediate variants can produce approximate maximum margin classifiers simply by scaling step size choices with a fixed small constant. In this way, when the unscaled step size is an optimal choice, these results provide guarantees for Friedman's empirically successful "shrinkage" procedure for gradient boosting (Friedman, 2000). Guarantees are also provided for a variety of other step sizes, affirming the intuition that increasingly regularized line searches provide improved margin guarantees. The results hold for the exponential loss and similar losses, most notably the logistic loss.
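In code, the shrinkage modification is just a fixed factor `nu` applied to whatever step the unscaled method would have taken; a small sketch (reusing a label-scaled prediction matrix `A` and weight vector `lam` as in the previous snippet, with `scipy` used only for the one-dimensional line search):

import numpy as np
from scipy.optimize import minimize_scalar

def shrunk_update(A, lam, j, nu=0.1):
    # Line search along coordinate j, then take only a nu-fraction of that step.
    risk = lambda a: np.mean(np.log1p(np.exp(-(A @ lam + a * A[:, j]))))
    alpha = minimize_scalar(risk).x
    out = lam.copy()
    out[j] += nu * alpha
    return out

rng = np.random.default_rng(0)
A = rng.choice([-1.0, 1.0], size=(50, 8))
print(shrunk_update(A, np.zeros(8), j=0))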
Abstract: This note provides an elementary proof of the folklore fact that draws from a Dirichlet distribution (with parameters less than 1) are typically sparse (most coordinates are small).
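A quick numerical illustration (the dimension, parameter, and threshold below are arbitrary choices): a draw from a symmetric Dirichlet with parameter well below 1 concentrates nearly all of its mass on a small number of coordinates.

import numpy as np

rng = np.random.default_rng(0)
k, alpha, eps = 1000, 0.01, 1e-3
draw = rng.dirichlet(alpha * np.ones(k))
print(np.sum(draw > eps))           # number of "large" coordinates: a small fraction of k
print(np.sum(draw[draw > eps]))     # ...which nevertheless hold almost all of the mass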
Abstract: This manuscript develops the theory of agglomerative clustering with Bregman divergences. Geometric smoothing techniques are developed to deal with degenerate clusters. To allow for cluster models based on exponential families with overcomplete representations, Bregman divergences are developed for nondifferentiable convex functions.
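For reference, the basic object under its standard differentiable definition (the nondifferentiable extension developed in the manuscript is not shown): the Bregman divergence of a convex $f$ is $D_f(x, y) = f(x) - f(y) - \langle \nabla f(y), x - y \rangle$; for example, negative entropy yields the KL divergence on the simplex.

import numpy as np

def bregman_divergence(f, grad_f, x, y):
    # D_f(x, y) = f(x) - f(y) - <grad f(y), x - y>, for differentiable convex f.
    return f(x) - f(y) - np.dot(grad_f(y), x - y)

neg_entropy = lambda x: np.sum(x * np.log(x))       # f(x) = sum_i x_i log x_i
neg_entropy_grad = lambda x: np.log(x) + 1.0

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])
print(bregman_divergence(neg_entropy, neg_entropy_grad, x, y))
print(np.sum(x * np.log(x / y)))                    # agrees: KL(x || y)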
Abstract: This manuscript studies statistical properties of linear classifiers obtained through minimization of an unregularized convex risk over a finite sample. Although the results are explicitly finite-dimensional, inputs may be passed through feature maps; in this way, in addition to treating the consistency of logistic regression, this analysis also handles boosting over a finite weak learning class with, for instance, the exponential, logistic, and hinge losses. In this finite-dimensional setting, it is still possible to fit arbitrary decision boundaries: scaling the complexity of the weak learning class with the sample size leads to the optimal classification risk almost surely.
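A schematic of this setup (the stump feature map, the plain gradient descent, and all constants below are illustrative stand-ins, not the manuscript's choices): map inputs through a finite weak learning class, minimize the unregularized empirical logistic risk over linear combinations, and enlarge the class as the sample grows.

import numpy as np

def stump_features(x, thresholds):
    # Finite weak learning class: one +/-1 decision stump per threshold.
    return np.where(x[:, None] > thresholds[None, :], 1.0, -1.0)

def fit_unregularized_logistic(F, y, steps=2000, lr=0.1):
    # Plain gradient descent on the unregularized empirical logistic risk.
    w = np.zeros(F.shape[1])
    for _ in range(steps):
        z = np.minimum(y * (F @ w), 50.0)            # clip only to avoid overflow in exp
        w -= lr * (-(F.T @ (y / (1.0 + np.exp(z)))) / len(y))
    return w

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=500)
y = np.where(np.sin(4.0 * x) > 0.0, 1.0, -1.0)       # a boundary no single stump can fit
F = stump_features(x, np.linspace(-1.0, 1.0, 32))    # class size scaled up with the sample
w = fit_unregularized_logistic(F, y)
print(np.mean(np.sign(F @ w) == y))                  # training accuracy of the combination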