Michigan State University
Abstract:We study non-smooth optimization problems in machine learning, where both the loss function and the regularizer are non-smooth functions. Previous studies on efficient empirical loss minimization assume either a smooth loss function or a strongly convex regularizer, making them unsuitable for non-smooth optimization. We develop a simple yet efficient method for a family of non-smooth optimization problems where the dual form of the loss function is bilinear in primal and dual variables. We cast a non-smooth optimization problem into a minimax optimization problem, and develop a primal dual prox method that solves the minimax optimization problem at a rate of $O(1/T)$ (assuming that the proximal step can be efficiently solved), significantly faster than a standard subgradient descent method that has an $O(1/\sqrt{T})$ convergence rate. Our empirical study verifies the efficiency of the proposed method for various non-smooth optimization problems that arise ubiquitously in machine learning by comparing it to state-of-the-art first-order methods.
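As an illustration of a loss whose dual form is bilinear in the primal and dual variables, consider the hinge loss with an $\ell_1$ regularizer. Both choices are assumptions made here for concreteness and are not named in the abstract; $w$ is the primal variable, $\alpha$ the dual variable, and $\lambda$ a hypothetical regularization weight.

```latex
\[
\ell(w; x_i, y_i) = \max\bigl(0,\, 1 - y_i w^\top x_i\bigr)
                  = \max_{\alpha_i \in [0,1]} \alpha_i \bigl(1 - y_i w^\top x_i\bigr),
\]
\[
\min_{w} \; \frac{1}{n}\sum_{i=1}^{n} \ell(w; x_i, y_i) + \lambda \|w\|_1
\;=\;
\min_{w} \max_{\alpha \in [0,1]^n} \;
\frac{1}{n}\sum_{i=1}^{n} \alpha_i \bigl(1 - y_i w^\top x_i\bigr) + \lambda \|w\|_1 .
\]
```

The coupling term $\alpha_i y_i w^\top x_i$ is linear in $w$ for fixed $\alpha$ and linear in $\alpha$ for fixed $w$, which is the bilinear structure the minimax reformulation relies on.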
Abstract:In this paper we propose a general framework to characterize and solve stochastic optimization problems with multiple objectives, which underlie many real-world learning applications. We first propose a projection-based algorithm which attains an $O(T^{-1/3})$ convergence rate. Then, by leveraging Lagrangian theory in constrained optimization, we devise a novel primal-dual stochastic approximation algorithm which attains the optimal convergence rate of $O(T^{-1/2})$ for general Lipschitz continuous objectives.
Abstract:In this paper we consider learning in the passive setting but with a slight modification. We assume that the target expected loss, also referred to as the target risk, is provided in advance to the learner as prior knowledge. Unlike most studies in learning theory, which only incorporate the prior knowledge into the generalization bounds, we are able to explicitly utilize the target risk in the learning process. Our analysis reveals a surprising result on the sample complexity of learning: by exploiting the target risk in the learning algorithm, we show that when the loss function is both strongly convex and smooth, the sample complexity reduces to $O(\log (\frac{1}{\epsilon}))$, an exponential improvement compared to the sample complexity $O(\frac{1}{\epsilon})$ for learning with strongly convex loss functions. Furthermore, our proof is constructive and is based on a computationally efficient stochastic optimization algorithm for such settings, which demonstrates that the proposed algorithm is practically useful.
Abstract:In this paper, we study the problem of sparse multiple kernel learning (MKL), where the goal is to efficiently learn a combination of a fixed small number of kernels from a large pool that could lead to a kernel classifier with a small prediction error. We develop an efficient algorithm based on the greedy coordinate descent algorithm that is able to achieve a geometric convergence rate under appropriate conditions. The convergence rate is achieved by measuring the size of functional gradients by an empirical $\ell_2$ norm that depends on the empirical data distribution. This is in contrast to previous algorithms that use a functional norm to measure the size of gradients, which is independent of the data samples. We also establish a generalization error bound for the learned sparse kernel classifier using the technique of local Rademacher complexity.
Abstract:Online learning provides a compelling mathematical framework for analyzing sequential decision making problems in adversarial environments. The learner repeatedly chooses an action, the environment responds with an outcome, and then the learner receives a reward for the played action. The goal of the learner is to maximize the total reward. However, there are situations in which, in addition to maximizing the cumulative reward, there are additional constraints on the sequence of decisions that must be satisfied on average by the learner. In this paper we study an extension of online learning where the learner aims to maximize the total reward given that some additional constraints need to be satisfied. By leveraging the Lagrangian method in constrained optimization, we propose the Lagrangian exponentially weighted average (LEWA) algorithm, which is a primal-dual variant of the well-known exponentially weighted average algorithm, to efficiently solve constrained online decision making problems. Using novel theoretical analysis, we establish bounds on the regret and the violation of the constraints in the full information and bandit feedback models.
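For reference, the classical exponentially weighted average forecaster that LEWA builds on can be sketched as below. This is the standard unconstrained, full-information algorithm, not the LEWA algorithm of the paper; the function name, the reward sequence, and the learning rate `eta` are hypothetical choices made for illustration.

```python
import math


def exponentially_weighted_average(reward_rounds, eta):
    """Standard exponentially weighted average forecaster (full information).

    reward_rounds: list of per-round reward lists, one reward per action.
    eta: learning rate (hypothetical value chosen by the caller).
    Returns the probability distribution played in each round.
    """
    n_actions = len(reward_rounds[0])
    cumulative = [0.0] * n_actions  # cumulative reward of each action so far
    history = []
    for rewards in reward_rounds:
        # Subtract the maximum before exponentiating for numerical stability;
        # this does not change the normalized distribution.
        top = max(cumulative)
        weights = [math.exp(eta * (c - top)) for c in cumulative]
        total = sum(weights)
        history.append([w / total for w in weights])
        # Full information: the rewards of all actions are observed.
        cumulative = [c + r for c, r in zip(cumulative, rewards)]
    return history


# Illustrative run: action 0 always pays the most, so its weight grows.
rounds = [[1.0, 0.0, 0.5]] * 50
played = exponentially_weighted_average(rounds, eta=0.5)
```

The first distribution is uniform, and mass concentrates on the action with the largest cumulative reward as rounds accumulate. A primal-dual variant would additionally maintain a Lagrange multiplier for the constraint, which this sketch omits.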
Abstract:In this paper we propose a framework for solving constrained online convex optimization problems. Our motivation stems from the observation that most algorithms proposed for online convex optimization require a projection onto the convex set $\mathcal{K}$ from which the decisions are made. While for simple shapes (e.g. a Euclidean ball) the projection is straightforward, for arbitrary complex sets this is the main computational challenge and may be inefficient in practice. In this paper, we consider an alternative online convex optimization problem. Instead of requiring that decisions belong to $\mathcal{K}$ in all rounds, we only require that the constraints which define the set $\mathcal{K}$ be satisfied in the long run. We show that our framework can be utilized to solve a relaxed version of online learning with side constraints addressed in \cite{DBLP:conf/colt/MannorT06} and \cite{DBLP:conf/aaai/KvetonYTM08}. By turning the problem into an online convex-concave optimization problem, we propose an efficient algorithm which achieves an $\tilde{\mathcal{O}}(\sqrt{T})$ regret bound and an $\tilde{\mathcal{O}}(T^{3/4})$ bound for the violation of constraints. Then we modify the algorithm in order to guarantee that the constraints are satisfied in the long run. This gain is achieved at the price of an $\tilde{\mathcal{O}}(T^{3/4})$ regret bound. Our second algorithm is based on the Mirror Prox method \citep{nemirovski-2005-prox} for solving variational inequalities, and achieves an $\tilde{\mathcal{O}}(T^{2/3})$ bound for both the regret and the violation of constraints when the domain $\mathcal{K}$ can be described by a finite number of linear constraints. Finally, we extend the result to the setting where we only have partial access to the convex set $\mathcal{K}$ and propose a multipoint bandit feedback algorithm with the same bounds in expectation as our first algorithm.
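The convex-concave idea can be sketched as simultaneous gradient descent in the decision and gradient ascent in a Lagrange multiplier, with projection only onto a simple ball rather than onto $\mathcal{K}$. The sketch below is a one-dimensional illustration under stated assumptions, not the paper's algorithm: the function name, the regularized Lagrangian $f(x) + \lambda g(x) - \frac{\delta\eta}{2}\lambda^2$, the fixed loss, and all parameter values are hypothetical.

```python
def primal_dual_ogd(grad_f, g, grad_g, x0, radius, eta, delta, n_rounds):
    """Gradient descent-ascent on L(x, lam) = f(x) + lam*g(x) - (delta*eta/2)*lam**2.

    The decision x is projected only onto the interval [-radius, radius]
    (a simple set); the constraint g(x) <= 0 is handled through lam.
    Returns the list of played decisions and the final multiplier.
    """
    x, lam = x0, 0.0
    decisions = []
    for _ in range(n_rounds):
        decisions.append(x)
        # Partial derivatives of the regularized Lagrangian at (x, lam).
        step_x = grad_f(x) + lam * grad_g(x)
        step_lam = g(x) - delta * eta * lam
        # Descent in x, followed by the cheap projection onto the interval.
        x = max(-radius, min(radius, x - eta * step_x))
        # Ascent in lam, followed by projection onto lam >= 0.
        lam = max(0.0, lam + eta * step_lam)
    return decisions, lam


# Illustrative problem (assumed): minimize (x - 2)^2 subject to x - 1 <= 0.
decisions, multiplier = primal_dual_ogd(
    grad_f=lambda x: 2.0 * (x - 2.0),
    g=lambda x: x - 1.0,
    grad_g=lambda x: 1.0,
    x0=0.0, radius=5.0, eta=0.01, delta=1.0, n_rounds=5000,
)
```

In this toy run the iterates settle close to the constrained minimizer $x = 1$ rather than the unconstrained minimizer $x = 2$, with a small residual violation caused by the regularization on $\lambda$. Individual rounds may violate the constraint, which is the relaxation the abstract describes.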
Abstract:We develop an improved bound for the approximation error of the Nystr\"{o}m method under the assumption that there is a large eigengap in the spectrum of the kernel matrix. This is based on the empirical observation that the eigengap has a significant impact on the approximation error of the Nystr\"{o}m method. Our approach is based on a concentration inequality for integral operators and the theory of matrix perturbation. Our analysis shows that when there is a large eigengap, we can improve the approximation error of the Nystr\"{o}m method from $O(N/m^{1/4})$ to $O(N/m^{1/2})$ when measured in the Frobenius norm, where $N$ is the size of the kernel matrix, and $m$ is the number of sampled columns.
Abstract:We develop two approaches for analyzing the approximation error bound for the Nystr\"{o}m method, one based on a concentration inequality for integral operators, and one based on compressive sensing theory. We show that the approximation error, measured in the spectral norm, can be improved from $O(N/\sqrt{m})$ to $O(N/m^{1 - \rho})$ in the case of a large eigengap, where $N$ is the total number of data points, $m$ is the number of sampled data points, and $\rho \in (0, 1/2)$ is a positive constant that characterizes the eigengap. When the eigenvalues of the kernel matrix follow a $p$-power law, our analysis based on compressive sensing theory further improves the bound to $O(N/m^{p - 1})$ under an incoherence assumption, which explains why the Nystr\"{o}m method works well for kernel matrices with skewed eigenvalues. We present a kernel classification approach based on the Nystr\"{o}m method and derive its generalization performance using the improved bound. We show that when the eigenvalues of the kernel matrix follow a $p$-power law, we can reduce the number of support vectors to $N^{2p/(p^2 - 1)}$, a number less than $N$ when $p > 1+\sqrt{2}$, without seriously sacrificing its generalization performance.
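For readers unfamiliar with the approximation being bounded in the two abstracts above, the standard Nystr\"{o}m construction can be written as follows. The symbols $K$, $C$, $W$, and $\widehat{K}$ are notation introduced here for illustration; $N$ and $m$ are as in the abstracts.

```latex
\[
K \in \mathbb{R}^{N \times N} \ \text{(kernel matrix)}, \qquad
C \in \mathbb{R}^{N \times m} \ \text{($m$ sampled columns of $K$)}, \qquad
W \in \mathbb{R}^{m \times m} \ \text{(rows of $C$ at the sampled indices)},
\]
\[
\widehat{K} = C \, W^{+} C^\top ,
\qquad
\text{approximation error: } \|K - \widehat{K}\|_F \ \text{or} \ \|K - \widehat{K}\|_2 ,
\]
```

where $W^{+}$ denotes the Moore-Penrose pseudo-inverse. The bounds quoted above describe how this error decreases as the number of sampled columns $m$ grows.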
Abstract:We study the problem of multiple kernel learning from noisy labels. This is in contrast to most of the previous studies on multiple kernel learning that mainly focus on developing efficient algorithms and assume perfectly labeled training examples. Directly applying the existing multiple kernel learning algorithms to noisily labeled examples often leads to suboptimal performance due to the incorrect class assignments. We address this challenge by casting multiple kernel learning from noisy labels into a stochastic programming problem, and presenting a minimax formulation. We develop an efficient algorithm for solving the related convex-concave optimization problem with a fast convergence rate of $O(1/T)$ where $T$ is the number of iterations. Empirical studies on UCI data sets verify both the effectiveness of the proposed framework and the efficiency of the proposed optimization algorithm.
Abstract:In \citep{Hazan-2008-extract}, the authors showed that the regret of online linear optimization can be bounded by the total variation of the cost vectors. In this paper, we extend this result to general online convex optimization. We first analyze the limitations of the algorithm in \citep{Hazan-2008-extract} when applying it to online convex optimization. We then present two algorithms for online convex optimization whose regrets are bounded by the variation of cost functions. We finally consider the bandit setting, and present a randomized algorithm for online bandit convex optimization with a variation-based regret bound. We show that the regret bound for online bandit convex optimization is optimal when the variation of cost functions is independent of the number of trials.