



Abstract: We present a novel method for frequentist statistical inference in $M$-estimation problems, based on stochastic gradient descent (SGD) with a fixed step size: we demonstrate that the average of such SGD sequences can be used for statistical inference, after proper scaling. An intuitive analysis using the Ornstein-Uhlenbeck process suggests that such averages are asymptotically normal. From a practical perspective, our SGD-based inference procedure is a first-order method, and is well-suited for large-scale problems. To show its merits, we apply it to both synthetic and real datasets, and demonstrate that its accuracy is comparable to that of classical statistical methods, while requiring potentially far less computation.
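
To make the procedure concrete, here is a minimal sketch on linear regression (an $M$-estimation instance): run SGD with a fixed step size, split the trajectory into segments, and use the segment averages as approximately normal replicates. The segment length, step size, and ground-truth model below are illustrative assumptions, not the paper's exact prescription.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 5
theta_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ theta_true + rng.normal(size=n)

def sgd_segment(theta0, eta, t_steps):
    """Run fixed-step-size SGD on the least-squares loss;
    return the iterate average over the segment and the last iterate."""
    theta = theta0.copy()
    avg = np.zeros_like(theta)
    for _ in range(t_steps):
        i = rng.integers(n)                          # sample one data point
        theta -= eta * (X[i] @ theta - y[i]) * X[i]  # stochastic gradient step
        avg += theta
    return avg / t_steps, theta

# The spread of the segment averages estimates the sampling variability of
# the averaged-SGD estimator, enabling approximate confidence intervals.
eta, t_steps, n_segments = 0.01, 2_000, 20
theta, averages = np.zeros(d), []
for _ in range(n_segments):
    seg_avg, theta = sgd_segment(theta, eta, t_steps)
    averages.append(seg_avg)
averages = np.array(averages)
print("point estimate:", averages.mean(axis=0))
print("per-coordinate std of segment averages:", averages.std(axis=0))
```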




Abstract: A function $f: \mathbb{R}^d \rightarrow \mathbb{R}$ is a Sparse Additive Model (SPAM) if it is of the form $f(\mathbf{x}) = \sum_{l \in \mathcal{S}}\phi_{l}(x_l)$, where $\mathcal{S} \subset [d]$, $|\mathcal{S}| \ll d$. Assuming the $\phi_l$'s and $\mathcal{S}$ to be unknown, there exists extensive work on estimating $f$ from its samples. In this work, we consider a generalized version of SPAMs that also allows for the presence of a sparse number of second order interaction terms. For some $\mathcal{S}_1 \subset [d]$, $\mathcal{S}_2 \subset {[d] \choose 2}$, with $|\mathcal{S}_1| \ll d$, $|\mathcal{S}_2| \ll d^2$, the function $f$ is now assumed to be of the form: $\sum_{p \in \mathcal{S}_1}\phi_{p} (x_p) + \sum_{(l,l^{\prime}) \in \mathcal{S}_2}\phi_{(l,l^{\prime})} (x_l,x_{l^{\prime}})$. Assuming we have the freedom to query $f$ anywhere in its domain, we derive efficient algorithms that provably recover $\mathcal{S}_1,\mathcal{S}_2$ with finite sample bounds. Our analysis covers the noiseless setting, where exact samples of $f$ are obtained, and also extends to the noisy setting, where the queries are corrupted with noise. For the noisy setting in particular, we consider two noise models: i.i.d. Gaussian noise, and arbitrary but bounded noise. Our main methods for the identification of $\mathcal{S}_2$ essentially rely on the estimation of sparse Hessian matrices, for which we provide two novel compressed sensing based schemes. Once $\mathcal{S}_1, \mathcal{S}_2$ are known, we show how the individual components $\phi_p$, $\phi_{(l,l^{\prime})}$ can be estimated via additional queries of $f$, with uniform error bounds. Lastly, we provide simulation results on synthetic data that validate our theoretical findings.
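
As a toy illustration of why interaction recovery reduces to sparse Hessian estimation, note that $(l,l^{\prime}) \in \mathcal{S}_2$ exactly when the mixed partial $\partial^2 f / \partial x_l \partial x_{l^{\prime}}$ is non-zero somewhere. The sketch below estimates mixed partials by finite differences at random query points and thresholds them; this brute-force scan over all pairs is a naive stand-in for the paper's compressed sensing schemes, and the ground-truth $f$, threshold, and query distribution are assumptions.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
d = 8
S1, S2 = [0, 3], [(1, 4), (2, 6)]          # hypothetical ground truth

def f(x):
    return (sum(np.sin(x[p]) for p in S1)
            + sum(x[l] * x[lp] for (l, lp) in S2))

def mixed_partial(x, l, lp, h=1e-3):
    """Central finite-difference estimate of d^2 f / (dx_l dx_lp) at x."""
    el, elp = np.eye(d)[l], np.eye(d)[lp]
    return (f(x + h*el + h*elp) - f(x + h*el - h*elp)
            - f(x - h*el + h*elp) + f(x - h*el - h*elp)) / (4 * h * h)

# Average |mixed partial| over a few random points; pairs in S2 stand out.
n_points, tau = 5, 0.1
S2_hat = []
for l, lp in itertools.combinations(range(d), 2):
    score = np.mean([abs(mixed_partial(rng.uniform(-1, 1, d), l, lp))
                     for _ in range(n_points)])
    if score > tau:
        S2_hat.append((l, lp))
print("recovered interaction pairs:", S2_hat)
```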




Abstract: We propose a new proximal, path-following framework for a class of constrained convex problems. We consider settings where the nonlinear---and possibly non-smooth---objective part is endowed with a proximity operator, and the constraint set is equipped with a self-concordant barrier. Our approach relies on the following two main ideas. First, we re-parameterize the optimality condition as an auxiliary problem, such that a good initial point is available; by doing so, a family of alternative paths towards the optimum is generated. Second, we combine the proximal operator with path-following ideas to design a single-phase, proximal, path-following algorithm. Our method has several advantages. First, it allows handling non-smooth objectives via proximal operators; this avoids lifting the problem dimension in order to accommodate non-smooth components in the optimization. Second, it consists of only a \emph{single phase}: while the overall convergence rate of classical path-following schemes for self-concordant objectives does not suffer from the initialization phase, proximal path-following schemes undergo slow convergence in order to obtain a good starting point \cite{TranDinh2013e}. In this work, we show how to overcome this limitation in the proximal setting, and prove that our scheme has the same $\mathcal{O}(\sqrt{\nu}\log(1/\varepsilon))$ worst-case iteration complexity as standard approaches \cite{Nesterov2004,Nesterov1994}, without requiring an initial phase, where $\nu$ is the barrier parameter and $\varepsilon$ is the desired accuracy. Finally, our framework allows errors in the calculation of proximal-Newton directions, without sacrificing the worst-case iteration complexity. We demonstrate the merits of our algorithm via three numerical examples, where proximal operators play a key role.
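
The following sketch conveys the path-following flavor on a small instance: $\ell_1$-regularized least squares over the box $[-1,1]^n$, with a log barrier for the box and the soft-thresholding proximal operator for the $\ell_1$ term. Proximal gradient steps stand in for the paper's proximal-Newton directions, and the homotopy schedule, step size, and problem data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, lam = 40, 20, 0.1
A = rng.normal(size=(m, n))
b = rng.normal(size=m)

def barrier_grad(x):
    """Gradient of the box barrier -sum log(1 - x_i) - sum log(1 + x_i)."""
    return 1.0 / (1.0 - x) - 1.0 / (1.0 + x)

def soft_threshold(z, thr):
    """Proximal operator of thr * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)

x = np.zeros(n)                       # strictly feasible initial point
step = 1.0 / np.linalg.norm(A, 2)**2  # step from the data term's smoothness
t = 1.0
while t < 1e4:                        # follow the path as t grows
    for _ in range(50):               # a few proximal steps per value of t
        grad = A.T @ (A @ x - b) + barrier_grad(x) / t
        x = soft_threshold(x - step * grad, step * lam)
        x = np.clip(x, -0.999, 0.999)   # stay inside the barrier's domain
    t *= 2.0
print("approximate solution:", np.round(x, 3))
```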




Abstract: A rank-$r$ matrix $X \in \mathbb{R}^{m \times n}$ can be written as a product $U V^\top$, where $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{n \times r}$. One could exploit this observation in optimization: e.g., consider the minimization of a convex function $f(X)$ over rank-$r$ matrices, where the set of rank-$r$ matrices is modeled via the factorization $UV^\top$. Though such a parameterization reduces the number of variables and is more computationally efficient (of particular interest is the case $r \ll \min\{m, n\}$), it comes at a cost: $f(UV^\top)$ becomes a non-convex function w.r.t. $U$ and $V$. We study such parameterizations for the optimization of generic convex objectives $f$, and focus on first-order, gradient descent algorithmic solutions. We propose the Bi-Factored Gradient Descent (BFGD) algorithm, an efficient first-order method that operates on the $U, V$ factors. We show that when $f$ is (restricted) smooth, BFGD has local sublinear convergence, and linear convergence when $f$ is both (restricted) smooth and (restricted) strongly convex. For several key applications, we provide simple and efficient initialization schemes that provide approximate solutions good enough for the above convergence results to hold.
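
A minimal sketch of the BFGD idea on the toy objective $f(X) = \|X - M\|_F^2/2$ follows. The spectral initialization and the balancing term $U^\top U - V^\top V$ in the factor gradients are common choices in this literature and are included here as assumptions, as are the step size and problem sizes.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, r = 30, 20, 3
M = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))   # rank-r target

def grad_f(X):                     # gradient of f at X = U V^T
    return X - M

# Spectral initialization from the top-r SVD of M, perturbed so the
# iterations have work to do.
P, s, Qt = np.linalg.svd(M, full_matrices=False)
U = P[:, :r] * np.sqrt(s[:r]) + 0.3 * rng.normal(size=(m, r))
V = Qt[:r].T * np.sqrt(s[:r]) + 0.3 * rng.normal(size=(n, r))

eta = 0.2 / np.linalg.norm(M, 2)   # step size shrinks with the scale of M
mu = 0.5                           # weight of the balancing term
for _ in range(1000):
    G = grad_f(U @ V.T)
    bal = U.T @ U - V.T @ V        # keeps the two factors balanced
    U, V = (U - eta * (G @ V + mu * U @ bal),
            V - eta * (G.T @ U - mu * V @ bal))
print("relative error:", np.linalg.norm(U @ V.T - M) / np.linalg.norm(M))
```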




Abstract: We study the projected gradient descent method on low-rank matrix problems with a strongly convex objective. We use the Burer-Monteiro factorization approach to implicitly enforce low-rankness; such a factorization introduces non-convexity in the objective. We focus on constraint sets that include both positive semi-definite (PSD) constraints and specific matrix norm constraints. Such criteria appear in quantum state tomography and phase retrieval applications. We show that non-convex projected gradient descent achieves local linear convergence in the factored space. We build our theory on a novel descent lemma that non-trivially extends recent results on the unconstrained problem. The resulting algorithm, Projected Factored Gradient Descent (abbreviated as ProjFGD), shows superior performance compared to the state of the art on quantum state tomography and sparse phase retrieval applications.
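
For intuition, consider the unit-trace PSD constraint from quantum state tomography: with $X = UU^\top$, PSD-ness is automatic and $\mathrm{trace}(X) = \|U\|_F^2$, so the constraint projection in the factored space is a Frobenius-norm rescaling of $U$. The toy objective, step size, and sizes below are assumptions; this is a sketch of the ProjFGD pattern, not the paper's tuned algorithm.

```python
import numpy as np

rng = np.random.default_rng(4)
n, r = 16, 2
W = rng.normal(size=(n, r))
M = W @ W.T
M /= np.trace(M)                   # rank-r, PSD, unit-trace target

def grad_f(X):                     # gradient of f(X) = ||X - M||_F^2 / 2
    return X - M

U = rng.normal(size=(n, r))
U /= np.linalg.norm(U)             # project onto {U : ||U||_F = 1}
eta = 0.1
for _ in range(500):
    U -= eta * 2.0 * grad_f(U @ U.T) @ U    # factored gradient step
    U /= np.linalg.norm(U)                  # projection in the factor space
X = U @ U.T
print("trace:", round(np.trace(X), 6),
      "relative error:", np.linalg.norm(X - M) / np.linalg.norm(M))
```
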
Abstract: We consider the non-square matrix sensing problem, under restricted isometry property (RIP) assumptions. We focus on the non-convex formulation, where any rank-$r$ matrix $X \in \mathbb{R}^{m \times n}$ is represented as $UV^\top$, with $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{n \times r}$. In this paper, we complement recent findings on the non-convex geometry of the analogous PSD setting [5], and show that, under RIP, matrix factorization does not introduce any spurious local minima.
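
A small numerical experiment consistent with this claim: gradient descent on the factored matrix sensing loss, started from a random point, recovers the planted low-rank matrix. The Gaussian measurement ensemble, problem sizes, and step size below are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, r, k = 12, 10, 2, 600                  # k linear measurements
Xstar = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))
A = rng.normal(size=(k, m, n)) / np.sqrt(k)  # random sensing operator
y = np.tensordot(A, Xstar, axes=([1, 2], [0, 1]))

U = 0.5 * rng.normal(size=(m, r))            # random initialization
V = 0.5 * rng.normal(size=(n, r))
eta = 0.01
for _ in range(3000):
    resid = np.tensordot(A, U @ V.T, axes=([1, 2], [0, 1])) - y
    G = np.tensordot(resid, A, axes=(0, 0))  # gradient in matrix space
    U, V = U - eta * G @ V, V - eta * G.T @ U
print("relative error:",
      np.linalg.norm(U @ V.T - Xstar) / np.linalg.norm(Xstar))
```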



Abstract: Given two sets of variables, derived from a common set of samples, sparse Canonical Correlation Analysis (CCA) seeks linear combinations of a small number of variables in each set, such that the induced canonical variables are maximally correlated. Sparse CCA is NP-hard. We propose a novel combinatorial algorithm for sparse diagonal CCA, i.e., sparse CCA under the additional assumption that the variables within each set are standardized and uncorrelated. Our algorithm operates on a low rank approximation of the input data, and its computational complexity scales linearly with the number of input variables. It is simple to implement and parallelizable. In contrast to most existing approaches, our algorithm affords precise control on the sparsity of the extracted canonical vectors, and comes with theoretical, data-dependent, global approximation guarantees that hinge on the spectrum of the input data. Finally, it can be straightforwardly adapted to other constrained variants of CCA, enforcing structure beyond sparsity. We empirically evaluate the proposed scheme, and apply it to a real neuroimaging dataset to investigate associations between brain activity and behavior measurements.
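
In the diagonal setting, the problem reduces to finding sparse $u, v$ maximizing $u^\top C v$ for the empirical cross-covariance $C$. The truncated rank-1 heuristic below (SVD, keep the largest-magnitude entries, renormalize) is a naive stand-in for the paper's combinatorial algorithm; the synthetic data and sparsity level are assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, q, s = 500, 30, 25, 5            # samples, variables, sparsity
Z = rng.normal(size=n)                 # shared latent signal
X = rng.normal(size=(n, p)); X[:, :s] += Z[:, None]
Y = rng.normal(size=(n, q)); Y[:, :s] += Z[:, None]
X = (X - X.mean(0)) / X.std(0)         # standardize each variable
Y = (Y - Y.mean(0)) / Y.std(0)
C = X.T @ Y / n                        # empirical cross-covariance

def sparsify(w, s):
    """Keep the s largest-magnitude entries of w; renormalize."""
    out = np.zeros_like(w)
    idx = np.argsort(-np.abs(w))[:s]
    out[idx] = w[idx]
    return out / np.linalg.norm(out)

P, _, Qt = np.linalg.svd(C)
u, v = sparsify(P[:, 0], s), sparsify(Qt[0], s)
print("support of u:", np.flatnonzero(u))
print("canonical correlation:", np.corrcoef(X @ u, Y @ v)[0, 1])
```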




Abstract: A function $f: \mathbb{R}^d \rightarrow \mathbb{R}$ is referred to as a Sparse Additive Model (SPAM) if it is of the form $f(\mathbf{x}) = \sum_{l \in \mathcal{S}}\phi_{l}(x_l)$, where $\mathcal{S} \subset [d]$, $|\mathcal{S}| \ll d$. Assuming the $\phi_l$'s and $\mathcal{S}$ to be unknown, the problem of estimating $f$ from its samples has been studied extensively. In this work, we consider a generalized SPAM, allowing for second order interaction terms. For some $\mathcal{S}_1 \subset [d]$, $\mathcal{S}_2 \subset {[d] \choose 2}$, the function $f$ is assumed to be of the form: $$f(\mathbf{x}) = \sum_{p \in \mathcal{S}_1}\phi_{p} (x_p) + \sum_{(l,l^{\prime}) \in \mathcal{S}_2}\phi_{(l,l^{\prime})} (x_{l},x_{l^{\prime}}).$$ Assuming $\phi_{p}$, $\phi_{(l,l^{\prime})}$, $\mathcal{S}_1$, and $\mathcal{S}_2$ to be unknown, we provide a randomized algorithm that queries $f$ and exactly recovers $\mathcal{S}_1,\mathcal{S}_2$. Consequently, this also enables us to estimate the underlying $\phi_p, \phi_{(l,l^{\prime})}$. We derive sample complexity bounds for our scheme, and also extend our analysis to the situation where the queries are corrupted with noise -- either stochastic, or arbitrary but bounded. Lastly, we provide simulation results on synthetic data that validate our theoretical findings.
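
Once the supports are known, the component estimation step can be pictured as follows: pinning all other coordinates at a base point (here $0$, with the components assumed to vanish there) and querying $f$ along one coordinate traces out the corresponding $\phi_p$. The ground-truth $f$, supports, and query grid below are illustrative assumptions.

```python
import numpy as np

d = 6                                   # ambient dimension
# Hypothetical ground truth: S1 = {0, 3}, S2 = {(1, 4)}.
def f(x):
    return np.sin(x[0]) + x[3]**2 + x[1] * x[4]

def estimate_univariate(p, grid):
    """Query f along coordinate p with all other coordinates pinned at 0."""
    vals = []
    for t in grid:
        x = np.zeros(d)
        x[p] = t
        vals.append(f(x))
    return np.array(vals)

grid = np.linspace(-1, 1, 21)
phi0_hat = estimate_univariate(0, grid)   # should trace out sin(t)
print("max error for phi_0:", np.max(np.abs(phi0_hat - np.sin(grid))))
```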




Abstract: We study the minimization of a convex function $f(X)$ over the set of $n\times n$ positive semi-definite matrices, when the problem is recast as $\min_U g(U) := f(UU^\top)$, with $U \in \mathbb{R}^{n \times r}$ and $r \leq n$. We study the performance of gradient descent on $g$---which we refer to as Factored Gradient Descent (FGD)---under standard assumptions on the original function $f$. We provide a rule for selecting the step size and, with this choice, show that the local convergence rate of FGD mirrors that of standard gradient descent on the original $f$: i.e., after $k$ steps, the error is $O(1/k)$ for smooth $f$, and exponentially small in $k$ when $f$ is (restricted) strongly convex. In addition, we provide a procedure to initialize FGD for (restricted) strongly convex objectives when one only has access to $f$ via a first-order oracle; for several problem instances, such proper initialization leads to global convergence guarantees. FGD and similar procedures are widely used in practice for problems that can be posed as matrix factorization. To the best of our knowledge, this is the first paper to provide precise convergence rate guarantees for general convex functions under standard assumptions.
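
A minimal sketch of FGD on the toy objective $f(X) = \|X - M\|_F^2/2$ follows; the perturbed top-$r$ spectral initialization echoes the first-order-oracle initialization idea, while the constant step size proportional to $1/\|U\|_2^2$ is a simplification of the paper's step-size rule. All problem data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
n, r = 25, 3
B = rng.normal(size=(n, r))
M = B @ B.T                              # PSD, rank-r target

def grad_f(X):
    return X - M

# Top-r eigenpairs of M, perturbed so the iterations have work to do.
w, Q = np.linalg.eigh(M)
U = Q[:, -r:] * np.sqrt(np.maximum(w[-r:], 0.0))
U += 0.3 * rng.normal(size=(n, r))

eta = 1.0 / (4.0 * np.linalg.norm(U, 2)**2)   # step shrinks with ||U||_2^2
for _ in range(500):
    U -= eta * 2.0 * grad_f(U @ U.T) @ U      # gradient of g(U) = f(UU^T)
print("relative error:",
      np.linalg.norm(U @ U.T - M) / np.linalg.norm(M))
```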


Abstract: Sparse matrices are favorable objects in machine learning and optimization. When such matrices are used in place of dense ones, the overall complexity requirements in optimization can be significantly reduced in practice, both in terms of space and run-time. Prompted by this observation, we study a convex optimization scheme for block-sparse recovery from linear measurements. To obtain linear sketches, we use expander matrices, i.e., sparse matrices containing only a few non-zeros per column. Hitherto, to the best of our knowledge, such algorithmic solutions have only been studied from a non-convex perspective. Our aim here is to theoretically characterize the performance of convex approaches in such a setting. Our key novelty is the expression of the recovery error in terms of the model-based norm, while ensuring that the solution lives in the model. To achieve this, we show that sparse model-based matrices satisfy a group version of the null-space property. Our experimental findings on synthetic and real applications support our claims of faster recovery in the convex setting, as opposed to using dense sensing matrices, while showing competitive recovery performance.
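
The sketch below illustrates the setting: a sparse, expander-like binary sensing matrix (a few ones per column at random rows) and a block-sparse signal, recovered by proximal gradient descent on the group-lasso objective $\tfrac{1}{2}\|Ax - y\|_2^2 + \lambda \sum_g \|x_g\|_2$. This Lagrangian relaxation stands in for the paper's constrained model-based convex program; the column degree, block size, and $\lambda$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
n, m, deg = 256, 100, 8                 # signal dim, sketches, ones/column
block, k_blocks = 8, 3                  # block size, active blocks

# Expander-like sensing matrix: deg ones per column at random rows.
A = np.zeros((m, n))
for j in range(n):
    A[rng.choice(m, size=deg, replace=False), j] = 1.0
A /= np.sqrt(deg)                       # normalize columns

# Block-sparse ground truth and its noiseless sketch.
x_true = np.zeros(n)
for g in rng.choice(n // block, size=k_blocks, replace=False):
    x_true[g*block:(g+1)*block] = rng.normal(size=block)
y = A @ x_true

def group_prox(x, thr):
    """Block soft-thresholding: shrink each block's l2 norm by thr."""
    z = x.reshape(-1, block)
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    return (z * np.maximum(1.0 - thr / np.maximum(norms, 1e-12), 0.0)).ravel()

lam, step = 0.01, 1.0 / np.linalg.norm(A, 2)**2
x = np.zeros(n)
for _ in range(2000):
    x = group_prox(x - step * A.T @ (A @ x - y), step * lam)
print("relative error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```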