Le Song

Scale Up Nonlinear Component Analysis with Doubly Stochastic Gradients

Jan 10, 2016
Bo Xie, Yingyu Liang, Le Song

Nonlinear component analysis methods such as kernel Principal Component Analysis (KPCA) and kernel Canonical Correlation Analysis (KCCA) are widely used in machine learning, statistics and data analysis, but they cannot scale up to big datasets. Recent attempts have employed random feature approximations to convert the problem to the primal form for linear computational complexity. However, to obtain high quality solutions, the number of random features should be of the same order of magnitude as the number of data points, making such an approach not directly applicable to the regime with millions of data points. We propose a simple, computationally efficient, and memory friendly algorithm based on "doubly stochastic gradients" to scale up a range of kernel nonlinear component analysis methods, such as kernel PCA, CCA and SVD. Despite the non-convex nature of these problems, our method enjoys theoretical guarantees that it converges at the rate $\tilde{O}(1/t)$ to the global optimum, even for the top $k$ eigen subspace. Unlike many alternatives, our algorithm does not require explicit orthogonalization, which is infeasible on big datasets. We demonstrate the effectiveness and scalability of our algorithm on large scale synthetic and real world datasets.

A Continuous-time Mutually-Exciting Point Process Framework for Prioritizing Events in Social Media

Nov 13, 2015
Mehrdad Farajtabar, Safoora Yousefi, Long Q. Tran, Le Song, Hongyuan Zha

The overwhelming amount and rate of information update in online social media is making it increasingly difficult for users to allocate their attention to their topics of interest, so there is a strong need for prioritizing news feeds. The attractiveness of a post to a user depends on many complex contextual and temporal features of the post. For instance, the contents of the post, the responsiveness of a third user, and the age of the post may all have an impact. So far, these static and dynamic features have not been incorporated in a unified framework to tackle the post prioritization problem. In this paper, we propose a novel approach for prioritizing posts based on a feature modulated multi-dimensional point process. Our model is able to simultaneously capture textual and sentiment features, and temporal features such as self-excitation, mutual-excitation and the bursty nature of social interaction. For evaluation, we also curated a real-world conversational benchmark dataset crawled from Facebook. In our experiments, we demonstrate that our algorithm is able to achieve state-of-the-art performance in terms of analyzing, predicting, and prioritizing events. In terms of interpretability of our method, we observe that features indicating individual user profile and linguistic characteristics of the events work best for prediction and prioritization of new events.
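
To make the terms self-excitation and mutual-excitation concrete, the following is a minimal sketch of the conditional intensity of a plain multi-dimensional Hawkes process. It is not the feature modulated model from the paper: the exponential decay kernel, the function name `hawkes_intensity`, and the parameters `mu`, `A` and `beta` are assumptions chosen for illustration.

```python
import numpy as np

def hawkes_intensity(t, events, mu, A, beta):
    """Conditional intensity of each dimension at time t.

    events : list of (time, dimension) pairs observed so far
    mu     : base rates, shape (D,)
    A      : excitation matrix, shape (D, D); A[u, v] is how strongly an
             event in dimension v excites dimension u (the diagonal is
             self-excitation, the off-diagonal is mutual-excitation)
    beta   : decay rate of the assumed exponential kernel
    """
    A = np.asarray(A, dtype=float)
    lam = np.array(mu, dtype=float)  # copy so the caller's mu is untouched
    for t_i, v in events:
        if t_i < t:  # only past events contribute
            lam += A[:, v] * np.exp(-beta * (t - t_i))
    return lam

# One past event in dimension 1 raises the intensity of both dimensions.
mu = [0.1, 0.2]
A = [[0.5, 0.3],
     [0.0, 0.4]]
print(hawkes_intensity(1.0, [(0.0, 1)], mu, A, beta=1.0))
```

Each past event adds a decaying bump to the intensity, which is what produces bursts of activity after an initial post or comment.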

Scalable Kernel Methods via Doubly Stochastic Gradients

Sep 10, 2015
Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina Balcan, Le Song

The general perception is that kernel methods are not scalable, and neural nets are the methods of choice for nonlinear learning problems. Or have we simply not tried hard enough for kernel methods? Here we propose an approach that scales up kernel methods using a novel concept called "doubly stochastic functional gradients". Our approach relies on the fact that many kernel methods can be expressed as convex optimization problems, and we solve the problems by making two unbiased stochastic approximations to the functional gradient, one using random training points and another using random functions associated with the kernel, and then descending using this noisy functional gradient. We show that a function produced by this procedure after $t$ iterations converges to the optimal function in the reproducing kernel Hilbert space at rate $O(1/t)$, and achieves a generalization performance of $O(1/\sqrt{t})$. This double stochasticity also allows us to avoid keeping the support vectors and to implement the algorithm in a small memory footprint, which is linear in the number of iterations and independent of the data dimension. Our approach can readily scale kernel methods up to the regimes which are dominated by neural nets. We show that our method can achieve performance competitive with neural nets on datasets such as 8 million handwritten digits from MNIST, 2.3 million energy materials from MolecularSpace, and 1 million photos from ImageNet.
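
The two sources of randomness described above can be sketched as follows. This is an illustrative reading of the abstract, not the authors' implementation: it assumes a squared loss, a Gaussian (RBF) kernel approximated with random Fourier features, and a simple `theta / (1 + t)` step size, and the names `random_feature`, `predict` and `doubly_sgd` are hypothetical. Each random feature is regenerated from its index used as a seed, so only one coefficient per iteration is stored, regardless of the data dimension.

```python
import numpy as np

def random_feature(i, X, sigma=1.0):
    """Evaluate the i-th random Fourier feature on the rows of X.

    The frequency and phase are regenerated from the seed [12345, i]
    (an arbitrary choice), so they never need to be stored.
    """
    rng = np.random.default_rng([12345, i])
    omega = rng.normal(scale=1.0 / sigma, size=X.shape[1])
    b = rng.uniform(0.0, 2.0 * np.pi)
    return np.sqrt(2.0) * np.cos(X @ omega + b)

def predict(alpha, X, sigma=1.0):
    """f(x) = sum_i alpha_i * phi_i(x) for every row x of X."""
    out = np.zeros(X.shape[0])
    for i, a in enumerate(alpha):
        out += a * random_feature(i, X, sigma)
    return out

def doubly_sgd(X, y, n_iter=200, nu=1e-3, theta=0.5, sigma=1.0, seed=0):
    """Sketch of doubly stochastic functional gradient descent (squared loss)."""
    data_rng = np.random.default_rng(seed)
    alpha = np.zeros(n_iter)
    for t in range(n_iter):
        j = data_rng.integers(X.shape[0])      # randomness 1: a training point
        x, target = X[j:j + 1], y[j]
        f_x = predict(alpha[:t], x, sigma)[0]  # current function value at x
        gamma = theta / (1.0 + t)              # assumed decaying step size
        grad = f_x - target                    # derivative of 0.5 * (f - y)^2
        alpha[:t] *= 1.0 - gamma * nu          # shrinkage from the regularizer
        # randomness 2: the t-th random feature, evaluated at the sampled point
        alpha[t] = -gamma * grad * random_feature(t, x, sigma)[0]
    return alpha

rng = np.random.default_rng(1)
X = rng.uniform(-3.0, 3.0, size=(100, 1))
y = np.sin(X[:, 0])
alpha = doubly_sgd(X, y, n_iter=100)
print(alpha.shape)  # one coefficient per iteration
```

The stored state is the coefficient vector `alpha` alone, which mirrors the abstract's statement that the memory footprint grows with the number of iterations rather than with the data dimension.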

* 32 pages, 22 figures

Online Supervised Subspace Tracking

Sep 01, 2015
Yao Xie, Ruiyang Song, Hanjun Dai, Qingbin Li, Le Song

We present a framework for supervised subspace tracking, where there are two time series $x_t$ and $y_t$, one being the high-dimensional predictors and the other being the response variables, and the subspace tracking needs to take both sequences into consideration. It extends the classic online subspace tracking work, which can be viewed as tracking of $x_t$ only. Our online sufficient dimensionality reduction (OSDR) is a meta-algorithm that can be applied to various cases including linear regression, logistic regression, multiple linear regression, multinomial logistic regression, support vector machine, the random dot product model and the multi-scale union-of-subspace model. OSDR reduces data dimensionality on the fly with low computational complexity, and it can also handle missing data and dynamic data. OSDR uses an alternating minimization scheme and updates the subspace via gradient descent on the Grassmannian manifold. The subspace update can be performed efficiently by exploiting the fact that the Grassmannian gradient with respect to the subspace is rank-one in many settings (or low-rank in certain cases). The optimization problem for OSDR is non-convex and hard to analyze in general; we provide a convergence analysis of OSDR in a simple linear regression setting. The good performance of OSDR compared with conventional unsupervised subspace tracking is demonstrated via numerical examples on simulated and real data.

* Submitted for journal publication

Deep Fried Convnets

Jul 17, 2015
Zichao Yang, Marcin Moczulski, Misha Denil, Nando de Freitas, Alex Smola, Le Song, Ziyu Wang

The fully connected layers of a deep convolutional neural network typically contain over 90% of the network parameters, and consume the majority of the memory required to store the network parameters. Reducing the number of parameters while preserving essentially the same predictive performance is critically important for operating deep neural networks in memory constrained environments such as GPUs or embedded devices. In this paper we show how kernel methods, in particular a single Fastfood layer, can be used to replace all fully connected layers in a deep convolutional neural network. This novel Fastfood layer is also end-to-end trainable in conjunction with convolutional layers, allowing us to combine them into a new architecture, named deep fried convolutional networks, which substantially reduces the memory footprint of convolutional networks trained on MNIST and ImageNet with no drop in predictive performance.

* svd experiments included

A la Carte - Learning Fast Kernels

Dec 19, 2014
Zichao Yang, Alexander J. Smola, Le Song, Andrew Gordon Wilson

Kernel methods have great promise for learning rich statistical representations of large modern datasets. However, compared to neural networks, kernel methods have been perceived as lacking in scalability and flexibility. We introduce a family of fast, flexible, lightly parametrized and general purpose kernel learning methods, derived from Fastfood basis function expansions. We provide mechanisms to learn the properties of groups of spectral frequencies in these expansions, which require only $O(m \log d)$ time and $O(m)$ memory, for $m$ basis functions and $d$ input dimensions. We show that the proposed methods can learn a wide class of kernels, outperforming the alternatives in accuracy, speed, and memory consumption.
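
The $\log d$ factor in Fastfood-style expansions comes from replacing a dense random projection with products of diagonal matrices and Walsh-Hadamard transforms. The sketch below shows only that ingredient, a fast Walsh-Hadamard transform in $O(d \log d)$ operations; it is not the kernel learning method of the paper, and the function name `fwht` is an assumption for illustration.

```python
import numpy as np

def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform of a length-2^k vector.

    Returns H @ x, where H is the Sylvester-ordered Hadamard matrix,
    without ever forming H.
    """
    x = np.asarray(x, dtype=float).copy()
    n = x.shape[0]
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:  # log2(n) passes over the data
        for start in range(0, n, 2 * h):
            a = x[start:start + h].copy()
            b = x[start + h:start + 2 * h].copy()
            x[start:start + h] = a + b          # butterfly: sums
            x[start + h:start + 2 * h] = a - b  # butterfly: differences
        h *= 2
    return x

# Check against the explicit 8 x 8 Hadamard matrix.
H2 = np.array([[1.0, 1.0], [1.0, -1.0]])
H8 = np.kron(np.kron(H2, H2), H2)
v = np.arange(8.0)
print(np.allclose(fwht(v), H8 @ v))
```

Each pass touches every entry once and there are $\log_2 d$ passes, so the transform avoids the $O(d^2)$ cost of a dense matrix-vector product.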

Active Learning and Best-Response Dynamics

Jun 25, 2014
Maria-Florina Balcan, Chris Berlind, Avrim Blum, Emma Cohen, Kaushik Patnaik, Le Song

We examine an important setting for engineered systems in which low-power distributed sensors are each making highly noisy measurements of some unknown target function. A center wants to accurately learn this function by querying a small number of sensors, which ordinarily would be impossible due to the high noise rate. The question we address is whether local communication among sensors, together with natural best-response dynamics in an appropriately-defined game, can denoise the system without destroying the true signal and allow the center to succeed from only a small number of active queries. By using techniques from game theory and empirical processes, we prove positive (and negative) results on the denoising power of several natural dynamics. We then show experimentally that when combined with recent agnostic active learning algorithms, this process can achieve low error from very few queries, performing substantially better than active or passive learning without these denoising dynamics as well as passive learning with denoising.

Estimating Diffusion Network Structures: Recovery Conditions, Sample Complexity & Soft-thresholding Algorithm

May 12, 2014
Hadi Daneshmand, Manuel Gomez-Rodriguez, Le Song, Bernhard Schoelkopf

Information spreads across social and technological networks, but often the network structures are hidden from us and we only observe the traces left by the diffusion processes, called cascades. Can we recover the hidden network structures from these observed cascades? What kind of cascades and how many cascades do we need? Are there some network structures which are more difficult than others to recover? Can we design efficient inference algorithms with provable guarantees? Despite the increasing availability of cascade data and methods for inferring networks from these data, a thorough theoretical understanding of the above questions remains largely unexplored in the literature. In this paper, we investigate the network structure inference problem for a general family of continuous-time diffusion models using an $l_1$-regularized likelihood maximization framework. We show that, as long as the cascade sampling process satisfies a natural incoherence condition, our framework can recover the correct network structure with high probability if we observe $O(d^3 \log N)$ cascades, where $d$ is the maximum number of parents of a node and $N$ is the total number of nodes. Moreover, we develop a simple and efficient soft-thresholding inference algorithm, which we use to illustrate the consequences of our theoretical results, and show that our framework outperforms other alternatives in practice.
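
For readers unfamiliar with soft-thresholding, the operator that typically appears in $l_1$-regularized problems can be sketched as below. This is the generic proximal operator of the $l_1$ norm, not the paper's full inference algorithm; the function name `soft_threshold` and the threshold parameter `lam` are assumptions for illustration.

```python
import numpy as np

def soft_threshold(x, lam):
    """Shrink each entry of x toward zero by lam, zeroing small entries.

    This is the proximal operator of lam * ||x||_1.
    """
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# Entries with magnitude below the threshold are set exactly to zero,
# which is what yields sparse estimates of network edges.
print(soft_threshold([3.0, -0.5, -2.0], 1.0))
```

In a proximal gradient scheme, a gradient step on the smooth likelihood term would be followed by this operator applied to the candidate edge weights.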

* To appear in the 31st International Conference on Machine Learning (ICML), 2014

Budgeted Influence Maximization for Multiple Products

Apr 16, 2014
Nan Du, Yingyu Liang, Maria Florina Balcan, Le Song

The typical algorithmic problem in viral marketing aims to identify a set of influential users in a social network, who, when convinced to adopt a product, shall influence other users in the network and trigger a large cascade of adoptions. However, the host (the owner of an online social platform) rarely operates in the idealized setting of a single product, endless user attention, unlimited budget and unbounded time; in reality, multiple products need to be advertised, each user can tolerate only a small number of recommendations, influencing a user has a cost and advertisers have only limited budgets, and the adoptions need to be maximized within a short time window. Given these myriad user, monetary, and timing constraints, it is extremely challenging for the host to design principled and efficient viral marketing algorithms with provable guarantees. In this paper, we provide a novel solution by formulating the problem as a submodular maximization in a continuous-time diffusion model under an intersection of a matroid and multiple knapsack constraints. We also propose an adaptive threshold greedy algorithm which can be faster than the traditional greedy algorithm with lazy evaluation, and scalable to networks with millions of nodes. Furthermore, our mathematical formulation allows us to prove that the algorithm can achieve an approximation factor of $k_a/(2+2 k)$ when $k_a$ out of the $k$ knapsack constraints are active, which also improves over previous guarantees from the combinatorial optimization literature. In the case when influencing each user has uniform cost, the approximation factor improves to $1/3$. Extensive synthetic and real world experiments demonstrate that our budgeted influence maximization algorithm achieves state-of-the-art performance in terms of both effectiveness and scalability, often beating the next best by significant margins.

Nonparametric Latent Tree Graphical Models: Inference, Estimation, and Structure Learning

Jan 16, 2014
Le Song, Han Liu, Ankur Parikh, Eric Xing

Tree structured graphical models are powerful at expressing long range or hierarchical dependency among many variables, and have been widely applied in different areas of computer science and statistics. However, existing methods for parameter estimation, inference, and structure learning mainly rely on Gaussian or discrete assumptions, which are restrictive in many applications. In this paper, we propose new nonparametric methods based on reproducing kernel Hilbert space embeddings of distributions that can recover the latent tree structures, estimate the parameters, and perform inference for high dimensional continuous and non-Gaussian variables. The usefulness of the proposed methods is illustrated by thorough numerical results.

* 29 pages, 5 figures
