Afshin Rostamizadeh

Active Covering

Jun 04, 2021

Heinrich Jiang, Afshin Rostamizadeh

Abstract:We analyze the problem of active covering, where the learner is given an unlabeled dataset and can sequentially label query examples. The objective is to query the labels of all of the positive examples using the fewest total label queries. We show under standard non-parametric assumptions that a classical support estimator can be repurposed as an offline algorithm attaining an excess query cost of $\widetilde{\Theta}(n^{D/(D+1)})$ compared to the optimal learner, where $n$ is the number of datapoints and $D$ is the dimension. We then provide a simple active learning method that attains an improved excess query cost of $\widetilde{O}(n^{(D-1)/D})$. Furthermore, the proposed algorithms only require access to the positive labeled examples, which in certain settings provides additional computational and privacy benefits. Finally, we show that the active learning method consistently outperforms offline methods as well as a variety of baselines on a wide range of benchmark image-based datasets.

* ICML 2021
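
To make the gap between the two excess-query-cost rates in the Active Covering abstract concrete, the following is a rough numerical sketch. It compares only the polynomial terms $n^{D/(D+1)}$ and $n^{(D-1)/D}$ and ignores the constants and logarithmic factors hidden by the tilde notation, so the printed numbers are illustrative, not costs reported by the paper. The function names and the chosen values of $n$ and $D$ are assumptions made for this example.

```python
def offline_exponent(dim: int) -> float:
    """Exponent of n in the offline rate n^(D/(D+1))."""
    return dim / (dim + 1)


def active_exponent(dim: int) -> float:
    """Exponent of n in the active rate n^((D-1)/D)."""
    return (dim - 1) / dim


# Illustrative values only; constants and log factors are ignored.
n = 1_000_000
for dim in (2, 10, 50):
    offline = n ** offline_exponent(dim)
    active = n ** active_exponent(dim)
    print(f"D={dim:3d}  offline ~ {offline:12.0f}  active ~ {active:12.0f}")
```

Since $(D-1)(D+1) = D^2 - 1 < D^2$, the active exponent is strictly smaller for every $D \ge 1$, although the two exponents both approach 1 as $D$ grows.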

Federated Learning via Posterior Averaging: A New Perspective and Practical Algorithms

Oct 11, 2020

Maruan Al-Shedivat, Jennifer Gillenwater, Eric Xing, Afshin Rostamizadeh

Abstract:Federated learning is typically approached as an optimization problem, where the goal is to minimize a global loss function by distributing computation across client devices that possess local data and specify different parts of the global objective. We present an alternative perspective and formulate federated learning as a posterior inference problem, where the goal is to infer a global posterior distribution by having client devices each infer the posterior of their local data. While exact inference is often intractable, this perspective provides a principled way to search for global optima in federated settings. Further, starting with the analysis of federated quadratic objectives, we develop a computation- and communication-efficient approximate posterior inference algorithm -- federated posterior averaging (FedPA). Our algorithm uses MCMC for approximate inference of local posteriors on the clients and efficiently communicates their statistics to the server, where the latter uses them to refine a global estimate of the posterior mode. Finally, we show that FedPA generalizes federated averaging (FedAvg), can similarly benefit from adaptive optimizers, and yields state-of-the-art results on four realistic and challenging benchmarks, converging faster, to better optima.

* 27 pages, 7 figures

An Analysis of SVD for Deep Rotation Estimation

Jun 25, 2020

Jake Levinson, Carlos Esteves, Kefan Chen, Noah Snavely, Angjoo Kanazawa, Afshin Rostamizadeh, Ameesh Makadia

Abstract:Symmetric orthogonalization via SVD, and closely related procedures, are well-known techniques for projecting matrices onto $O(n)$ or $SO(n)$. These tools have long been used for applications in computer vision, for example optimal 3D alignment problems solved by orthogonal Procrustes, rotation averaging, or Essential matrix decomposition. Despite its utility in different settings, SVD orthogonalization as a procedure for producing rotation matrices is typically overlooked in deep learning models, where the preferences tend toward classic representations like unit quaternions, Euler angles, and axis-angle, or more recently-introduced methods. Despite the importance of 3D rotations in computer vision and robotics, a single universally effective representation is still missing. Here, we explore the viability of SVD orthogonalization for 3D rotations in neural networks. We present a theoretical analysis that shows SVD is the natural choice for projecting onto the rotation group. Our extensive quantitative analysis shows simply replacing existing representations with the SVD orthogonalization procedure obtains state of the art performance in many deep learning applications covering both supervised and unsupervised training.
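
The projection described in the SVD abstract can be sketched in a few lines of NumPy. This is a minimal illustration of the standard construction (take the SVD, then flip the sign of the last singular direction when needed so the determinant is +1), not the authors' released code. The function name `svd_orthogonalize` is an assumption made for this example.

```python
import numpy as np


def svd_orthogonalize(m: np.ndarray) -> np.ndarray:
    """Map a 3x3 matrix to a rotation matrix (an element of SO(3)) via SVD."""
    u, _, vt = np.linalg.svd(m)
    # u @ vt is orthogonal, so its determinant is +1 or -1.
    # Flipping the last singular direction when it is -1 yields det = +1.
    det = np.linalg.det(u @ vt)
    correction = np.diag([1.0, 1.0, np.sign(det)])
    return u @ correction @ vt


# Usage: treat an arbitrary 3x3 matrix (e.g. a raw network output) as input.
rng = np.random.default_rng(0)
raw = rng.normal(size=(3, 3))
rotation = svd_orthogonalize(raw)
```

In a network, the nine raw outputs would be reshaped to 3x3 and passed through this map; a differentiable SVD from the training framework would replace `np.linalg.svd` in that setting.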

Combining MixMatch and Active Learning for Better Accuracy with Fewer Labels

Dec 03, 2019

Shuang Song, David Berthelot, Afshin Rostamizadeh

Abstract:We propose using active learning based techniques to further improve the state-of-the-art semi-supervised learning MixMatch algorithm. We provide a thorough empirical evaluation of several active-learning and baseline methods, which successfully demonstrate a significant improvement on the benchmark CIFAR-10, CIFAR-100, and SVHN datasets (as much as 1.5% in absolute accuracy). We also provide an empirical analysis of the cost trade-off between incrementally gathering more labeled versus unlabeled data. This analysis can be used to measure the relative value of labeled/unlabeled data at different points of the learning curve, where we find that although the incremental value of labeled data can be as much as 20x that of unlabeled, it quickly diminishes to less than 3x once more than 2,000 labeled examples are observed. Code can be found at https://github.com/google-research/mma.

The Practical Challenges of Active Learning: Lessons Learned from Live Experimentation

Jun 28, 2019

Jean-François Kagy, Tolga Kayadelen, Ji Ma, Afshin Rostamizadeh, Jana Strnadova

Abstract:We tested in a live setting the use of active learning for selecting text sentences for human annotations used in training a Thai segmentation machine learning model. In our study, two concurrent annotated samples were constructed, one through random sampling of sentences from a text corpus, and the other through model-based scoring and ranking of sentences from the same corpus. In the course of the experiment, we observed the effect of significant changes to the learning environment which are likely to occur in real-world learning tasks. We describe how our active learning strategy interacted with these events and discuss other practical challenges encountered in using active learning in the live setting.

* Presented at 2019 ICML Workshop on Human in the Loop Learning (HILL 2019), Long Beach, USA

SysML: The New Frontier of Machine Learning Systems

May 01, 2019

Alexander Ratner, Dan Alistarh, Gustavo Alonso, David G. Andersen, Peter Bailis, Sarah Bird, Nicholas Carlini, Bryan Catanzaro, Jennifer Chayes, Eric Chung (+59 more)

Abstract:Machine learning (ML) techniques are enjoying rapidly increasing adoption. However, designing and implementing the systems that support ML models in real-world deployments remains a significant obstacle, in large part due to the radically different development and deployment profile of modern ML methods, and the range of practical concerns that come with broader adoption. We propose to foster a new systems machine learning research community at the intersection of the traditional systems and ML communities, focused on topics such as hardware systems for ML, software systems for ML, and ML optimized for metrics beyond predictive accuracy. To do this, we describe a new conference, SysML, that explicitly targets research at the intersection of systems and machine learning with a program committee split evenly between experts in systems and ML, and an explicit focus on topics at the intersection of the two.

Categorical Feature Compression via Submodular Optimization

Apr 30, 2019

MohammadHossein Bateni, Lin Chen, Hossein Esfandiari, Thomas Fu, Vahab S. Mirrokni, Afshin Rostamizadeh

Abstract:In the era of big data, learning from categorical features with very large vocabularies (e.g., 28 million for the Criteo click prediction dataset) has become a practical challenge for machine learning researchers and practitioners. We design a highly-scalable vocabulary compression algorithm that seeks to maximize the mutual information between the compressed categorical feature and the target binary labels and we furthermore show that its solution is guaranteed to be within a $1-1/e \approx 63\%$ factor of the global optimal solution. To achieve this, we introduce a novel re-parametrization of the mutual information objective, which we prove is submodular, and design a data structure to query the submodular function in amortized $O(\log n )$ time (where $n$ is the input vocabulary size). Our complete algorithm is shown to operate in $O(n \log n )$ time. Additionally, we design a distributed implementation in which the query data structure is decomposed across $O(k)$ machines such that each machine only requires $O(\frac n k)$ space, while still preserving the approximation guarantee and using only logarithmic rounds of computation. We also provide analysis of simple alternative heuristic compression methods to demonstrate they cannot achieve any approximation guarantee. Using the large-scale Criteo learning task, we demonstrate better performance in retaining mutual information and also verify competitive learning performance compared to other baseline methods.

* Accepted to ICML 2019. Authors are listed in alphabetical order

Massively Parallel Hyperparameter Tuning

Oct 17, 2018

Liam Li, Kevin Jamieson, Afshin Rostamizadeh, Ekaterina Gonina, Moritz Hardt, Benjamin Recht, Ameet Talwalkar

Abstract:Modern learning models are characterized by large hyperparameter spaces. In order to adequately explore these large spaces, we must evaluate a large number of configurations, typically orders of magnitude more configurations than available parallel workers. Given the growing costs of model training, we would ideally like to perform this search in roughly the same wall-clock time needed to train a single model. In this work, we tackle this challenge by introducing ASHA, a simple and robust hyperparameter tuning algorithm with solid theoretical underpinnings that exploits parallelism and aggressive early-stopping. Our extensive empirical results show that ASHA slightly outperforms Fabolas and Population Based Tuning, state-of-the-art hyperparameter tuning methods; scales linearly with the number of workers in distributed settings; converges to a high quality configuration in half the time taken by Vizier (Google's internal hyperparameter tuning service) in an experiment with 500 workers; and beats the published result for a near state-of-the-art LSTM architecture in under 2x the time to train a single model.

* Corrected typo in Algorithm 1

The Sparse Recovery Autoencoder

Jul 05, 2018

Shanshan Wu, Alexandros G. Dimakis, Sujay Sanghavi, Felix X. Yu, Daniel Holtmann-Rice, Dmitry Storcheus, Afshin Rostamizadeh, Sanjiv Kumar

Abstract:Linear encoding of sparse vectors is widely popular, but is most commonly data-independent -- missing any possible extra (but a-priori unknown) structure beyond sparsity. In this paper we present a new method to learn linear encoders that adapt to data, while still performing well with the widely used $\ell_1$ decoder. The convex $\ell_1$ decoder prevents gradient propagation as needed in standard autoencoder training. Our method is based on the insight that unfolding the convex decoder into $T$ projected gradient steps can address this issue. Our method can be seen as a data-driven way to learn a compressed sensing matrix. Our experiments show that there is indeed additional structure beyond sparsity in several real datasets. Our autoencoder is able to discover it and exploit it to create excellent reconstructions with fewer measurements compared to the previous state of the art methods.

* 23 pages, 8 figures

Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization

Jun 18, 2018

Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, Ameet Talwalkar

Abstract:Performance of machine learning algorithms depends critically on identifying a good set of hyperparameters. While recent approaches use Bayesian optimization to adaptively select configurations, we focus on speeding up random search through adaptive resource allocation and early-stopping. We formulate hyperparameter optimization as a pure-exploration non-stochastic infinite-armed bandit problem where a predefined resource like iterations, data samples, or features is allocated to randomly sampled configurations. We introduce a novel algorithm, Hyperband, for this framework and analyze its theoretical properties, providing several desirable guarantees. Furthermore, we compare Hyperband with popular Bayesian optimization methods on a suite of hyperparameter optimization problems. We observe that Hyperband can provide over an order-of-magnitude speedup over our competitor set on a variety of deep-learning and kernel-based learning problems.

* Journal of Machine Learning Research 18 (2018) 1-52
* Changes: - Updated to JMLR version
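
The adaptive resource allocation and early stopping mentioned in the Hyperband abstract can be illustrated with a single round-based elimination loop, commonly called successive halving. The sketch below is a simplified illustration under stated assumptions, not the full Hyperband algorithm, which runs several such loops with different trade-offs between the number of configurations and the resource given to each. The names `successive_halving` and `toy_loss`, the reduction factor `eta=3`, and the toy objective are all hypothetical choices for this example.

```python
import random


def successive_halving(configs, evaluate, min_resource=1, eta=3):
    """Keep the best 1/eta of the configurations each round, then give
    the survivors eta times more resource, until one remains."""
    survivors = list(configs)
    resource = min_resource
    while len(survivors) > 1:
        # Lower loss is better; rank survivors at the current budget.
        ranked = sorted(survivors, key=lambda c: evaluate(c, resource))
        keep = max(1, len(ranked) // eta)
        survivors = ranked[:keep]
        resource *= eta
    return survivors[0]


def toy_loss(config, resource):
    """Hypothetical objective: loss shrinks as more resource is spent
    and is smallest for configurations near 0.3."""
    return (config - 0.3) ** 2 + 1.0 / resource


# Randomly sampled configurations, as in random search.
rng = random.Random(0)
configs = [rng.random() for _ in range(27)]
best = successive_halving(configs, toy_loss)
```

With 27 starting configurations and `eta=3`, the loop evaluates 27, then 9, then 3 configurations at resources 1, 3, and 9, so most of the budget goes to the configurations that looked promising early.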
