Surbhi Goel

Understanding Contrastive Learning Requires Incorporating Inductive Biases

Feb 28, 2022

Nikunj Saunshi, Jordan Ash, Surbhi Goel, Dipendra Misra, Cyril Zhang, Sanjeev Arora, Sham Kakade, Akshay Krishnamurthy

Abstract: Contrastive learning is a popular form of self-supervised learning that encourages augmentations (views) of the same input to have more similar representations compared to augmentations of different inputs. Recent attempts to theoretically explain the success of contrastive learning on downstream classification tasks prove guarantees depending on properties of augmentations and the value of contrastive loss of representations. We demonstrate that such analyses, which ignore inductive biases of the function class and training algorithm, cannot adequately explain the success of contrastive learning, even provably leading to vacuous guarantees in some settings. Extensive experiments on image and text domains highlight the ubiquity of this problem: different function classes and algorithms behave very differently on downstream tasks, despite having the same augmentations and contrastive losses. Theoretical analysis is presented for the class of linear representations, where incorporating inductive biases of the function class allows contrastive learning to work with less stringent conditions compared to prior analyses.

Anti-Concentrated Confidence Bonuses for Scalable Exploration

Oct 21, 2021

Jordan T. Ash, Cyril Zhang, Surbhi Goel, Akshay Krishnamurthy, Sham Kakade

Abstract: Intrinsic rewards play a central role in handling the exploration-exploitation trade-off when designing sequential decision-making algorithms, in both foundational theory and state-of-the-art deep reinforcement learning. The LinUCB algorithm, a centerpiece of the stochastic linear bandits literature, prescribes an elliptical bonus which addresses the challenge of leveraging shared information in large action spaces. This bonus scheme cannot be directly transferred to high-dimensional exploration problems, however, due to the computational cost of maintaining the inverse covariance matrix of action features. We introduce anti-concentrated confidence bounds for efficiently approximating the elliptical bonus, using an ensemble of regressors trained to predict random noise from policy network-derived features. Using this approximation, we derive stochastic linear bandit algorithms that achieve $\tilde O(d \sqrt{T})$ regret bounds for $\mathrm{poly}(d)$ fixed actions. We develop a practical variant for deep reinforcement learning that is competitive with contemporary intrinsic reward heuristics on Atari benchmarks.
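
For readers unfamiliar with the elliptical bonus mentioned above, here is a minimal sketch of the standard LinUCB-style quantity sqrt(x^T Sigma^{-1} x), with the inverse covariance maintained by Sherman-Morrison rank-one updates. It illustrates the exact bonus whose upkeep the abstract describes as costly; it is not the paper's ensemble-based approximation. The function names, the ridge parameter of 1, and the 3-dimensional features are assumptions chosen for illustration.

```python
import math

def identity(d, scale=1.0):
    """d x d identity matrix times `scale`, as a list of rows."""
    return [[scale if i == j else 0.0 for j in range(d)] for i in range(d)]

def matvec(A, x):
    """Matrix-vector product A x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def elliptical_bonus(cov_inv, x):
    """sqrt(x^T Sigma^{-1} x): large when x lies in a rarely observed direction."""
    return math.sqrt(sum(a * b for a, b in zip(x, matvec(cov_inv, x))))

def sherman_morrison_update(cov_inv, x):
    """Return (Sigma + x x^T)^{-1} given Sigma^{-1}, in O(d^2) time."""
    u = matvec(cov_inv, x)  # Sigma^{-1} x (Sigma is symmetric)
    denom = 1.0 + sum(a * b for a, b in zip(x, u))
    d = len(x)
    return [[cov_inv[i][j] - u[i] * u[j] / denom for j in range(d)]
            for i in range(d)]

# Hypothetical example: start from Sigma = I (ridge parameter 1),
# then observe the same feature direction nine times.
cov_inv = identity(3)
seen = [1.0, 0.0, 0.0]
for _ in range(9):
    cov_inv = sherman_morrison_update(cov_inv, seen)

bonus_seen = elliptical_bonus(cov_inv, seen)                # shrinks with data
bonus_unseen = elliptical_bonus(cov_inv, [0.0, 1.0, 0.0])   # stays at 1
```

Even with rank-one updates, each step costs O(d^2) time and memory in the feature dimension d, which is the bottleneck the abstract points to for high-dimensional features.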

Inductive Biases and Variable Creation in Self-Attention Mechanisms

Oct 19, 2021

Benjamin L. Edelman, Surbhi Goel, Sham Kakade, Cyril Zhang

Abstract: Self-attention, an architectural motif designed to model long-range interactions in sequential data, has driven numerous recent breakthroughs in natural language processing and beyond. This work provides a theoretical analysis of the inductive biases of self-attention modules, where our focus is to rigorously establish which functions and long-range dependencies self-attention blocks prefer to represent. Our main result shows that bounded-norm Transformer layers create sparse variables: they can represent sparse functions of the input sequence, with sample complexity scaling only logarithmically with the context length. Furthermore, we propose new experimental protocols to support this analysis and to guide the practice of training Transformers, built around the large body of work on provably learning sparse Boolean functions.

Statistical Estimation from Dependent Data

Jul 20, 2021

Yuval Dagan, Constantinos Daskalakis, Nishanth Dikkala, Surbhi Goel, Anthimos Vardis Kandiros

Abstract: We consider a general statistical estimation problem wherein binary labels across different observations are not independent conditioned on their feature vectors, but dependent, capturing settings where, e.g., these observations are collected on a spatial domain, a temporal domain, or a social network, which induce dependencies. We model these dependencies in the language of Markov Random Fields and, importantly, allow these dependencies to be substantial, i.e., we do not assume that the Markov Random Field capturing these dependencies is in high temperature. As our main contribution we provide algorithms and statistically efficient estimation rates for this model, giving several instantiations of our bounds in logistic regression, sparse logistic regression, and neural network settings with dependent data. Our estimation guarantees follow from novel results for estimating the parameters (i.e., external fields and interaction strengths) of Ising models from a single sample. We evaluate our estimation approach on real networked data, showing that it outperforms standard regression approaches that ignore dependencies, across three text classification datasets: Cora, Citeseer and Pubmed.

* 41 pages, ICML 2021

Investigating the Role of Negatives in Contrastive Representation Learning

Jun 18, 2021

Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, Dipendra Misra

Abstract: Noise contrastive learning is a popular technique for unsupervised representation learning. In this approach, a representation is obtained via reduction to supervised learning, where given a notion of semantic similarity, the learner tries to distinguish a similar (positive) example from a collection of random (negative) examples. The success of modern contrastive learning pipelines relies on many parameters such as the choice of data augmentation, the number of negative examples, and the batch size; however, there is limited understanding as to how these parameters interact and affect downstream performance. We focus on disambiguating the role of one of these parameters: the number of negative examples. Theoretically, we show the existence of a collision-coverage trade-off suggesting that the optimal number of negative examples should scale with the number of underlying concepts in the data. Empirically, we scrutinize the role of the number of negatives in both NLP and vision tasks. In the NLP task, we find that the results broadly agree with our theory, while our vision experiments are murkier with performance sometimes even being insensitive to the number of negatives. We discuss plausible explanations for this behavior and suggest future directions to better align theory and practice.
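
To make the "distinguish a positive from k negatives" reduction concrete, here is a minimal sketch of one common softmax-style contrastive loss: the cross-entropy of picking the positive out of 1 + k candidates scored by inner product. The exact loss used in the paper may differ; the function name `contrastive_loss`, the inner-product scoring, and the 2-dimensional toy vectors are assumptions for illustration only.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def contrastive_loss(anchor, positive, negatives):
    """Cross-entropy for picking the positive out of 1 + k candidates."""
    scores = [dot(anchor, positive)] + [dot(anchor, n) for n in negatives]
    m = max(scores)  # subtract the max for a numerically stable log-sum-exp
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_norm - scores[0]

# Hypothetical toy representations.
anchor = [1.0, 0.0]
positive = [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]

loss_one = contrastive_loss(anchor, positive, negatives[:1])  # k = 1
loss_two = contrastive_loss(anchor, positive, negatives)      # k = 2
```

In this form, each extra negative adds a positive term to the normalizer, so for fixed representations the loss can only grow with k. How k affects the learned representation and downstream accuracy is the question the abstract studies.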

Gone Fishing: Neural Active Learning with Fisher Embeddings

Jun 17, 2021

Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, Sham Kakade

Abstract: There is an increasing need for effective active learning algorithms that are compatible with deep neural networks. While there are many classic, well-studied sample selection methods, the non-convexity and varying internal representation of neural models make it unclear how to extend these approaches. This article introduces BAIT, a practical, tractable, and high-performing active learning algorithm for neural networks that addresses these concerns. BAIT draws inspiration from the theoretical analysis of maximum likelihood estimators (MLE) for parametric models. It selects batches of samples by optimizing a bound on the MLE error in terms of the Fisher information, which we show can be implemented efficiently at scale by exploiting linear-algebraic structure especially amenable to execution on modern hardware. Our experiments show that BAIT outperforms the previous state of the art on both classification and regression problems, and is flexible enough to be used with a variety of model architectures.

Acceleration via Fractal Learning Rate Schedules

Mar 01, 2021

Naman Agarwal, Surbhi Goel, Cyril Zhang

Abstract: When balancing the practical tradeoffs of iterative methods for large-scale optimization, the learning rate schedule remains notoriously difficult to understand and expensive to tune. We demonstrate the presence of these subtleties even in the innocuous case when the objective is a convex quadratic. We reinterpret an iterative algorithm from the numerical analysis literature as what we call the Chebyshev learning rate schedule for accelerating vanilla gradient descent, and show that the problem of mitigating instability leads to a fractal ordering of step sizes. We provide some experiments and discussion to challenge current understandings of the "edge of stability" in deep learning: even in simple settings, provable acceleration can be obtained by making negative local progress on the objective.
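
As a rough illustration of the idea, the sketch below runs gradient descent on a diagonal convex quadratic using step sizes equal to the inverses of the Chebyshev nodes on the eigenvalue range [m, M], and compares the result against the constant step size 2/(M + m). The steps are applied in their natural order; the fractal ordering studied in the paper is not reproduced here. The eigenvalues, horizon T = 16, and starting point are assumptions chosen for the example.

```python
import math

def chebyshev_step_sizes(m, M, T):
    """Inverse Chebyshev nodes on [m, M], in natural (not fractal) order."""
    mid, rad = (M + m) / 2.0, (M - m) / 2.0
    return [1.0 / (mid - rad * math.cos((2 * t + 1) * math.pi / (2 * T)))
            for t in range(T)]

def gradient_descent(eigs, x0, step_sizes):
    """GD on f(x) = 0.5 * sum_i eigs[i] * x[i]^2; returns the final ||x||."""
    x = list(x0)
    for eta in step_sizes:
        # The gradient of f at x is (eigs[i] * x[i])_i.
        x = [xi - eta * lam * xi for xi, lam in zip(x, eigs)]
    return math.sqrt(sum(xi * xi for xi in x))

# Hypothetical problem: 10 eigenvalues spread over [1, 100].
m, M, T = 1.0, 100.0, 16
eigs = [1.0 + 11.0 * i for i in range(10)]
x0 = [1.0] * 10

cheb_err = gradient_descent(eigs, x0, chebyshev_step_sizes(m, M, T))
const_err = gradient_descent(eigs, x0, [2.0 / (M + m)] * T)
```

Several of the Chebyshev steps exceed 2/M, so individual iterations can increase the objective even though the distance to the minimizer after all T steps is smaller than with the constant step size. With longer horizons the order of the steps starts to matter numerically, which is the instability the abstract refers to.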

Tight Hardness Results for Training Depth-2 ReLU Networks

Nov 27, 2020

Surbhi Goel, Adam Klivans, Pasin Manurangsi, Daniel Reichman

Abstract: We prove several hardness results for training depth-2 neural networks with the ReLU activation function; these networks are simply weighted sums (that may include negative coefficients) of ReLUs. Our goal is to output a depth-2 neural network that minimizes the square loss with respect to a given training set. We prove that this problem is NP-hard already for a network with a single ReLU. We also prove NP-hardness for outputting a weighted sum of $k$ ReLUs minimizing the squared error (for $k>1$) even in the realizable setting (i.e., when the labels are consistent with an unknown depth-2 ReLU network). We are also able to obtain lower bounds on the running time in terms of the desired additive error $\epsilon$. To obtain our lower bounds, we use the Gap Exponential Time Hypothesis (Gap-ETH) as well as a new hypothesis regarding the hardness of approximating the well-known Densest $\kappa$-Subgraph problem in subexponential time (these hypotheses are used separately in proving different lower bounds). For example, we prove that under reasonable hardness assumptions, any proper learning algorithm for finding the best fitting ReLU must run in time exponential in $1/\epsilon^2$. Together with a previous work regarding improperly learning a ReLU (Goel et al., COLT'17), this implies the first separation between proper and improper algorithms for learning a ReLU. We also study the problem of properly learning a depth-2 network of ReLUs with bounded weights, giving new (worst-case) upper bounds on the running time needed to learn such networks both in the realizable and agnostic settings. Our upper bounds on the running time essentially match our lower bounds in terms of the dependency on $\epsilon$.
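
The objective in question can be written down in a few lines. The sketch below evaluates a depth-2 ReLU network as a weighted sum of ReLUs and computes its mean square loss on a training set; the names `depth2_relu` and `square_loss`, the averaging convention, and the toy data are assumptions for illustration. The example is realizable in the sense used above: the labels |x| are produced exactly by a network with k = 2 units.

```python
def relu(z):
    """Rectified linear unit."""
    return max(0.0, z)

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def depth2_relu(x, weights, coeffs):
    """Weighted sum of ReLUs: sum_j coeffs[j] * relu(<weights[j], x>)."""
    return sum(a * relu(dot(w, x)) for a, w in zip(coeffs, weights))

def square_loss(data, weights, coeffs):
    """Mean of squared residuals over (x, y) training pairs."""
    return sum((depth2_relu(x, weights, coeffs) - y) ** 2
               for x, y in data) / len(data)

# Hypothetical one-dimensional training set labeled by |x| = relu(x) + relu(-x).
data = [([v], abs(v)) for v in (-2.0, -0.5, 0.0, 1.0, 3.0)]

two_unit_loss = square_loss(data, weights=[[1.0], [-1.0]], coeffs=[1.0, 1.0])
one_unit_loss = square_loss(data, weights=[[1.0]], coeffs=[1.0])
```

Evaluating the loss for given weights is easy; the hardness results concern finding weights that minimize it.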

* To appear in ITCS'21

From Boltzmann Machines to Neural Networks and Back Again

Jul 25, 2020

Surbhi Goel, Adam Klivans, Frederic Koehler

Abstract: Graphical models are powerful tools for modeling high-dimensional data, but learning graphical models in the presence of latent variables is well known to be difficult. In this work we give new results for learning Restricted Boltzmann Machines, probably the most well-studied class of latent variable models. Our results are based on new connections to learning two-layer neural networks under $\ell_{\infty}$ bounded input; for both problems, we give nearly optimal results under the conjectured hardness of sparse parity with noise. Using the connection between RBMs and feedforward networks, we also initiate the theoretical study of supervised RBMs [Hinton, 2012], a version of neural-network learning that couples distributional assumptions induced from the underlying graphical model with the architecture of the unknown function class. We then give an algorithm for learning a natural class of supervised RBMs with better runtime than what is possible for its related class of networks without distributional assumptions.

Statistical-Query Lower Bounds via Functional Gradients

Jun 29, 2020

Surbhi Goel, Aravind Gollakota, Adam Klivans

Abstract: We give the first statistical-query lower bounds for agnostically learning any non-polynomial activation with respect to Gaussian marginals (e.g., ReLU, sigmoid, sign). For the specific problem of ReLU regression (equivalently, agnostically learning a ReLU), we show that any statistical-query algorithm with tolerance $n^{-\Theta(\epsilon^{-1/2})}$ must use at least $2^{n^c} \epsilon$ queries for some constant $c > 0$, where $n$ is the dimension and $\epsilon$ is the accuracy parameter. Our results rule out general (as opposed to correlational) SQ learning algorithms, which is unusual for real-valued learning problems. Our techniques involve a gradient boosting procedure for "amplifying" recent lower bounds due to Diakonikolas et al. (COLT 2020) and Goel et al. (ICML 2020) on the SQ dimension of functions computed by two-layer neural networks. The crucial new ingredient is the use of a nonstandard convex functional during the boosting procedure. This also yields a best-possible reduction between two commonly studied models of learning: agnostic learning and probabilistic concepts.
