Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Quanquan Gu

Agnostic Learning of Halfspaces with Gradient Descent via Soft Margins

Oct 01, 2020

Spencer Frei, Yuan Cao, Quanquan Gu

Figure 1 for Agnostic Learning of Halfspaces with Gradient Descent via Soft Margins

Abstract:We analyze the properties of gradient descent on convex surrogates for the zero-one loss for the agnostic learning of linear halfspaces. If $\mathsf{OPT}$ is the best classification error achieved by a halfspace, by appealing to the notion of soft margins we are able to show that gradient descent finds halfspaces with classification error $\tilde O(\mathsf{OPT}^{1/2}) + \varepsilon$ in $\mathrm{poly}(d,1/\varepsilon)$ time and sample complexity for a broad class of distributions that includes log-concave isotropic distributions as a subclass. Along the way we answer a question recently posed by Ji et al. (2020) on how the tail behavior of a loss function can affect sample complexity and runtime guarantees for gradient descent.

* 24 pages, 1 table

Via

Access Paper or Ask Questions

Provably Efficient Reinforcement Learning for Discounted MDPs with Feature Mapping

Jun 23, 2020

Dongruo Zhou, Jiafan He, Quanquan Gu

Figure 1 for Provably Efficient Reinforcement Learning for Discounted MDPs with Feature Mapping

Abstract:Modern tasks in reinforcement learning are always with large state and action spaces. To deal with them efficiently, one often uses predefined feature mapping to represents states and actions in a low dimensional space. In this paper, we study reinforcement learning with feature mapping for discounted Markov Decision Processes (MDPs). We propose a novel algorithm which makes use of the feature mapping and obtains a $\tilde O(d\sqrt{T}/(1-\gamma)^2)$ regret, where $d$ is the dimension of the feature space, $T$ is the time horizon and $\gamma$ is the discount factor of the MDP. To the best of our knowledge, this is the first polynomial regret bound without accessing to a generative model or making strong assumptions such as ergodicity of the MDP. By constructing a special class of MDPs, we also show that for any algorithms, the regret is lower bounded by $\Omega(d\sqrt{T}/(1-\gamma)^{1.5})$. Our upper and lower bound results together suggest that the proposed reinforcement learning algorithm is near-optimal up to a $(1-\gamma)^{-0.5}$ factor.

* 28 pages, 1 figure

Via

Access Paper or Ask Questions

RayS: A Ray Searching Method for Hard-label Adversarial Attack

Jun 23, 2020

Jinghui Chen, Quanquan Gu

Figure 1 for RayS: A Ray Searching Method for Hard-label Adversarial Attack

Figure 2 for RayS: A Ray Searching Method for Hard-label Adversarial Attack

Figure 3 for RayS: A Ray Searching Method for Hard-label Adversarial Attack

Figure 4 for RayS: A Ray Searching Method for Hard-label Adversarial Attack

Abstract:Deep neural networks are vulnerable to adversarial attacks. Among different attack settings, the most challenging yet the most practical one is the hard-label setting where the attacker only has access to the hard-label output (prediction label) of the target model. Previous attempts are neither effective enough in terms of attack success rate nor efficient enough in terms of query complexity under the widely used $L_\infty$ norm threat model. In this paper, we present the Ray Searching attack (RayS), which greatly improves the hard-label attack effectiveness as well as efficiency. Unlike previous works, we reformulate the continuous problem of finding the closest decision boundary into a discrete problem that does not require any zeroth-order gradient estimation. In the meantime, all unnecessary searches are eliminated via a fast check step. This significantly reduces the number of queries needed for our hard-label attack. Moreover, interestingly, we found that the proposed RayS attack can also be used as a sanity check for possible "falsely robust" models. On several recently proposed defenses that claim to achieve the state-of-the-art robust accuracy, our attack method demonstrates that the current white-box/black-box attacks could still give a false sense of security and the robust accuracy drop between the most popular PGD attack and RayS attack could be as large as $28\%$. We believe that our proposed RayS attack could help identify falsely robust models that beat most white-box/black-box attacks.

* 9 pages, 4 figures, 9 tables. In KDD 2020

Via

Access Paper or Ask Questions

Revisiting Membership Inference Under Realistic Assumptions

Jun 21, 2020

Bargav Jayaraman, Lingxiao Wang, David Evans, Quanquan Gu

Figure 1 for Revisiting Membership Inference Under Realistic Assumptions

Figure 2 for Revisiting Membership Inference Under Realistic Assumptions

Figure 3 for Revisiting Membership Inference Under Realistic Assumptions

Figure 4 for Revisiting Membership Inference Under Realistic Assumptions

Abstract:Membership inference attacks on models trained using machine learning have been shown to pose significant privacy risks. However, previous works on membership inference assume a balanced prior distribution where the adversary randomly chooses target records from a pool that has equal numbers of members and non-members. Such an assumption of balanced prior is unrealistic in practical scenarios. This paper studies membership inference attacks under more realistic assumptions. First, we consider skewed priors where a non-member is more likely to occur than a member record. For this, we use metric based on positive predictive value (PPV) in conjunction with membership advantage for privacy leakage evaluation, since PPV considers the prior. Second, we consider adversaries that can select inference thresholds according to their attack goals. For this, we develop a threshold selection procedure that improves inference attacks. We also propose a new membership inference attack called Merlin which outperforms previous attacks. Our experimental evaluation shows that while models trained without privacy mechanisms are vulnerable to membership inference attacks in balanced prior settings, there appears to be negligible privacy risk in the skewed prior setting. Code for our experiments can be found here: https://github.com/bargavj/EvaluatingDPML.

Via

Access Paper or Ask Questions

Optimization Theory for ReLU Neural Networks Trained with Normalization Layers

Jun 11, 2020

Yonatan Dukler, Quanquan Gu, Guido Montúfar

Figure 1 for Optimization Theory for ReLU Neural Networks Trained with Normalization Layers

Abstract:The success of deep neural networks is in part due to the use of normalization layers. Normalization layers like Batch Normalization, Layer Normalization and Weight Normalization are ubiquitous in practice, as they improve generalization performance and speed up training significantly. Nonetheless, the vast majority of current deep learning theory and non-convex optimization literature focuses on the un-normalized setting, where the functions under consideration do not exhibit the properties of commonly normalized neural networks. In this paper, we bridge this gap by giving the first global convergence result for two-layer neural networks with ReLU activations trained with a normalization layer, namely Weight Normalization. Our analysis shows how the introduction of normalization layers changes the optimization landscape and can enable faster convergence as compared with un-normalized neural networks.

* To be presented at ICML 2020

Via

Access Paper or Ask Questions

Agnostic Learning of a Single Neuron with Gradient Descent

Jun 11, 2020

Spencer Frei, Yuan Cao, Quanquan Gu

Figure 1 for Agnostic Learning of a Single Neuron with Gradient Descent

Figure 2 for Agnostic Learning of a Single Neuron with Gradient Descent

Figure 3 for Agnostic Learning of a Single Neuron with Gradient Descent

Abstract:We consider the problem of learning the best-fitting single neuron as measured by the expected square loss $\mathbb{E}_{(x,y)\sim \mathcal{D}}[(\sigma(w^\top x)-y)^2]$ over some unknown joint distribution $\mathcal{D}$ by using gradient descent to minimize the empirical risk induced by a set of i.i.d. samples $S\sim \mathcal{D}^n$. The activation function $\sigma$ is an arbitrary Lipschitz and non-decreasing function, making the optimization problem nonconvex and nonsmooth in general, and covers typical neural network activation functions and inverse link functions in the generalized linear model setting. In the agnostic PAC learning setting, where no assumption on the relationship between the labels $y$ and the input $x$ is made, if the optimal population risk is $\mathsf{OPT}$, we show that gradient descent achieves population risk $O(\mathsf{OPT}^{1/2})+\epsilon$ in polynomial time and sample complexity. When labels take the form $y = \sigma(v^\top x) + \xi$ for zero-mean sub-Gaussian noise $\xi$, we show that gradient descent achieves population risk $\mathsf{OPT} + \epsilon$. Our sample complexity and runtime guarantees are (almost) dimension independent, and when $\sigma$ is strictly increasing and Lipschitz, require no distributional assumptions beyond boundedness. For ReLU, we show the same results under a nondegeneracy assumption for the marginal distribution of the input. To the best of our knowledge, this is the first result for agnostic learning of a single neuron using gradient descent.

* 28 pages, 3 tables

Via

Access Paper or Ask Questions

A Finite Time Analysis of Two Time-Scale Actor Critic Methods

May 04, 2020

Yue Wu, Weitong Zhang, Pan Xu, Quanquan Gu

Abstract:Actor-critic (AC) methods have exhibited great empirical success compared with other reinforcement learning algorithms, where the actor uses the policy gradient to improve the learning policy and the critic uses temporal difference learning to estimate the policy gradient. Under the two time-scale learning rate schedule, the asymptotic convergence of AC has been well studied in the literature. However, the non-asymptotic convergence and finite sample complexity of actor-critic methods are largely open. In this work, we provide a non-asymptotic analysis for two time-scale actor-critic methods under non-i.i.d. setting. We prove that the actor-critic method is guaranteed to find a first-order stationary point (i.e., $\|\nabla J(\boldsymbol{\theta})\|_2^2 \le \epsilon$) of the non-concave performance function $J(\boldsymbol{\theta})$, with $\mathcal{\tilde{O}}(\epsilon^{-2.5})$ sample complexity. To the best of our knowledge, this is the first work providing finite-time analysis and sample complexity bound for two time-scale actor-critic methods.

* 43 pages

Via

Access Paper or Ask Questions

Exploring Private Federated Learning with Laplacian Smoothing

May 01, 2020

Zhicong Liang, Bao Wang, Quanquan Gu, Stanley Osher, Yuan Yao

Figure 1 for Exploring Private Federated Learning with Laplacian Smoothing

Figure 2 for Exploring Private Federated Learning with Laplacian Smoothing

Figure 3 for Exploring Private Federated Learning with Laplacian Smoothing

Figure 4 for Exploring Private Federated Learning with Laplacian Smoothing

Abstract:Federated learning aims to protect data privacy by collaboratively learning a model without sharing private data among users. However, an adversary may still be able to infer the private training data by attacking the released model. Differential privacy(DP) provides a statistical guarantee against such attacks, at a privacy of possibly degenerating the accuracy or utility of the trained models. In this paper, we apply a utility enhancement scheme based on Laplacian smoothing for differentially-private federated learning (DP-Fed-LS), where the parameter aggregation with injected Gaussian noise is improved in statistical precision. We provide tight closed-form privacy bounds for both uniform and Poisson subsampling and derive corresponding DP guarantees for differential private federated learning, with or without Laplacian smoothing. Experiments over MNIST, SVHN and Shakespeare datasets show that the proposed method can improve model accuracy with DP-guarantee under both subsampling mechanisms.

Via

Access Paper or Ask Questions

MOTS: Minimax Optimal Thompson Sampling

Mar 03, 2020

Tianyuan Jin, Pan Xu, Jieming Shi, Xiaokui Xiao, Quanquan Gu

Figure 1 for MOTS: Minimax Optimal Thompson Sampling

Figure 2 for MOTS: Minimax Optimal Thompson Sampling

Abstract:Thompson sampling is one of the most widely used algorithms for many online decision problems, due to its simplicity in implementation and superior empirical performance over other state-of-the-art methods. Despite its popularity and empirical success, it has remained an open problem whether Thompson sampling can achieve the minimax optimal regret $O(\sqrt{KT})$ for $K$-armed bandit problems, where $T$ is the total time horizon. In this paper, we solve this long open problem by proposing a new Thompson sampling algorithm called MOTS that adaptively truncates the sampling result of the chosen arm at each time step. We prove that this simple variant of Thompson sampling achieves the minimax optimal regret bound $O(\sqrt{KT})$ for finite time horizon $T$ and also the asymptotic optimal regret bound when $T$ grows to infinity as well. This is the first time that the minimax optimality of multi-armed bandit problems has been attained by Thompson sampling type of algorithms.

* 17 pages, 2 figures

Via

Access Paper or Ask Questions

On the Global Convergence of Training Deep Linear ResNets

Mar 02, 2020

Difan Zou, Philip M. Long, Quanquan Gu

Figure 1 for On the Global Convergence of Training Deep Linear ResNets

Abstract:We study the convergence of gradient descent (GD) and stochastic gradient descent (SGD) for training $L$-hidden-layer linear residual networks (ResNets). We prove that for training deep residual networks with certain linear transformations at input and output layers, which are fixed throughout training, both GD and SGD with zero initialization on all hidden weights can converge to the global minimum of the training loss. Moreover, when specializing to appropriate Gaussian random linear transformations, GD and SGD provably optimize wide enough deep linear ResNets. Compared with the global convergence result of GD for training standard deep linear networks (Du & Hu 2019), our condition on the neural network width is sharper by a factor of $O(\kappa L)$, where $\kappa$ denotes the condition number of the covariance matrix of the training data. We further propose a modified identity input and output transformations, and show that a $(d+k)$-wide neural network is sufficient to guarantee the global convergence of GD/SGD, where $d,k$ are the input and output dimensions respectively.

* 26 pages, 1 figure. In ICLR 2020

Via

Access Paper or Ask Questions