We consider the fundamental problem of ReLU regression, where the goal is to output the best-fitting ReLU with respect to square loss given access to draws from some unknown distribution. We give the first efficient, constant-factor approximation algorithm for this problem assuming the underlying distribution satisfies some weak concentration and anti-concentration conditions (a class that includes, for example, all log-concave distributions). This solves the main open problem of Goel et al., who proved hardness results for any exact algorithm for ReLU regression (up to an additive $\epsilon$). Using more sophisticated techniques, we can improve our results and obtain a polynomial-time approximation scheme for any subgaussian distribution. Given the aforementioned hardness results, these guarantees cannot be substantially improved. Our main insight is a new characterization of surrogate losses for nonconvex activations. While prior work had established the existence of convex surrogates for monotone activations, we show that properties of the underlying distribution actually induce strong convexity for the loss, allowing us to relate the global minimum to the activation's Chow parameters.
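For concreteness, the objective and one standard convex surrogate for monotone activations (the matching loss) can be written as below. The symbols $w$, $\mathcal{D}$, $\mathrm{err}$, and $\ell_{\mathrm{sur}}$ are notation assumed here for illustration; the abstract does not state which surrogate the analysis uses.

\[
  \mathrm{err}(w) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[(\mathrm{ReLU}(\langle w, x\rangle) - y)^2\big],
  \qquad \mathrm{ReLU}(z) = \max(0, z),
\]
\[
  \ell_{\mathrm{sur}}(w) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\Big[\int_0^{\langle w, x\rangle} \big(\mathrm{ReLU}(z) - y\big)\,dz\Big],
  \qquad
  \nabla \ell_{\mathrm{sur}}(w) = \mathbb{E}\big[\mathrm{ReLU}(\langle w, x\rangle)\,x\big] - \mathbb{E}[y\,x].
\]

Under this choice, a stationary point of $\ell_{\mathrm{sur}}$ is a $w$ whose ReLU matches the correlations $\mathbb{E}[y\,x]$, which is one way to read the connection to degree-one Chow parameters mentioned above.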
We study the problem of learning adversarially robust halfspaces in the distribution-independent setting. In the realizable setting, we provide necessary and sufficient conditions on the adversarial perturbation sets under which halfspaces are efficiently robustly learnable. In the presence of random label noise, we give a simple computationally efficient algorithm for this problem with respect to any $\ell_p$-perturbation.
We consider the problem of computing the best-fitting ReLU with respect to square-loss on a training set when the examples have been drawn according to a spherical Gaussian distribution (the labels can be arbitrary). Let $\mathsf{opt} < 1$ be the population loss of the best-fitting ReLU. We prove: 1. Finding a ReLU with square-loss $\mathsf{opt} + \epsilon$ is as hard as the problem of learning sparse parities with noise, widely thought to be computationally intractable. This is the first hardness result for learning a ReLU with respect to Gaussian marginals, and our results imply, {\em unconditionally}, that gradient descent cannot converge to the global minimum in polynomial time. 2. There exists an efficient approximation algorithm for finding the best-fitting ReLU that achieves error $O(\mathsf{opt}^{2/3})$. The algorithm uses a novel reduction to noisy halfspace learning with respect to $0/1$ loss. Prior work due to Soltanolkotabi [Sol17] showed that gradient descent can find the best-fitting ReLU with respect to Gaussian marginals if the training set is exactly labeled by a ReLU.
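The quantity being minimized can be illustrated with a minimal sketch of the empirical square loss of a ReLU on Gaussian examples. The function names, the sample size, and the weight vector `w_true` are hypothetical choices made for this illustration, and the loss is written without any normalization constant.

```python
import random


def relu(z):
    """ReLU activation: max(0, z)."""
    return max(0.0, z)


def empirical_square_loss(w, xs, ys):
    """Mean of (ReLU(<w, x>) - y)^2 over the sample."""
    total = 0.0
    for x, y in zip(xs, ys):
        pred = relu(sum(wi * xi for wi, xi in zip(w, x)))
        total += (pred - y) ** 2
    return total / len(xs)


def gaussian_sample(n, d, seed=0):
    """Draw n examples from a spherical Gaussian in d dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


# Illustrative data: labels produced exactly by a ReLU (the noiseless case).
xs = gaussian_sample(200, 3)
w_true = [1.0, -0.5, 0.25]  # hypothetical weight vector
ys_clean = [relu(sum(wi * xi for wi, xi in zip(w_true, x))) for x in xs]
```

On exactly labeled data the loss of `w_true` is zero; with arbitrary labels the minimum over `w` is the empirical counterpart of $\mathsf{opt}$.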
We study the problem of learning graphical models with latent variables. We give the first algorithm for learning locally consistent (ferromagnetic or antiferromagnetic) Restricted Boltzmann Machines (or RBMs) with {\em arbitrary} external fields. Our algorithm has optimal dependence on dimension in the sample complexity and run time; however, it suffers from a sub-optimal dependence on the underlying parameters of the RBM. Prior results have been established only for {\em ferromagnetic} RBMs with {\em consistent} external fields (signs must be the same) \cite{bresler2018learning}. The algorithm proposed there relies strongly on the concavity of magnetization, which does not hold in our setting. We show the following key structural property: even in the presence of an arbitrary external field, for any two observed nodes that share a common latent neighbor, the covariance is high. This enables us to design a simple greedy algorithm that maximizes covariance to iteratively build the neighborhood of each vertex.
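The greedy idea can be sketched in a heavily simplified form. The abstract does not specify the conditioning or the stopping rule, so the version below is an assumption made purely for illustration: it uses unconditional empirical covariance and a hypothetical threshold `tau`, and should not be read as the paper's algorithm.

```python
def empirical_cov(samples, i, j):
    """Empirical covariance between coordinates i and j."""
    n = len(samples)
    mean_i = sum(s[i] for s in samples) / n
    mean_j = sum(s[j] for s in samples) / n
    return sum((s[i] - mean_i) * (s[j] - mean_j) for s in samples) / n


def greedy_neighborhood(samples, u, tau):
    """Repeatedly add the observed node with the largest |covariance| with u.

    Stops once the best remaining candidate falls below the (hypothetical)
    threshold tau. Simplification: covariance is NOT conditioned on the
    nodes already selected.
    """
    d = len(samples[0])
    candidates = [v for v in range(d) if v != u]
    neighborhood = []
    while candidates:
        best = max(candidates, key=lambda v: abs(empirical_cov(samples, u, v)))
        if abs(empirical_cov(samples, u, best)) < tau:
            break
        neighborhood.append(best)
        candidates.remove(best)
    return neighborhood


# Toy +/-1 samples: nodes 0 and 1 always agree, node 2 is uncorrelated.
toy = [(1, 1, 1), (1, 1, -1), (-1, -1, 1), (-1, -1, -1)]
```

On the toy data, node 1 is selected as the neighbor of node 0, while node 2 has covariance zero with both and is left out.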
We consider the problem of learning the weighted edges of a mixture of two graphs from epidemic cascades. This is a natural setting in the context of social networks, where a post created by one user spreads on a different graph depending on whether it is about, say, basketball or politics. However, very little is known about whether this problem is solvable. To the best of our knowledge, we establish the first conditions under which this problem can be solved, and provide conditions under which the problem is provably not solvable. When the conditions are met, i.e., when the graphs are connected, with distinct edges, and have at least three edges, we give an efficient algorithm for learning the weights of both graphs with almost optimal sample complexity (up to log factors). We extend the results to the setting in which the priors of the mixture are unknown and obtain similar guarantees.
Giving provable guarantees for learning neural networks is a core challenge of machine learning theory. Most prior work gives parameter recovery guarantees for one-hidden-layer networks. In this work we study a two-layer network where the top node, instead of a sum (as in a one-hidden-layer network), is a well-behaved multivariate polynomial in all its inputs. We show that if the thresholds (biases) of the first-layer neurons are higher than $\Omega(\sqrt{\log d})$, where $d$ is the input dimension, then the weights are learnable under Gaussian input. Furthermore, even for lower thresholds, we can learn the lowest layer with polynomial sample complexity, albeit in exponential time. As an application of our results, we give a polynomial-time algorithm for learning an intersection of halfspaces that are $\Omega(\sqrt{\log d})$ far from the origin under a Gaussian input distribution. Finally, for deep networks with depth larger than two, assuming the layers from the second onward can be expressed as a polynomial by simply using the Taylor series, we can learn the lowest layer under the conditions required by our assumptions.
Recent work has shown that additive threat models, which only permit the addition of bounded noise to the pixels of an image, are insufficient for fully capturing the space of imperceivable adversarial examples. For example, small rotations and spatial transformations can fool classifiers and remain imperceivable to humans, yet have large additive distance from the original images. In this work, we leverage quantitative perceptual metrics like LPIPS and SSIM to define a novel threat model for adversarial attacks. To demonstrate the value of quantifying the perceptual distortion of adversarial examples, we present and employ a unifying framework fusing different attack styles. We first prove that our framework results in images that are unattainable by attack styles in isolation. We then perform adversarial training using attacks generated by our framework to demonstrate that networks are only robust to classes of adversarial perturbations they have been trained against, and that combination attacks are stronger than any of their individual components. Finally, we experimentally demonstrate that our combined attacks retain the same perceptual distortion but induce far higher misclassification rates when compared against individual attacks.
We give the first efficient algorithm for learning the structure of an Ising model that tolerates independent failures; that is, each entry of the observed sample is missing with some unknown probability $p$. Our algorithm matches the essentially optimal runtime and sample complexity bounds of recent work for learning Ising models due to Klivans and Meka (2017). We devise a novel unbiased estimator for the gradient of the Interaction Screening Objective (ISO) due to Vuffray et al. (2016) and apply a stochastic multiplicative gradient descent algorithm to minimize this objective. Solutions to this minimization recover the neighborhood information of the underlying Ising model on a node-by-node basis.
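As background, the complete-data form of the Interaction Screening Objective for a single node can be sketched as follows. This is an illustrative sketch under stated assumptions: external fields and regularization are omitted, the function and variable names are hypothetical, and the unbiased missing-data gradient estimator that the abstract describes is not reproduced here.

```python
import math


def iso_objective(samples, u, theta):
    """Complete-data ISO for node u (no external field, no regularizer).

    samples: sequence of +/-1 configurations.
    theta:   dict mapping each node v != u to a candidate edge weight.
    Returns the average of exp(-x_u * sum_v theta[v] * x_v).
    """
    total = 0.0
    for x in samples:
        field = sum(theta[v] * x[v] for v in theta)
        total += math.exp(-x[u] * field)
    return total / len(samples)


def iso_gradient(samples, u, theta):
    """Gradient of iso_objective with respect to each theta[v]."""
    grad = {v: 0.0 for v in theta}
    for x in samples:
        field = sum(theta[v] * x[v] for v in theta)
        weight = math.exp(-x[u] * field)
        for v in theta:
            grad[v] += -x[u] * x[v] * weight
    n = len(samples)
    return {v: g / n for v, g in grad.items()}


# Toy data: two spins that always agree.
pair = [(1, 1), (-1, -1)]
```

The objective is convex in `theta`, and minimizing it separately for each node `u` yields that node's estimated neighborhood weights.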
We propose a new algorithm to learn a one-hidden-layer convolutional neural network where both the convolutional weights and the output weights are parameters to be learned. Our algorithm works for a general class of (potentially overlapping) patches, including commonly used structures for computer vision tasks. Our algorithm draws ideas from (1) isotonic regression for learning neural networks and (2) landscape analysis of non-convex matrix factorization problems. We believe these findings may inspire further development in designing provable algorithms for learning neural networks and other complex models.
We give a polynomial-time algorithm for learning neural networks with one layer of sigmoids feeding into any Lipschitz, monotone activation function (e.g., sigmoid or ReLU). We make no assumptions on the structure of the network, and the algorithm succeeds with respect to {\em any} distribution on the unit ball in $n$ dimensions (hidden weight vectors also have unit norm). This is the first assumption-free, provably efficient algorithm for learning neural networks with two nonlinear layers. Our algorithm, {\em Alphatron}, is a simple, iterative update rule that combines isotonic regression with kernel methods. It outputs a hypothesis that yields efficient oracle access to interpretable features. It also suggests a new approach to Boolean learning problems via real-valued conditional-mean functions, sidestepping traditional hardness results from computational learning theory. Along these lines, we subsume and improve many longstanding results for PAC learning Boolean functions, extending them to the more general, real-valued setting of {\em probabilistic concepts}, a model that (unlike PAC learning) requires non-i.i.d. noise-tolerance.
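A kernelized additive update with a known monotone activation, in the spirit of the rule described above, can be sketched as follows. The abstract does not spell out the update, so the details here are assumptions for illustration: the degree-2 polynomial kernel, the sigmoid output activation, the step size `lr`, and all function names are hypothetical, and selecting the best iterate on held-out data is omitted.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def poly_kernel(x, xp, degree=2):
    """Polynomial kernel (1 + <x, x'>)^degree (an illustrative choice)."""
    return (1.0 + sum(a * b for a, b in zip(x, xp))) ** degree


def predict(alpha, xs, x, u=sigmoid, kernel=poly_kernel):
    """Hypothesis h(x) = u(sum_i alpha_i * K(x_i, x))."""
    return u(sum(a * kernel(xi, x) for a, xi in zip(alpha, xs)))


def fit(xs, ys, u=sigmoid, kernel=poly_kernel, lr=1.0, steps=100):
    """Additive update: alpha_i += (lr / m) * (y_i - h(x_i)) for every i."""
    m = len(xs)
    alpha = [0.0] * m
    for _ in range(steps):
        residuals = [y - predict(alpha, xs, x, u, kernel) for x, y in zip(xs, ys)]
        alpha = [a + (lr / m) * r for a, r in zip(alpha, residuals)]
    return alpha


def mean_square_error(alpha, xs, ys, u=sigmoid, kernel=poly_kernel):
    m = len(xs)
    return sum((predict(alpha, xs, x, u, kernel) - y) ** 2 for x, y in zip(xs, ys)) / m


# Toy data in the unit ball with real-valued labels in [0, 1].
toy_xs = [[0.5, 0.0], [-0.5, 0.0], [0.0, 0.5], [0.0, -0.5]]
toy_ys = [0.9, 0.1, 0.5, 0.5]
toy_alpha = fit(toy_xs, toy_ys)
```

Each step moves every coefficient by its own residual, so the update needs only kernel evaluations and the activation `u`, and never the gradient of `u`.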