We study the fundamental problem of ReLU regression, where the goal is to fit Rectified Linear Units (ReLUs) to data. This supervised learning task is efficiently solvable in the realizable setting, but is known to be computationally hard with adversarial label noise. In this work, we focus on ReLU regression in the Massart noise model, a natural and well-studied semi-random noise model. In this model, the label of every point is generated according to a function in the class, but an adversary is allowed to change this value arbitrarily with some probability, which is {\em at most} $\eta < 1/2$. We develop an efficient algorithm that achieves exact parameter recovery in this model under mild anti-concentration assumptions on the underlying distribution. Such assumptions are necessary for exact recovery to be information-theoretically possible. We demonstrate that our algorithm significantly outperforms naive applications of $\ell_1$ and $\ell_2$ regression on both synthetic and real data.
We study the problem of PAC learning halfspaces on $\mathbb{R}^d$ with Massart noise under Gaussian marginals. In the Massart noise model, an adversary is allowed to flip the label of each point $\mathbf{x}$ with probability $\eta(\mathbf{x}) \leq \eta$, for some parameter $\eta \in [0,1/2]$. The goal of the learner is to output a hypothesis with missclassification error $\mathrm{opt} + \epsilon$, where $\mathrm{opt}$ is the error of the target halfspace. Prior work studied this problem assuming that the target halfspace is homogeneous and that the parameter $\eta$ is strictly smaller than $1/2$. We explore how the complexity of the problem changes when either of these assumptions is removed, establishing the following threshold phenomena: For $\eta = 1/2$, we prove a lower bound of $d^{\Omega (\log(1/\epsilon))}$ on the complexity of any Statistical Query (SQ) algorithm for the problem, which holds even for homogeneous halfspaces. On the positive side, we give a new learning algorithm for arbitrary halfspaces in this regime with sample complexity and running time $O_\epsilon(1) \, d^{O(\log(1/\epsilon))}$. For $\eta <1/2$, we establish a lower bound of $d^{\Omega(\log(1/\gamma))}$ on the SQ complexity of the problem, where $\gamma = \max\{\epsilon, \min\{\mathbf{Pr}[f(\mathbf{x}) = 1], \mathbf{Pr}[f(\mathbf{x}) = -1]\} \}$ and $f$ is the target halfspace. In particular, this implies an SQ lower bound of $d^{\Omega (\log(1/\epsilon) )}$ for learning arbitrary Massart halfspaces (even for small constant $\eta$). We complement this lower bound with a new learning algorithm for this regime with sample complexity and runtime $d^{O_{\eta}(\log(1/\gamma))} \mathrm{poly}(1/\epsilon)$. Taken together, our results qualitatively characterize the complexity of learning halfspaces in the Massart model.
A Forster transform is an operation that turns a distribution into one with good anti-concentration properties. While a Forster transform does not always exist, we show that any distribution can be efficiently decomposed as a disjoint mixture of few distributions for which a Forster transform exists and can be computed efficiently. As the main application of this result, we obtain the first polynomial-time algorithm for distribution-independent PAC learning of halfspaces in the Massart noise model with strongly polynomial sample complexity, i.e., independent of the bit complexity of the examples. Previous algorithms for this learning problem incurred sample complexity scaling polynomially with the bit complexity, even though such a dependence is not information-theoretically necessary.
We study the problem of list-decodable linear regression, where an adversary can corrupt a majority of the examples. Specifically, we are given a set $T$ of labeled examples $(x, y) \in \mathbb{R}^d \times \mathbb{R}$ and a parameter $0< \alpha <1/2$ such that an $\alpha$-fraction of the points in $T$ are i.i.d. samples from a linear regression model with Gaussian covariates, and the remaining $(1-\alpha)$-fraction of the points are drawn from an arbitrary noise distribution. The goal is to output a small list of hypothesis vectors such that at least one of them is close to the target regression vector. Our main result is a Statistical Query (SQ) lower bound of $d^{\mathrm{poly}(1/\alpha)}$ for this problem. Our SQ lower bound qualitatively matches the performance of previously developed algorithms, providing evidence that current upper bounds for this task are nearly best possible.
We study the problem of list-decodable mean estimation, where an adversary can corrupt a majority of the dataset. Specifically, we are given a set $T$ of $n$ points in $\mathbb{R}^d$ and a parameter $0< \alpha <\frac 1 2$ such that an $\alpha$-fraction of the points in $T$ are i.i.d. samples from a well-behaved distribution $\mathcal{D}$ and the remaining $(1-\alpha)$-fraction of the points are arbitrary. The goal is to output a small list of vectors at least one of which is close to the mean of $\mathcal{D}$. As our main contribution, we develop new algorithms for list-decodable mean estimation, achieving nearly-optimal statistical guarantees, with running time $n^{1 + o(1)} d$. All prior algorithms for this problem had additional polynomial factors in $\frac 1 \alpha$. As a corollary, we obtain the first almost-linear time algorithms for clustering mixtures of $k$ separated well-behaved distributions, nearly-matching the statistical guarantees of spectral methods. Prior clustering algorithms inherently relied on an application of $k$-PCA, thereby incurring runtimes of $\Omega(n d k)$. This marks the first runtime improvement for this basic statistical problem in nearly two decades. The starting point of our approach is a novel and simpler near-linear time robust mean estimation algorithm in the $\alpha \to 1$ regime, based on a one-shot matrix multiplicative weights-inspired potential decrease. We crucially leverage this new algorithmic framework in the context of the iterative multi-filtering technique of Diakonikolas et. al. '18, '20, providing a method to simultaneously cluster and downsample points using one-dimensional projections -- thus, bypassing the $k$-PCA subroutines required by prior algorithms.
We study the problem of boosting the accuracy of a weak learner in the (distribution-independent) PAC model with Massart noise. In the Massart noise model, the label of each example $x$ is independently misclassified with probability $\eta(x) \leq \eta$, where $\eta<1/2$. The Massart model lies between the random classification noise model and the agnostic model. Our main positive result is the first computationally efficient boosting algorithm in the presence of Massart noise that achieves misclassification error arbitrarily close to $\eta$. Prior to our work, no non-trivial booster was known in this setting. Moreover, we show that this error upper bound is best possible for polynomial-time black-box boosters, under standard cryptographic assumptions. Our upper and lower bounds characterize the complexity of boosting in the distribution-independent PAC model with Massart noise. As a simple application of our positive result, we give the first efficient Massart learner for unions of high-dimensional rectangles.
We study the problem of agnostically learning halfspaces under the Gaussian distribution. Our main result is the {\em first proper} learning algorithm for this problem whose sample complexity and computational complexity qualitatively match those of the best known improper agnostic learner. Building on this result, we also obtain the first proper polynomial-time approximation scheme (PTAS) for agnostically learning homogeneous halfspaces. Our techniques naturally extend to agnostically learning linear models with respect to other non-linear activations, yielding in particular the first proper agnostic algorithm for ReLU regression.
We study the problem of agnostic learning under the Gaussian distribution. We develop a method for finding hard families of examples for a wide class of problems by using LP duality. For Boolean-valued concept classes, we show that the $L^1$-regression algorithm is essentially best possible, and therefore that the computational difficulty of agnostically learning a concept class is closely related to the polynomial degree required to approximate any function from the class in $L^1$-norm. Using this characterization along with additional analytic tools, we obtain optimal SQ lower bounds for agnostically learning linear threshold functions and the first non-trivial SQ lower bounds for polynomial threshold functions and intersections of halfspaces. We also develop an analogous theory for agnostically learning real-valued functions, and as an application prove near-optimal SQ lower bounds for agnostically learning ReLUs and sigmoids.
We study the problem of learning Ising models satisfying Dobrushin's condition in the outlier-robust setting where a constant fraction of the samples are adversarially corrupted. Our main result is to provide the first computationally efficient robust learning algorithm for this problem with near-optimal error guarantees. Our algorithm can be seen as a special case of an algorithm for robustly learning a distribution from a general exponential family. To prove its correctness for Ising models, we establish new anti-concentration results for degree-$2$ polynomials of Ising models that may be of independent interest.
We study the problem of testing the covariance matrix of a high-dimensional Gaussian in a robust setting, where the input distribution has been corrupted in Huber's contamination model. Specifically, we are given i.i.d. samples from a distribution of the form $Z = (1-\epsilon) X + \epsilon B$, where $X$ is a zero-mean and unknown covariance Gaussian $\mathcal{N}(0, \Sigma)$, $B$ is a fixed but unknown noise distribution, and $\epsilon>0$ is an arbitrarily small constant representing the proportion of contamination. We want to distinguish between the cases that $\Sigma$ is the identity matrix versus $\gamma$-far from the identity in Frobenius norm. In the absence of contamination, prior work gave a simple tester for this hypothesis testing task that uses $O(d)$ samples. Moreover, this sample upper bound was shown to be best possible, within constant factors. Our main result is that the sample complexity of covariance testing dramatically increases in the contaminated setting. In particular, we prove a sample complexity lower bound of $\Omega(d^2)$ for $\epsilon$ an arbitrarily small constant and $\gamma = 1/2$. This lower bound is best possible, as $O(d^2)$ samples suffice to even robustly {\em learn} the covariance. The conceptual implication of our result is that, for the natural setting we consider, robust hypothesis testing is at least as hard as robust estimation.