Yingyu Liang

Provable Guarantees for Neural Networks via Gradient Feature Learning

Oct 19, 2023
Zhenmei Shi, Junyi Wei, Yingyu Liang

Neural networks have achieved remarkable empirical performance, yet the current theoretical analysis is not adequate for understanding their success: e.g., the Neural Tangent Kernel approach fails to capture their key feature learning ability, while recent analyses of feature learning are typically problem-specific. This work proposes a unified analysis framework for two-layer networks trained by gradient descent. The framework is centered around the principle of feature learning from gradients, and its effectiveness is demonstrated by applications to several prototypical problems, such as mixtures of Gaussians and parity functions. The framework also sheds light on interesting network learning phenomena such as feature learning beyond kernels and the lottery ticket hypothesis.
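
As a rough, self-contained illustration of features arising from gradients (a toy sketch with assumed data, labels, and scaling, not the paper's construction), one can inspect the first-layer weight gradient of a two-layer ReLU network after a single step:

```python
# Toy sketch: "gradient features" as directions of the first-layer gradient of a
# two-layer ReLU network. Data, labels, and scaling here are illustrative assumptions.
import torch

torch.manual_seed(0)
d, m, n = 20, 64, 256                        # input dim, width, sample size
X = torch.randn(n, d)
y = torch.sign(X[:, 0] * X[:, 1])            # hypothetical target for illustration

W = torch.randn(m, d) / d ** 0.5             # first layer (trainable)
W.requires_grad_(True)
a = torch.randn(m) / m ** 0.5                # second layer (held fixed here)

loss = ((torch.relu(X @ W.T) @ a - y) ** 2).mean()
loss.backward()

# Each row of -W.grad is a candidate feature direction induced by the data;
# the framework studies how neurons aligned with such directions emerge and
# are then refined during gradient descent.
gradient_features = -W.grad                  # shape: (m, d)
print(gradient_features.shape)
```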

* NeurIPS 2023, 71 pages 

When and How Does Known Class Help Discover Unknown Ones? Provable Understanding Through Spectral Analysis

Aug 09, 2023
Yiyou Sun, Zhenmei Shi, Yingyu Liang, Yixuan Li

Novel Class Discovery (NCD) aims at inferring novel classes in an unlabeled set by leveraging prior knowledge from a labeled set with known classes. Despite its importance, there is a lack of theoretical foundations for NCD. This paper bridges the gap by providing an analytical framework to formalize and investigate when and how known classes can help discover novel classes. Tailored to the NCD problem, we introduce a graph-theoretic representation that can be learned by a novel NCD Spectral Contrastive Loss (NSCL). Minimizing this objective is equivalent to factorizing the graph's adjacency matrix, which allows us to derive a provable error bound and provide the sufficient and necessary condition for NCD. Empirically, NSCL can match or outperform several strong baselines on common benchmark datasets, which is appealing for practical usage while enjoying theoretical guarantees.
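
For background, a minimal sketch of the generic spectral contrastive loss that NSCL builds on is given below; the NCD-specific graph construction and the weighting of labeled versus unlabeled pairs in NSCL differ, so treat this as context rather than the paper's exact objective:

```python
# Generic spectral contrastive loss on a batch of paired embeddings (background
# sketch; NSCL tailors the underlying graph to the NCD setting).
import torch

def spectral_contrastive_loss(z1, z2):
    """z1, z2: (batch, dim) embeddings of two views of the same samples.
    Minimizing this objective amounts to a low-rank factorization of the
    population adjacency matrix of the underlying graph."""
    n = z1.shape[0]
    pos = -2.0 * (z1 * z2).sum(dim=1).mean()                  # attract positive pairs
    sim = z1 @ z2.T                                           # cross-pair similarities
    mask = ~torch.eye(n, dtype=torch.bool, device=sim.device)
    neg = (sim[mask] ** 2).mean()                             # repel (squared) negative pairs
    return pos + neg
```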

* ICML 2023 

Two Heads are Better than One: Towards Better Adversarial Robustness by Combining Transduction and Rejection

May 27, 2023
Nils Palumbo, Yang Guo, Xi Wu, Jiefeng Chen, Yingyu Liang, Somesh Jha

Both transduction and rejection have emerged as important techniques for defending against adversarial perturbations. A recent work by Tramèr showed that, in the rejection-only case (no transduction), a strong rejection solution can be turned into a strong (but computationally inefficient) non-rejection solution. This detector-to-classifier reduction has been mostly applied to give evidence that certain claims of strong selective-model solutions are susceptible, leaving the benefits of rejection unclear. On the other hand, a recent work by Goldwasser et al. showed that rejection combined with transduction can give provable guarantees (for certain problems) that cannot be achieved otherwise. Nevertheless, under recent strong adversarial attacks (GMSA, which has been shown to be much more effective than AutoAttack against transduction), Goldwasser et al.'s work was shown to have low performance in a practical deep-learning setting. In this paper, we take a step towards realizing the promise of transduction+rejection in more realistic scenarios. Theoretically, we show that a novel application of Tramèr's classifier-to-detector technique in the transductive setting can give significantly improved sample complexity for robust generalization. While our theoretical construction is computationally inefficient, it guides us to identify an efficient transductive algorithm to learn a selective model. Extensive experiments using state-of-the-art attacks (AutoAttack, GMSA) show that our solutions provide significantly better robust accuracy.
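
For reference on the terminology, a selective model pairs a classifier with a detector and abstains whenever the detector fires; a minimal sketch with placeholder callables (not the paper's algorithm):

```python
# Minimal sketch of a selective model: predict with the classifier unless the
# detector flags the input, in which case abstain. Callables are placeholders.
from typing import Any, Callable, Optional

def selective_predict(classifier: Callable[[Any], int],
                      detector: Callable[[Any], bool],
                      x: Any) -> Optional[int]:
    if detector(x):        # detector believes x is adversarial / out of scope
        return None        # reject rather than risk a wrong prediction
    return classifier(x)
```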

Stratified Adversarial Robustness with Rejection

May 12, 2023
Jiefeng Chen, Jayaram Raghuram, Jihye Choi, Xi Wu, Yingyu Liang, Somesh Jha

Recently, there has been emerging interest in adversarially training a classifier with a rejection option (also known as a selective classifier) to boost adversarial robustness. While rejection can incur a cost in many applications, existing studies typically associate zero cost with rejecting perturbed inputs, which can result in the rejection of numerous slightly perturbed inputs that could be correctly classified. In this work, we study adversarially robust classification with rejection in the stratified rejection setting, where the rejection cost is modeled by rejection loss functions that are monotonically non-increasing in the perturbation magnitude. We theoretically analyze the stratified rejection setting and propose a novel defense method -- Adversarial Training with Consistent Prediction-based Rejection (CPR) -- for building a robust selective classifier. Experiments on image datasets demonstrate that the proposed method significantly outperforms existing methods under strong adaptive attacks. For instance, on CIFAR-10, CPR reduces the total robust loss (for different rejection losses) by at least 7.3% under both seen and unseen attacks.
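
One hypothetical instantiation of the stratified rejection setting is sketched below; the particular rejection loss and threshold are our illustrative choices, not the paper's experimental setup:

```python
# Illustrative stratified rejection loss: rejecting a clean input costs 1, rejecting
# a large perturbation costs nothing, with a linear decrease in between.
def rejection_loss(perturbation_norm: float, eps: float = 8 / 255) -> float:
    """Monotonically non-increasing in the perturbation magnitude, as in the
    stratified rejection setting described above (exact shape is ours)."""
    return max(0.0, 1.0 - perturbation_norm / eps)

def total_robust_loss(selective_pred, y, x_clean, x_adv):
    """selective_pred returns a label or None (reject); x_clean, x_adv are tensors."""
    pred = selective_pred(x_adv)
    if pred is None:                                        # rejected: pay the stratified cost
        return rejection_loss((x_adv - x_clean).abs().max().item())
    return float(pred != y)                                 # accepted: pay 0/1 classification loss
```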

* Paper published at International Conference on Machine Learning (ICML'23) 

Domain Generalization via Nuclear Norm Regularization

Mar 13, 2023
Zhenmei Shi, Yifei Ming, Ying Fan, Frederic Sala, Yingyu Liang

The ability to generalize to unseen domains is crucial for machine learning systems deployed in the real world, especially when we only have data from limited training domains. In this paper, we propose a simple and effective regularization method for domain generalization based on the nuclear norm of the learned features. Intuitively, the proposed regularizer mitigates the impact of environmental features and encourages learning domain-invariant features. Theoretically, we provide insights into why nuclear norm regularization is more effective than ERM and alternative regularization methods. Empirically, we conduct extensive experiments on both synthetic and real datasets. We show that nuclear norm regularization achieves strong performance compared to baselines in a wide range of domain generalization tasks. Moreover, our regularizer is broadly compatible with various methods such as ERM and SWAD, consistently improving performance, e.g., 1.7% and 0.9% test accuracy improvements, respectively, on the DomainBed benchmark.
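
A minimal sketch of the kind of objective this suggests, assuming a standard mini-batch training loop (interface names are ours, not the authors' released code):

```python
# Sketch: penalize the nuclear norm of the mini-batch feature matrix on top of ERM.
import torch
import torch.nn.functional as F

def nuclear_norm_regularized_loss(features, logits, labels, lam=0.01):
    """features: (batch, dim) learned representations; lam: regularization weight."""
    erm = F.cross_entropy(logits, labels)
    nuc = torch.linalg.matrix_norm(features, ord="nuc")   # sum of singular values
    return erm + lam * nuc
```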

* 21 pages 

The Trade-off between Universality and Label Efficiency of Representations from Contrastive Learning

Feb 28, 2023
Zhenmei Shi, Jiefeng Chen, Kunyang Li, Jayaram Raghuram, Xi Wu, Yingyu Liang, Somesh Jha

Pre-training representations (a.k.a. foundation models) has recently become a prevalent learning paradigm, where one first pre-trains a representation using large-scale unlabeled data and then learns simple predictors on top of the representation using a small amount of labeled data from the downstream tasks. There are two key desiderata for the representation: label efficiency (the ability to learn an accurate classifier on top of the representation with a small amount of labeled data) and universality (usefulness across a wide range of downstream tasks). In this paper, we focus on one of the most popular instantiations of this paradigm: contrastive learning with linear probing, i.e., learning a linear predictor on the representation pre-trained by contrastive learning. We show that there exists a trade-off between the two desiderata, so one may not be able to achieve both simultaneously. Specifically, we provide analysis using a theoretical data model and show that, while more diverse pre-training data result in more diverse features for different tasks (improving universality), they put less emphasis on task-specific features, giving rise to larger sample complexity for downstream supervised tasks and thus worse prediction performance. Guided by this analysis, we propose a contrastive regularization method to improve the trade-off. We validate our analysis and method empirically with systematic experiments using real-world datasets and foundation models.
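
A minimal self-contained sketch of linear probing, the evaluation protocol studied here; the encoder below is a random stand-in for a pre-trained contrastive representation and the data are synthetic:

```python
# Linear probing: freeze the (pre-trained) encoder and fit only a linear head
# on a small labeled downstream set. Encoder and data here are stand-ins.
import torch
import torch.nn as nn

torch.manual_seed(0)
encoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 64))
for p in encoder.parameters():
    p.requires_grad_(False)                    # the representation stays fixed

head = nn.Linear(64, 10)                       # the only trainable part
opt = torch.optim.SGD(head.parameters(), lr=0.1)

X, y = torch.randn(512, 32), torch.randint(0, 10, (512,))   # small labeled set (synthetic)
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(head(encoder(X)), y)
    loss.backward()
    opt.step()
```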

* 42 pages 

A Theoretical Analysis on Feature Learning in Neural Networks: Emergence from Inputs and Advantage over Fixed Features

Jun 03, 2022
Zhenmei Shi, Junyi Wei, Yingyu Liang

An important characteristic of neural networks is their ability to learn representations of the input data with effective features for prediction, which is believed to be a key factor in their superior empirical performance. To better understand the source and benefit of feature learning in neural networks, we consider learning problems motivated by practical data, where the labels are determined by a set of class-relevant patterns and the inputs are generated from these along with some background patterns. We prove that neural networks trained by gradient descent can succeed on these problems. The success relies on the emergence and improvement of effective features, which are learned efficiently among exponentially many candidates by exploiting the data (in particular, the structure of the input distribution). In contrast, no linear model on data-independent features of polynomial size can achieve comparably small errors. Furthermore, if the specific input structure is removed, then no polynomial algorithm in the Statistical Query model can learn even weakly. These results provide theoretical evidence that feature learning in neural networks depends strongly on the input structure and leads to their superior performance. Our preliminary experimental results on synthetic and real data also provide positive support.
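
A toy data generator in the spirit of the described setting (our simplification, not the paper's exact distributions): labels depend only on a few class-relevant patterns, while background patterns add label-irrelevant structure.

```python
# Toy generator: class-relevant patterns determine the label; background patterns
# are label-irrelevant. Pattern choices and scalings are illustrative assumptions.
import torch

torch.manual_seed(0)
d, k = 50, 4                                        # input dim, number of relevant patterns
relevant = torch.eye(d)[:k]                         # hypothetical class-relevant patterns
background = torch.eye(d)[k:]                       # remaining directions act as background

def sample(n):
    pick = torch.randint(0, 2, (n, k)).float()      # which relevant patterns appear
    bg = torch.randint(0, 2, (n, d - k)).float()    # background patterns (label-irrelevant)
    X = pick @ relevant + 0.5 * bg @ background
    y = 2.0 * (pick.sum(dim=1) % 2) - 1.0           # parity-style label from relevant patterns only
    return X, y

X, y = sample(1000)
```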

* 81 pages. ICLR 2022 Poster 

On the identifiability of mixtures of ranking models

Jan 31, 2022
Xiaomin Zhang, Xucheng Zhang, Po-Ling Loh, Yingyu Liang

Mixtures of ranking models are standard tools for ranking problems. However, even the fundamental question of parameter identifiability is not fully understood: the identifiability of a mixture model with two Bradley-Terry-Luce (BTL) components has remained open. In this work, we show that popular mixtures of ranking models with two components (Plackett-Luce, multinomial logistic model with slates of size 3, or BTL) are generically identifiable, i.e., the ground-truth parameters can be identified except when they are from a pathological subset of measure zero. We provide a framework for verifying the number of solutions in a general family of polynomial systems using algebraic geometry, and apply it to these mixtures of ranking models. The framework can be applied more broadly to other learning models and may be of independent interest.
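
For concreteness, the previously open two-component BTL case can be written as follows (standard Bradley-Terry-Luce formulation; the notation is ours):

```latex
% Two-component Bradley-Terry-Luce (BTL) mixture; \pi is the mixing weight and
% w, w' are the positive score vectors of the two components.
\[
  \Pr(i \succ j) \;=\; \pi\,\frac{w_i}{w_i + w_j} \;+\; (1-\pi)\,\frac{w_i'}{w_i' + w_j'} .
\]
% Generic identifiability: (\pi, w, w') is recoverable from these pairwise win
% probabilities except on a measure-zero subset of the parameter space.
```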

* 43 pages, 2 tables. Comments are very welcome 

Towards Evaluating the Robustness of Neural Networks Learned by Transduction

Oct 27, 2021
Jiefeng Chen, Xi Wu, Yang Guo, Yingyu Liang, Somesh Jha

There has been emerging interest in using transductive learning for adversarial robustness (Goldwasser et al., NeurIPS 2020; Wu et al., ICML 2020; Wang et al., ArXiv 2021). Compared to traditional defenses, these defense mechanisms "dynamically learn" the model based on the test-time input; theoretically, attacking these defenses reduces to solving a bilevel optimization problem, which makes crafting adaptive attacks difficult. In this paper, we examine these defense mechanisms from a principled threat-analysis perspective. We formulate and analyze threat models for transductive-learning based defenses, and point out important subtleties. We propose the principle of attacking model space for solving bilevel attack objectives, and present Greedy Model Space Attack (GMSA), an attack framework that can serve as a new baseline for evaluating transductive-learning based defenses. Through systematic evaluation, we show that GMSA, even with weak instantiations, can break previous transductive-learning based defenses that were resilient to earlier attacks such as AutoAttack (Croce and Hein, ICML 2020). On the positive side, we report a somewhat surprising empirical result of "transductive adversarial training": adversarially retraining the model using fresh randomness at test time gives a significant increase in robustness against the attacks we consider.
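
A schematic sketch of the greedy model-space idea behind GMSA, with the defender's transductive training step and the multi-model attack left as assumed callables (the paper instantiates the latter with gradient-based optimization over the collected models):

```python
# Schematic greedy model-space attack loop. `defender_train` and `attack_models`
# are assumed callables; x_test is a tensor of test inputs.
def greedy_model_space_attack(defender_train, attack_models, x_test, y_test, rounds=10):
    models, x_adv = [], x_test.clone()
    for _ in range(rounds):
        model = defender_train(x_adv)          # transductive defense adapts to the attacked test set
        models.append(model)
        # Craft perturbations that do well against *all* models seen so far,
        # e.g., by maximizing the average loss over `models`, not just the latest one.
        x_adv = attack_models(models, x_test, y_test)
    return x_adv
```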

* arXiv admin note: substantial text overlap with arXiv:2106.08387 