Percy Liang

An Investigation of Why Overparameterization Exacerbates Spurious Correlations

May 09, 2020
Shiori Sagawa, Aditi Raghunathan, Pang Wei Koh, Percy Liang

We study why overparameterization (increasing model size well beyond the point of zero training error) can hurt test error on minority groups despite improving average test error when there are spurious correlations in the data. Through simulations and experiments on two image datasets, we identify two key properties of the training data that drive this behavior: the proportions of majority versus minority groups, and the signal-to-noise ratio of the spurious correlations. We then analyze a linear setting and show theoretically how the inductive bias of models towards "memorizing" fewer examples can cause overparameterization to hurt. Our analysis leads to a counterintuitive approach of subsampling the majority group, which empirically achieves low minority error in the overparameterized regime, even though the standard approach of upweighting the minority fails. Overall, our results suggest a tension between using overparameterized models versus using all the training data for achieving low worst-group error.
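
To make the subsampling idea concrete, here is a minimal sketch of one plausible reading: every group is downsampled to the size of the smallest group before training. The function name, the `group_of` callback, and the toy `(label, attribute)` pairs are hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


def subsample_to_smallest_group(examples, group_of, seed=0):
    """Downsample every group to the size of the smallest group.

    `group_of` maps an example to a hashable group id (an assumption of
    this sketch; the abstract does not define the grouping).
    """
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for example in examples:
        by_group[group_of(example)].append(example)
    if not by_group:
        return []
    target = min(len(members) for members in by_group.values())
    balanced = []
    for members in by_group.values():
        # Sample without replacement so no example is duplicated.
        balanced.extend(rng.sample(members, target))
    rng.shuffle(balanced)
    return balanced


# Toy data: 90 majority examples and 10 minority examples.
toy = [("y=1", "a=1")] * 90 + [("y=1", "a=0")] * 10
balanced = subsample_to_smallest_group(toy, group_of=lambda ex: ex)
minority_count = sum(1 for ex in balanced if ex == ("y=1", "a=0"))
```

Note that this discards most of the majority data, which is exactly the tension the abstract describes between using all the training data and achieving low worst-group error.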

ExpBERT: Representation Engineering with Natural Language Explanations

May 05, 2020
Shikhar Murty, Pang Wei Koh, Percy Liang

Suppose we want to specify the inductive bias that married couples typically go on honeymoons for the task of extracting pairs of spouses from text. In this paper, we allow model developers to specify these types of inductive biases as natural language explanations. We use BERT fine-tuned on MultiNLI to "interpret" these explanations with respect to the input sentence, producing explanation-guided representations of the input. Across three relation extraction tasks, our method, ExpBERT, matches a BERT baseline but with 3-20x less labeled data and improves on the baseline by 3-10 F1 points with the same amount of labeled data.

* ACL 2020

Robust Encodings: A Framework for Combating Adversarial Typos

May 04, 2020
Erik Jones, Robin Jia, Aditi Raghunathan, Percy Liang

Despite excellent performance on many tasks, NLP systems are easily fooled by small adversarial perturbations of inputs. Existing procedures to defend against such perturbations are either (i) heuristic in nature and susceptible to stronger attacks or (ii) provide guaranteed robustness to worst-case attacks, but are incompatible with state-of-the-art models like BERT. In this work, we introduce robust encodings (RobEn): a simple framework that confers guaranteed robustness, without making compromises on model architecture. The core component of RobEn is an encoding function, which maps sentences to a smaller, discrete space of encodings. Systems using these encodings as a bottleneck confer guaranteed robustness with standard training, and the same encodings can be used across multiple tasks. We identify two desiderata to construct robust encoding functions: perturbations of a sentence should map to a small set of encodings (stability), and models using encodings should still perform well (fidelity). We instantiate RobEn to defend against a large family of adversarial typos. Across six tasks from GLUE, our instantiation of RobEn paired with BERT achieves an average robust accuracy of 71.3% against all adversarial typos in the family considered, while previous work using a typo-corrector achieves only 35.3% accuracy against a simple greedy attack.

* ACL 2020
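
The stability property can be illustrated with a toy encoding function. This is a deliberately simple stand-in, not the construction used in RobEn: it assumes the only perturbation is a reordering of a word's interior letters, and it maps each word to a canonical form that is invariant to such reorderings. All names below are hypothetical.

```python
def encode_token(token):
    """Canonical form: first letter, sorted interior letters, last letter.

    Toy assumption: typos only permute the interior characters of a word.
    """
    t = token.lower()
    if len(t) <= 3:
        return t
    return t[0] + "".join(sorted(t[1:-1])) + t[-1]


def encode_sentence(sentence):
    """Encode a whitespace-tokenized sentence token by token."""
    return tuple(encode_token(tok) for tok in sentence.split())


clean = "the movie was wonderful"
typo = "the mvoie was wodnerful"

# Both sentences collapse to the same encoding, so any model that only
# sees the encoding makes the same prediction on both.
same_encoding = encode_sentence(clean) == encode_sentence(typo)
```

A model trained on the output of `encode_sentence` never sees the raw characters, so its predictions cannot change under perturbations that the encoding collapses. The cost is fidelity: distinct words with the same letters (for example "form" and "from") become indistinguishable.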

Understanding Self-Training for Gradual Domain Adaptation

Feb 26, 2020
Ananya Kumar, Tengyu Ma, Percy Liang

Machine learning systems must adapt to data distributions that evolve over time, in applications ranging from sensor networks and self-driving car perception modules to brain-machine interfaces. We consider gradual domain adaptation, where the goal is to adapt an initial classifier trained on a source domain given only unlabeled data that shifts gradually in distribution towards a target domain. We prove the first non-vacuous upper bound on the error of self-training with gradual shifts, under settings where directly adapting to the target domain can result in unbounded error. The theoretical analysis leads to algorithmic insights, highlighting that regularization and label sharpening are essential even when we have infinite data, and suggesting that self-training works particularly well for shifts with small Wasserstein-infinity distance. Leveraging the gradual shift structure leads to higher accuracies on a rotating MNIST dataset and a realistic Portraits dataset.
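
The contrast between gradual and direct adaptation can be sketched in one dimension. The example below is a toy under stated assumptions, not the paper's setting: the classifier is a single threshold, the two classes drift by a fixed amount per step, and refitting places the threshold midway between the pseudo-labeled class means. Hard pseudo-labels stand in for label sharpening; regularization is not modeled.

```python
def predict(xs, threshold):
    """Hard (sharpened) labels from a 1-D threshold classifier."""
    return [1 if x >= threshold else 0 for x in xs]


def fit_threshold(xs, labels, fallback):
    """Midpoint between class means; keep the old threshold if a class is empty."""
    zeros = [x for x, y in zip(xs, labels) if y == 0]
    ones = [x for x, y in zip(xs, labels) if y == 1]
    if not zeros or not ones:
        return fallback
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2


def self_train(threshold, unlabeled_domains):
    """Pseudo-label each domain with the current model, then refit on it."""
    for xs in unlabeled_domains:
        pseudo = predict(xs, threshold)
        threshold = fit_threshold(xs, pseudo, fallback=threshold)
    return threshold


def make_domain(shift):
    """Class 0 centered at `shift`, class 1 centered at `2 + shift`."""
    offsets = (-0.3, 0.0, 0.3)
    xs = [shift + o for o in offsets] + [2.0 + shift + o for o in offsets]
    return xs, [0, 0, 0, 1, 1, 1]


def accuracy(xs, ys, threshold):
    preds = predict(xs, threshold)
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)


source_threshold = 1.0  # separates the unshifted source classes
intermediate = [make_domain(0.5 * t)[0] for t in range(1, 7)]
target_xs, target_ys = make_domain(3.0)

gradual_acc = accuracy(target_xs, target_ys, self_train(source_threshold, intermediate))
direct_acc = accuracy(target_xs, target_ys, self_train(source_threshold, [target_xs]))
```

In this toy, following the intermediate domains lets the threshold track the drift, while adapting directly to the target pseudo-labels every point as class 1 and leaves the classifier unchanged.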

Understanding and Mitigating the Tradeoff Between Robustness and Accuracy

Feb 25, 2020
Aditi Raghunathan, Sang Michael Xie, Fanny Yang, John Duchi, Percy Liang

Adversarial training augments the training set with perturbations to improve the robust error (over worst-case perturbations), but it often leads to an increase in the standard error (on unperturbed test inputs). Previous explanations for this tradeoff rely on the assumption that no predictor in the hypothesis class has low standard and robust error. In this work, we precisely characterize the effect of augmentation on the standard error in linear regression when the optimal linear predictor has zero standard and robust error. In particular, we show that the standard error could increase even when the augmented perturbations have noiseless observations from the optimal linear predictor. We then prove that the recently proposed robust self-training (RST) estimator improves robust error without sacrificing standard error for noiseless linear regression. Empirically, for neural networks, we find that RST with different adversarial training methods improves both standard and robust error for random and adversarial rotations and adversarial $\ell_\infty$ perturbations in CIFAR-10.

Noise Induces Loss Discrepancy Across Groups for Linear Regression

Nov 22, 2019
Fereshte Khani, Percy Liang

We study the effect of feature noise (measurement error) on the discrepancy between losses across two groups (e.g., men and women) in the context of linear regression. Our main finding is that adding even the same amount of noise on all individuals impacts groups differently. We characterize several forms of loss discrepancy in terms of the amount of noise and difference between moments of the two groups, for estimators that either do or do not use group membership information. We then study how long it takes for an estimator to adapt to a shift in the population that makes the groups have the same mean. We finally validate our results on three real-world datasets.

Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization

Nov 20, 2019
Shiori Sagawa, Pang Wei Koh, Tatsunori B. Hashimoto, Percy Liang

Overparameterized neural networks can be highly accurate on average on an i.i.d. test set yet consistently fail on atypical groups of the data (e.g., by learning spurious correlations that hold on average but not in such groups). Distributionally robust optimization (DRO) allows us to learn models that instead minimize the worst-case training loss over a set of pre-defined groups. However, we find that naively applying group DRO to overparameterized neural networks fails: these models can perfectly fit the training data, and any model with vanishing average training loss also already has vanishing worst-case training loss. Instead, their poor worst-case performance arises from poor generalization on some groups. By coupling group DRO models with increased regularization (stronger-than-typical $\ell_2$ regularization or early stopping), we achieve substantially higher worst-group accuracies, with 10-40 percentage point improvements on a natural language inference task and two image tasks, while maintaining high average accuracies. Our results suggest that regularization is critical for worst-group generalization in the overparameterized regime, even if it is not needed for average generalization. Finally, we introduce and give convergence guarantees for a stochastic optimizer for the group DRO setting, underpinning the empirical study above.
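
The difference between the average objective and the group DRO objective can be shown on a handful of per-example losses. This is a sketch, not the paper's optimizer: the exponentiated-weight update in `update_group_weights` is one common way to track the worst group online and is an assumption here, as are all function names and the toy numbers.

```python
import math


def group_losses(losses, groups):
    """Mean loss within each group."""
    totals, counts = {}, {}
    for loss, group in zip(losses, groups):
        totals[group] = totals.get(group, 0.0) + loss
        counts[group] = counts.get(group, 0) + 1
    return {g: totals[g] / counts[g] for g in totals}


def worst_group_loss(losses, groups):
    """Group DRO objective: the largest per-group mean loss."""
    return max(group_losses(losses, groups).values())


def update_group_weights(weights, per_group, step_size=1.0):
    """Shift weight towards high-loss groups, then renormalize (assumed update)."""
    scaled = {g: w * math.exp(step_size * per_group[g]) for g, w in weights.items()}
    total = sum(scaled.values())
    return {g: v / total for g, v in scaled.items()}


# Toy batch: three low-loss majority examples, one high-loss minority example.
losses = [0.1, 0.2, 0.3, 2.0]
groups = ["majority", "majority", "majority", "minority"]

per_group = group_losses(losses, groups)
average_loss = sum(losses) / len(losses)
worst_loss = worst_group_loss(losses, groups)
weights = update_group_weights({"majority": 0.5, "minority": 0.5}, per_group)
```

The average loss hides the minority group's poor fit, whereas the worst-group loss exposes it. As the abstract notes, once a model fits the training set perfectly both quantities vanish, which is why the authors pair the objective with stronger regularization.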

Learning Autocomplete Systems as a Communication Game

Nov 16, 2019
Mina Lee, Tatsunori B. Hashimoto, Percy Liang

We study textual autocomplete (the task of predicting a full sentence from a partial sentence) as a human-machine communication game. Specifically, we consider three competing goals for effective communication: use as few tokens as possible (efficiency), transmit sentences faithfully (accuracy), and be learnable by humans (interpretability). We propose an unsupervised approach that tackles all three desiderata by constraining the communication scheme to keywords extracted from a source sentence for interpretability and optimizing the efficiency-accuracy tradeoff. Our experiments show that this approach results in an autocomplete system that is 52% more accurate at a given efficiency level compared to baselines, is robust to user variations, and saves nearly 50% of the time compared to typing full sentences.

Shaping Visual Representations with Language for Few-shot Classification

Nov 06, 2019
Jesse Mu, Percy Liang, Noah Goodman

Language is designed to convey useful information about the world, thus serving as a scaffold for efficient human learning. How can we let language guide representation learning in machine learning models? We explore this question in the setting of few-shot visual classification, proposing models which learn to perform visual classification while jointly predicting natural language task descriptions at train time. At test time, with no language available, we find that these language-influenced visual representations are more generalizable, compared to meta-learning baselines and approaches that explicitly use language as a bottleneck for classification.

* 9 pages inc. supplement; NeurIPS 2019 Workshop on Visually Grounded Interaction and Language (ViGIL)

Verified Uncertainty Calibration

Sep 23, 2019
Ananya Kumar, Percy Liang, Tengyu Ma

Applications such as weather forecasting and personalized medicine demand models that output calibrated probability estimates, that is, estimates representative of the true likelihood of a prediction. Most models are not calibrated out of the box but are recalibrated by post-processing model outputs. We find in this work that (i) popular recalibration methods like Platt scaling and temperature scaling are less calibrated than reported, and (ii) current techniques cannot estimate how miscalibrated they are. An alternative method, histogram binning, has measurable calibration error but is sample inefficient: it requires $O(B/\epsilon^2)$ samples, compared to $O(1/\epsilon^2)$ for scaling methods, where $B$ is the number of distinct probabilities the model can output. To get the best of both worlds, we introduce the scaling-binning calibrator, which first fits a parametric function that acts like a baseline for variance reduction and then bins the function values to actually ensure calibration. This requires only $O(1/\epsilon^2 + B)$ samples. We then show that methods used to estimate calibration error are suboptimal: we prove that an alternative estimator introduced in the meteorological community requires fewer samples, proportional to $\sqrt{B}$ instead of $B$. We validate our approach with multiclass calibration experiments on CIFAR-10 and ImageNet, where we obtain a 35% lower calibration error than histogram binning and, unlike scaling methods, guarantees on true calibration.

* Accepted as a spotlight to NeurIPS 2019, original title was "Variance Reduced Uncertainty Calibration"
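
The binning step of a scaling-then-binning calibrator can be sketched as follows. Several things are assumed rather than taken from the abstract: the scores passed in have already gone through a fitted scaling function (for example Platt or temperature scaling), bins are chosen to hold equal numbers of scores, each bin outputs the mean of the scaled scores that fall in it, and the data splitting between steps is omitted. Function names are hypothetical.

```python
from bisect import bisect_right


def equal_mass_edges(scores, num_bins):
    """Interior bin edges chosen so each bin holds roughly the same number of scores."""
    ordered = sorted(scores)
    n = len(ordered)
    return [ordered[(i * n) // num_bins] for i in range(1, num_bins)]


def fit_binning(scaled_scores, num_bins):
    """Return (edges, bin_means) computed from already-scaled scores."""
    edges = equal_mass_edges(scaled_scores, num_bins)
    sums = [0.0] * num_bins
    counts = [0] * num_bins
    for score in scaled_scores:
        b = bisect_right(edges, score)
        sums[b] += score
        counts[b] += 1
    overall = sum(scaled_scores) / len(scaled_scores)
    # Empty bins (possible with tied scores) fall back to the overall mean.
    means = [sums[b] / counts[b] if counts[b] else overall for b in range(num_bins)]
    return edges, means


def calibrate(score, edges, means):
    """Map a scaled score to the mean of its bin."""
    return means[bisect_right(edges, score)]


scaled = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
edges, means = fit_binning(scaled, num_bins=2)
low = calibrate(0.35, edges, means)
high = calibrate(0.95, edges, means)
```

Because the calibrator can output only `num_bins` distinct values, its calibration error can be measured by comparing each bin's output with the empirical label frequency in that bin, which is not possible for a continuous scaling function.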
