Nicholas Carlini

Quantifying Memorization Across Neural Language Models

Feb 24, 2022
Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, Chiyuan Zhang

Large language models (LMs) have been shown to memorize parts of their training data, and when prompted appropriately, they will emit the memorized training data verbatim. This is undesirable because memorization violates privacy (exposing user data), degrades utility (repeated easy-to-memorize text is often low quality), and hurts fairness (some texts are memorized over others). We describe three log-linear relationships that quantify the degree to which LMs emit memorized training data. Memorization significantly grows as we increase (1) the capacity of a model, (2) the number of times an example has been duplicated, and (3) the number of tokens of context used to prompt the model. Surprisingly, we find the situation becomes complicated when generalizing these results across model families. On the whole, we find that memorization in LMs is more prevalent than previously believed and will likely get worse as models continue to scale, at least without active mitigations.

Debugging Differential Privacy: A Case Study for Privacy Auditing

Feb 24, 2022
Florian Tramer, Andreas Terzis, Thomas Steinke, Shuang Song, Matthew Jagielski, Nicholas Carlini

Differential Privacy can provide provable privacy guarantees for training data in machine learning. However, the presence of proofs does not preclude the presence of errors. Inspired by recent advances in auditing which have been used for estimating lower bounds on differentially private algorithms, here we show that auditing can also be used to find flaws in (purportedly) differentially private schemes. In this case study, we audit a recent open source implementation of a differentially private deep learning algorithm and find, with 99.99999999% confidence, that the implementation does not satisfy the claimed differential privacy guarantee.
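
As a rough illustration of how an audit can yield a lower bound on the privacy parameter, the sketch below uses the standard hypothesis-testing view of (epsilon, delta)-differential privacy: any attack that distinguishes two neighboring datasets must satisfy FPR + e^epsilon * FNR >= 1 - delta (and the same with FPR and FNR swapped). The function name and inputs are hypothetical and are not taken from the paper. It also works on point estimates of the error rates; a real audit would replace them with statistical confidence bounds before claiming a violation.

```python
import math

def epsilon_lower_bound(fpr, fnr, delta=0.0):
    """Lower bound on epsilon implied by an attack's observed error rates.

    fpr, fnr: false-positive and false-negative rates of a distinguishing
    attack (hypothetical inputs, assumed to be measured elsewhere).
    """
    def bound(a, b):
        # Rearranged from a + exp(eps) * b >= 1 - delta.
        numerator = 1.0 - delta - a
        if numerator <= 0.0:
            return 0.0          # the inequality gives no information
        if b == 0.0:
            return math.inf     # point estimate of zero error: unbounded
        return math.log(numerator / b)

    # Use whichever direction of the inequality gives the stronger bound.
    return max(0.0, bound(fpr, fnr), bound(fnr, fpr))
```

If the value returned for a well-estimated attack exceeds the epsilon an implementation claims, that is evidence the claimed guarantee does not hold.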

Counterfactual Memorization in Neural Language Models

Dec 24, 2021
Chiyuan Zhang, Daphne Ippolito, Katherine Lee, Matthew Jagielski, Florian Tramèr, Nicholas Carlini

Modern neural language models widely used in tasks across NLP risk memorizing sensitive information from their training data. As models continue to scale up in parameters, training data, and compute, understanding memorization in language models is both important from a learning-theoretical point of view and practically crucial in real-world applications. An open question in previous studies of memorization in language models is how to filter out "common" memorization. In fact, most memorization criteria strongly correlate with the number of occurrences in the training set, capturing "common" memorization such as familiar phrases, public knowledge or templated texts. In this paper, we provide a principled perspective inspired by a taxonomy of human memory in psychology. From this perspective, we formulate a notion of counterfactual memorization, which characterizes how a model's predictions change if a particular document is omitted during training. We identify and study counterfactually-memorized training examples in standard text datasets. We further estimate the influence of each training example on the validation set and on generated texts, and show that this can provide direct evidence of the source of memorization at test time.

* 43 pages, 34 figures
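
One way to make the counterfactual idea above concrete is to train many models on random subsets of the data and, for each document, compare the average performance of models that saw it with the average performance of models that did not. The sketch below assumes this subset-based estimator; the array names `in_mask` and `perf` are hypothetical, and the abstract itself does not specify the estimation procedure.

```python
import numpy as np

def counterfactual_memorization(in_mask, perf):
    """Estimate per-example counterfactual memorization.

    in_mask[m, i]: True if model m was trained on example i.
    perf[m, i]:    performance (e.g. accuracy) of model m on example i.
    Returns mean performance of models trained WITH each example minus
    mean performance of models trained WITHOUT it.
    """
    in_mask = np.asarray(in_mask, dtype=bool)
    perf = np.asarray(perf, dtype=float)
    n_in = in_mask.sum(axis=0)
    n_out = (~in_mask).sum(axis=0)
    if np.any(n_in == 0) or np.any(n_out == 0):
        raise ValueError("every example needs at least one IN and one OUT model")
    mean_in = (perf * in_mask).sum(axis=0) / n_in
    mean_out = (perf * ~in_mask).sum(axis=0) / n_out
    return mean_in - mean_out
```

A score near zero suggests the model predicts the document equally well whether or not it was trained on it ("common" text), while a large score suggests the prediction depends on having seen that specific document.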

Membership Inference Attacks From First Principles

Dec 07, 2021
Nicholas Carlini, Steve Chien, Milad Nasr, Shuang Song, Andreas Terzis, Florian Tramer

A membership inference attack allows an adversary to query a trained machine learning model to predict whether or not a particular example was contained in the model's training dataset. These attacks are currently evaluated using average-case "accuracy" metrics that fail to characterize whether the attack can confidently identify any members of the training set. We argue that attacks should instead be evaluated by computing their true-positive rate at low (e.g., <0.1%) false-positive rates, and find most prior attacks perform poorly when evaluated in this way. To address this we develop a Likelihood Ratio Attack (LiRA) that carefully combines multiple ideas from the literature. Our attack is 10x more powerful at low false-positive rates, and also strictly dominates prior attacks on existing metrics.
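
The evaluation metric argued for above, true-positive rate at a fixed low false-positive rate, can be computed directly from attack scores. The following is a minimal sketch, assuming the attack outputs a real-valued score per example where higher means "more likely a member"; the function and argument names are hypothetical and this is not the LiRA attack itself.

```python
import numpy as np

def tpr_at_fpr(member_scores, nonmember_scores, target_fpr=0.001):
    """TPR of the rule 'score > threshold', with the threshold chosen so
    that at most target_fpr of the non-members are flagged as members."""
    member_scores = np.asarray(member_scores, dtype=float)
    # Non-member scores sorted from highest to lowest.
    nonmember_desc = np.sort(np.asarray(nonmember_scores, dtype=float))[::-1]
    n = len(nonmember_desc)
    # Number of non-members we are allowed to flag (small slack for float error).
    k = int(np.floor(target_fpr * n + 1e-9))
    # Flag only scores strictly above the (k+1)-th largest non-member score.
    threshold = nonmember_desc[k] if k < n else -np.inf
    return float(np.mean(member_scores > threshold))
```

Reporting this value at, say, a 0.1% false-positive rate shows whether an attack can confidently identify any members, which an average-case accuracy figure can hide.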

Unsolved Problems in ML Safety

Sep 28, 2021
Dan Hendrycks, Nicholas Carlini, John Schulman, Jacob Steinhardt

Machine learning (ML) systems are rapidly increasing in size, are acquiring new capabilities, and are increasingly deployed in high-stakes settings. As with other powerful technologies, safety for ML should be a leading research priority. In response to emerging safety challenges in ML, such as those introduced by recent large-scale models, we provide a new roadmap for ML Safety and refine the technical problems that the field needs to address. We present four problems ready for research, namely withstanding hazards ("Robustness"), identifying hazards ("Monitoring"), steering ML systems ("Alignment"), and reducing risks to how ML systems are handled ("External Safety"). Throughout, we clarify each problem's motivation and provide concrete research directions.

* Position Paper

Deduplicating Training Data Makes Language Models Better

Jul 14, 2021
Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, Nicholas Carlini

We find that existing language modeling datasets contain many near-duplicate examples and long repetitive substrings. As a result, over 1% of the unprompted output of language models trained on these datasets is copied verbatim from the training data. We develop two tools that allow us to deduplicate training datasets -- for example removing from C4 a single 61 word English sentence that is repeated over 60,000 times. Deduplication allows us to train models that emit memorized text ten times less frequently and require fewer train steps to achieve the same or better accuracy. We can also reduce train-test overlap, which affects over 4% of the validation set of standard datasets, thus allowing for more accurate evaluation. We release code for reproducing our work and performing dataset deduplication at https://github.com/google-research/deduplicate-text-datasets.

Evading Adversarial Example Detection Defenses with Orthogonal Projected Gradient Descent

Jun 28, 2021
Oliver Bryniarski, Nabeel Hingun, Pedro Pachuca, Vincent Wang, Nicholas Carlini

Evading adversarial example detection defenses requires finding adversarial examples that must simultaneously (a) be misclassified by the model and (b) be detected as non-adversarial. We find that existing attacks that attempt to satisfy multiple simultaneous constraints often over-optimize against one constraint at the cost of satisfying another. We introduce Orthogonal Projected Gradient Descent, an improved attack technique to generate adversarial examples that avoids this problem by orthogonalizing the gradients when running standard gradient-based attacks. We use our technique to evade four state-of-the-art detection defenses, reducing their accuracy to 0% while maintaining a 0% detection rate.
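
The core step of orthogonalizing one gradient against another can be sketched in a few lines. This is only the projection step, under the assumption that `g` is the gradient of one objective (for example, the classifier loss) and `h` is the gradient of the other (for example, the detector loss); both names are hypothetical, and the full attack loop from the paper is not reproduced here.

```python
import numpy as np

def orthogonalize(g, h, eps=1e-12):
    """Return the component of g that is orthogonal to h.

    Stepping along the result changes the first objective without, to
    first order, changing the objective whose gradient is h.
    """
    g = np.asarray(g, dtype=float)
    h = np.asarray(h, dtype=float)
    h_norm_sq = np.dot(h.ravel(), h.ravel())
    if h_norm_sq < eps:
        return g.copy()  # h is (numerically) zero: nothing to project out
    coeff = np.dot(g.ravel(), h.ravel()) / h_norm_sq
    return g - coeff * h
```

For instance, `orthogonalize([1.0, 1.0], [0.0, 2.0])` removes the second coordinate and leaves a vector pointing along the first axis only.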

Indicators of Attack Failure: Debugging and Improving Optimization of Adversarial Examples

Jun 18, 2021
Maura Pintor, Luca Demetrio, Angelo Sotgiu, Giovanni Manca, Ambra Demontis, Nicholas Carlini, Battista Biggio, Fabio Roli

Evaluating robustness of machine-learning models to adversarial examples is a challenging problem. Many defenses have been shown to provide a false sense of security by causing gradient-based attacks to fail, and they have been broken under more rigorous evaluations. Although guidelines and best practices have been suggested to improve current adversarial robustness evaluations, the lack of automatic testing and debugging tools makes it difficult to apply these recommendations in a systematic manner. In this work, we overcome these limitations by (i) defining a set of quantitative indicators which unveil common failures in the optimization of gradient-based attacks, and (ii) proposing specific mitigation strategies within a systematic evaluation protocol. Our extensive experimental analysis shows that the proposed indicators of failure can be used to visualize, debug and improve current adversarial robustness evaluations, providing a first concrete step towards automatizing and systematizing current adversarial robustness evaluations. Our open-source code is available at: https://github.com/pralab/IndicatorsOfAttackFailure.

Poisoning and Backdooring Contrastive Learning

Jun 17, 2021
Nicholas Carlini, Andreas Terzis

Contrastive learning methods like CLIP train on noisy and uncurated training datasets. This is cheaper than labeling datasets manually, and even improves out-of-distribution robustness. We show that this practice makes backdoor and poisoning attacks a significant threat. By poisoning just 0.005% of a dataset (e.g., just 150 images of the 3-million-example Conceptual Captions dataset), we can cause the model to misclassify test images by overlaying a small patch. Targeted poisoning attacks, whereby the model misclassifies a particular test input with an adversarially-desired label, are even easier, requiring control of less than 0.0001% of the dataset (e.g., just two out of the 3 million images). Our attacks call into question whether training on noisy and uncurated Internet scrapes is desirable.
