Nicholas Carlini

Are aligned neural networks adversarially aligned?

Jun 26, 2023
Nicholas Carlini, Milad Nasr, Christopher A. Choquette-Choo, Matthew Jagielski, Irena Gao, Anas Awadalla, Pang Wei Koh, Daphne Ippolito, Katherine Lee, Florian Tramer, Ludwig Schmidt

Large language models are now tuned to align with the goals of their creators, namely to be "helpful and harmless." These models should respond helpfully to user questions, but refuse to answer requests that could cause harm. However, adversarial users can construct inputs which circumvent attempts at alignment. In this work, we study to what extent these models remain aligned, even when interacting with an adversarial user who constructs worst-case inputs (adversarial examples). These inputs are designed to cause the model to emit harmful content that would otherwise be prohibited. We show that existing NLP-based optimization attacks are insufficiently powerful to reliably attack aligned text models: even when current NLP-based attacks fail, we can find adversarial inputs with brute force. As a result, the failure of current attacks should not be seen as proof that aligned text models remain aligned under adversarial inputs. However, the recent trend in large-scale ML models is multimodal models that allow users to provide images that influence the text that is generated. We show these models can be easily attacked, i.e., induced to perform arbitrary un-aligned behavior through adversarial perturbation of the input image. We conjecture that improved NLP attacks may demonstrate this same level of adversarial control over text-only models.

Evading Black-box Classifiers Without Breaking Eggs

Jun 05, 2023
Edoardo Debenedetti, Nicholas Carlini, Florian Tramèr

Decision-based evasion attacks repeatedly query a black-box classifier to generate adversarial examples. Prior work measures the cost of such attacks by the total number of queries made to the classifier. We argue this metric is flawed. Most security-critical machine learning systems aim to weed out "bad" data (e.g., malware or harmful content). Queries to such systems carry a fundamentally asymmetric cost: queries detected as "bad" come at a higher cost because they trigger additional security filters, e.g., usage throttling or account suspension. Yet, we find that existing decision-based attacks issue a large number of "bad" queries, which likely renders them ineffective against security-critical systems. We then design new attacks that reduce the number of bad queries by $1.5$-$7.3\times$, but often at a significant increase in total (non-bad) queries. We thus pose it as an open problem to build black-box attacks that are more effective under realistic cost metrics.

* Code at https://github.com/ethz-privsec/realistic-adv-examples
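
The asymmetric cost argued for in the abstract can be illustrated with a minimal sketch. The `QueryLog` type, the `asymmetric_cost` function, the weights, and the query counts below are all hypothetical choices made for illustration; they are not taken from the paper or its code.

```python
from dataclasses import dataclass


@dataclass
class QueryLog:
    """Counts of queries an attack issued, split by the classifier's verdict."""
    bad: int   # queries the classifier flagged as "bad"
    good: int  # queries the classifier did not flag


def total_queries(log: QueryLog) -> int:
    """Conventional metric: every query costs the same."""
    return log.bad + log.good


def asymmetric_cost(log: QueryLog, bad_cost: float = 100.0, good_cost: float = 1.0) -> float:
    """Hypothetical weighted metric: flagged queries are charged more."""
    return bad_cost * log.bad + good_cost * log.good


# Illustrative numbers only, not results from the paper.
baseline = QueryLog(bad=5_000, good=5_000)
stealthy = QueryLog(bad=1_000, good=30_000)

for name, log in (("baseline", baseline), ("stealthy", stealthy)):
    print(name, total_queries(log), asymmetric_cost(log))
```

With these made-up numbers the second attack issues more queries in total but is cheaper once flagged queries are weighted more heavily, which is the trade-off the abstract describes.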

Students Parrot Their Teachers: Membership Inference on Model Distillation

Mar 06, 2023
Matthew Jagielski, Milad Nasr, Christopher Choquette-Choo, Katherine Lee, Nicholas Carlini

Model distillation is frequently proposed as a technique to reduce the privacy leakage of machine learning. These empirical privacy defenses rely on the intuition that distilled "student" models protect the privacy of training data, as they only interact with this data indirectly through a "teacher" model. In this work, we design membership inference attacks to systematically study the privacy provided by knowledge distillation to both the teacher and student training sets. Our new attacks show that distillation alone provides only limited privacy across a number of domains. We explain the success of our attacks on distillation by showing that membership inference attacks on a private dataset can succeed even if the target model is *never* queried on any actual training points, but only on inputs whose predictions are highly influenced by training data. Finally, we show that our attacks are strongest when student and teacher sets are similar, or when the attacker can poison the teacher set.

* 16 pages, 12 figures

Randomness in ML Defenses Helps Persistent Attackers and Hinders Evaluators

Feb 27, 2023
Keane Lucas, Matthew Jagielski, Florian Tramèr, Lujo Bauer, Nicholas Carlini

It is becoming increasingly imperative to design robust ML defenses. However, recent work has found that many defenses that initially resist state-of-the-art attacks can be broken by an adaptive adversary. In this work we take steps to simplify the design of defenses and argue that white-box defenses should eschew randomness when possible. We begin by illustrating a new issue with the deployment of randomized defenses that reduces their security compared to their deterministic counterparts. We then provide evidence that making defenses deterministic simplifies robustness evaluation, without reducing the effectiveness of a truly robust defense. Finally, we introduce a new defense evaluation framework that leverages a defense's deterministic nature to better evaluate its adversarial robustness.

Poisoning Web-Scale Training Datasets is Practical

Feb 20, 2023
Nicholas Carlini, Matthew Jagielski, Christopher A. Choquette-Choo, Daniel Paleka, Will Pearce, Hyrum Anderson, Andreas Terzis, Kurt Thomas, Florian Tramèr

Deep learning models are often trained on distributed, web-scale datasets crawled from the internet. In this paper, we introduce two new dataset poisoning attacks that intentionally introduce malicious examples to degrade a model's performance. Our attacks are immediately practical and could, today, poison 10 popular datasets. Our first attack, split-view poisoning, exploits the mutable nature of internet content to ensure a dataset annotator's initial view of the dataset differs from the view downloaded by subsequent clients. By exploiting specific invalid trust assumptions, we show how we could have poisoned 0.01% of the LAION-400M or COYO-700M datasets for just $60 USD. Our second attack, frontrunning poisoning, targets web-scale datasets that periodically snapshot crowd-sourced content -- such as Wikipedia -- where an attacker only needs a time-limited window to inject malicious examples. In light of both attacks, we notified the maintainers of each affected dataset and recommended several low-overhead defenses.
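
To give a sense of scale for the 0.01% figure, here is a small worked calculation. It assumes the nominal dataset sizes implied by the names (400 million and 700 million examples) and, for the per-example cost, that the full $60 budget is spent on LAION-400M alone; neither assumption is stated in the abstract.

```python
# Nominal sizes inferred from the dataset names (an assumption).
NOMINAL_SIZES = {
    "LAION-400M": 400_000_000,
    "COYO-700M": 700_000_000,
}
POISON_FRACTION = 0.01 / 100  # 0.01% expressed as a fraction
BUDGET_USD = 60.0


def poisoned_count(dataset_size: int, fraction: float) -> int:
    """Number of examples that the given fraction of the dataset represents."""
    return round(dataset_size * fraction)


counts = {name: poisoned_count(size, POISON_FRACTION) for name, size in NOMINAL_SIZES.items()}

# Per-example cost, assuming the whole budget goes to LAION-400M.
laion_cost_per_example = BUDGET_USD / counts["LAION-400M"]

for name, count in counts.items():
    print(f"{name}: {count} poisoned examples")
print(f"LAION-400M cost per poisoned example: ${laion_cost_per_example:.4f}")
```

Under these assumptions, 0.01% corresponds to tens of thousands of examples at a fraction of a cent each.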

Tight Auditing of Differentially Private Machine Learning

Feb 15, 2023
Milad Nasr, Jamie Hayes, Thomas Steinke, Borja Balle, Florian Tramèr, Matthew Jagielski, Nicholas Carlini, Andreas Terzis

Auditing mechanisms for differential privacy use probabilistic means to empirically estimate the privacy level of an algorithm. For private machine learning, existing auditing mechanisms are tight: the empirical privacy estimate (nearly) matches the algorithm's provable privacy guarantee. But these auditing techniques suffer from two limitations. First, they only give tight estimates under implausible worst-case assumptions (e.g., a fully adversarial dataset). Second, they require thousands or millions of training runs to produce non-trivial statistical estimates of the privacy leakage. This work addresses both issues. We design an improved auditing scheme that yields tight privacy estimates for natural (not adversarially crafted) datasets -- if the adversary can see all model updates during training. Prior auditing works rely on the same assumption, which is permitted under the standard differential privacy threat model. This threat model is also applicable, e.g., in federated learning settings. Moreover, our auditing scheme requires only two training runs (instead of thousands) to produce tight privacy estimates, by adapting recent advances in tight composition theorems for differential privacy. We demonstrate the utility of our improved auditing schemes by surfacing implementation bugs in private machine learning code that eluded prior auditing techniques.
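
For readers unfamiliar with how an attack yields an empirical privacy estimate, the sketch below uses the standard hypothesis-testing view of $(\varepsilon, \delta)$-differential privacy: any membership test with false positive rate FPR and false negative rate FNR must satisfy $\mathrm{FPR} + e^{\varepsilon}\,\mathrm{FNR} \ge 1 - \delta$ and the symmetric inequality. This is a generic textbook bound, not the auditing scheme proposed in the paper, and the function name `empirical_epsilon` is hypothetical. A real audit would also replace the point estimates with statistical confidence bounds.

```python
import math


def empirical_epsilon(fpr: float, fnr: float, delta: float = 0.0) -> float:
    """Lower bound on epsilon implied by an attack's observed error rates.

    Uses the (epsilon, delta)-DP hypothesis-testing inequalities
        fpr + exp(eps) * fnr >= 1 - delta
        fnr + exp(eps) * fpr >= 1 - delta
    and treats fpr and fnr as exact (no confidence intervals).
    """
    best = 0.0
    for a, b in ((fpr, fnr), (fnr, fpr)):
        numerator = 1.0 - delta - a
        if numerator <= 0.0:
            continue  # this inequality gives no information
        if b == 0.0:
            return math.inf  # a perfect attack is incompatible with any finite epsilon
        best = max(best, math.log(numerator / b))
    return best


# Illustrative error rates only.
print(empirical_epsilon(fpr=0.05, fnr=0.05))
print(empirical_epsilon(fpr=0.5, fnr=0.5))
```

An audit is called tight when a lower bound of this kind comes close to the $\varepsilon$ that the algorithm's analysis guarantees.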

Effective Robustness against Natural Distribution Shifts for Models with Different Training Data

Feb 02, 2023
Zhouxing Shi, Nicholas Carlini, Ananth Balashankar, Ludwig Schmidt, Cho-Jui Hsieh, Alex Beutel, Yao Qin

"Effective robustness" measures the extra out-of-distribution (OOD) robustness beyond what can be predicted from the in-distribution (ID) performance. Existing effective robustness evaluations typically use a single test set such as ImageNet to evaluate ID accuracy. This becomes problematic when evaluating models trained on different data distributions, e.g., comparing models trained on ImageNet vs. zero-shot language-image pre-trained models trained on LAION. In this paper, we propose a new effective robustness evaluation metric to compare the effective robustness of models trained on different data distributions. To do this we control for the accuracy on multiple ID test sets that cover the training distributions for all the evaluated models. Our new evaluation metric provides a better estimate of the effective robustness and explains the surprising effective robustness gains of zero-shot CLIP-like models exhibited when considering only one ID dataset, while the gains diminish under our evaluation.
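
The definition in the first sentence can be sketched as follows for the conventional single-ID-test-set case. The sketch assumes a baseline fitted by least squares on logit-transformed accuracies, which is a common convention rather than something stated in the abstract; the function names and all accuracy values are hypothetical. The metric proposed in the paper instead controls for accuracy on multiple ID test sets, which this sketch does not implement.

```python
import math


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_baseline(id_accs, ood_accs):
    """Least-squares line predicting logit(OOD accuracy) from logit(ID accuracy)."""
    xs = [logit(a) for a in id_accs]
    ys = [logit(a) for a in ood_accs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def effective_robustness(id_acc, ood_acc, slope, intercept):
    """OOD accuracy minus the OOD accuracy the baseline predicts from ID accuracy."""
    predicted_ood = sigmoid(slope * logit(id_acc) + intercept)
    return ood_acc - predicted_ood


# Made-up accuracies for a set of baseline models and one model under evaluation.
baseline_id = [0.60, 0.70, 0.80]
baseline_ood = [0.40, 0.50, 0.62]
slope, intercept = fit_baseline(baseline_id, baseline_ood)
print(effective_robustness(id_acc=0.75, ood_acc=0.70, slope=slope, intercept=intercept))
```

A positive value means the model is more accurate out of distribution than its ID accuracy alone would predict under the fitted baseline.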

Extracting Training Data from Diffusion Models

Jan 30, 2023
Nicholas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramèr, Borja Balle, Daphne Ippolito, Eric Wallace

Image diffusion models such as DALL-E 2, Imagen, and Stable Diffusion have attracted significant attention due to their ability to generate high-quality synthetic images. In this work, we show that diffusion models memorize individual images from their training data and emit them at generation time. With a generate-and-filter pipeline, we extract over a thousand training examples from state-of-the-art models, ranging from photographs of individual people to trademarked company logos. We also train hundreds of diffusion models in various settings to analyze how different modeling and data decisions affect privacy. Overall, our results show that diffusion models are much less private than prior generative models such as GANs, and that mitigating these vulnerabilities may require new advances in privacy-preserving training.

Publishing Efficient On-device Models Increases Adversarial Vulnerability

Dec 28, 2022
Sanghyun Hong, Nicholas Carlini, Alexey Kurakin

Recent increases in the computational demands of deep neural networks (DNNs) have sparked interest in efficient deep learning mechanisms, e.g., quantization or pruning. These mechanisms enable the construction of a small, efficient version of commercial-scale models with comparable accuracy, accelerating their deployment to resource-constrained devices. In this paper, we study the security considerations of publishing on-device variants of large-scale models. We first show that an adversary can exploit on-device models to make attacking the large models easier. In evaluations across 19 DNNs, by exploiting the published on-device models as a transfer prior, the adversarial vulnerability of the original commercial-scale models increases by up to 100x. We then show that the vulnerability increases as the similarity between a full-scale model and its efficient counterpart increases. Based on these insights, we propose a defense, similarity-unpairing, that fine-tunes on-device models with the objective of reducing the similarity. We evaluated our defense on all the 19 DNNs and found that it reduces transferability by up to 90% and the number of queries required by a factor of 10-100x. Our results suggest that further research is needed on the security (or even privacy) threats caused by publishing those efficient siblings.

* Accepted to IEEE SaTML 2023

Considerations for Differentially Private Learning with Large-Scale Public Pretraining

Dec 13, 2022
Florian Tramèr, Gautam Kamath, Nicholas Carlini

The performance of differentially private machine learning can be boosted significantly by leveraging the transfer learning capabilities of non-private models pretrained on large public datasets. We critically review this approach. We primarily question whether the use of large Web-scraped datasets should be viewed as differential-privacy-preserving. We caution that publicizing these models pretrained on Web data as "private" could lead to harm and erode the public's trust in differential privacy as a meaningful definition of privacy. Beyond the privacy considerations of using public data, we further question the utility of this paradigm. We scrutinize whether existing machine learning benchmarks are appropriate for measuring the ability of pretrained models to generalize to sensitive domains, which may be poorly represented in public Web data. Finally, we notice that pretraining has been especially impactful for the largest available models -- models sufficiently large to prohibit end users running them on their own devices. Thus, deploying such models today could be a net loss for privacy, as it would require (private) data to be outsourced to a more compute-powerful third party. We conclude by discussing potential paths forward for the field of private learning, as public pretraining becomes more popular and powerful.
