Avi Schwarzschild

Forcing Diffuse Distributions out of Language Models

Apr 16, 2024

Yiming Zhang, Avi Schwarzschild, Nicholas Carlini, Zico Kolter, Daphne Ippolito

Abstract: Despite being trained specifically to follow user instructions, today's language models perform poorly when instructed to produce random outputs. For example, when prompted to pick a number uniformly between one and ten, Llama-2-13B-chat disproportionately favors the number five, and when tasked with picking a first name at random, Mistral-7B-Instruct chooses Avery 40 times more often than we would expect based on the U.S. population. When these language models are used for real-world tasks where diversity of outputs is crucial, such as language-model-assisted dataset construction, their inability to produce diffuse distributions over valid choices is a major hurdle. In this work, we propose a fine-tuning method that encourages language models to output distributions that are diffuse over valid outcomes. The methods we introduce generalize across a variety of tasks and distributions and make large language models practical for synthetic dataset generation with little human intervention.
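
To make the failure mode concrete, one can compare a model's sampled answers with the uniform distribution over the valid choices. The Python sketch below is only a measurement aid, not the fine-tuning method proposed in the paper; the function name `diffuseness_report` is hypothetical, and it assumes you have already collected a list of sampled answers from the model.

```python
import math
from collections import Counter


def diffuseness_report(samples, valid_choices):
    """Compare sampled answers with the uniform distribution over valid_choices.

    Returns (total_variation, normalized_entropy, invalid_fraction):
      - total_variation: distance to uniform in [0, 1] (0 means perfectly uniform);
        probability mass on invalid answers counts fully as deviation.
      - normalized_entropy: entropy of the valid answers divided by log(k),
        so 1.0 means uniform over the k valid choices.
      - invalid_fraction: share of samples that are not valid choices.
    """
    if not samples:
        raise ValueError("need at least one sample")
    if len(valid_choices) < 2:
        raise ValueError("need at least two valid choices")
    counts = Counter(samples)
    n = len(samples)
    k = len(valid_choices)
    valid = set(valid_choices)
    invalid_fraction = sum(c for s, c in counts.items() if s not in valid) / n
    probs = [counts[c] / n for c in valid_choices]  # Counter returns 0 for unseen choices
    total_variation = 0.5 * (sum(abs(p - 1.0 / k) for p in probs) + invalid_fraction)
    # Renormalize over valid answers before computing entropy.
    valid_mass = sum(probs)
    entropy = 0.0
    if valid_mass > 0:
        entropy = -sum((p / valid_mass) * math.log(p / valid_mass) for p in probs if p > 0)
    return total_variation, entropy / math.log(k), invalid_fraction
```

For the "pick a number between one and ten" prompt, one would call `diffuseness_report(answers, list(range(1, 11)))`; a perfectly uniform sampler gives a total variation of 0, while a model that always answered five would score 0.9.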

Benchmarking ChatGPT on Algorithmic Reasoning

Apr 04, 2024

Sean McLeish, Avi Schwarzschild, Tom Goldstein

Abstract: We evaluate ChatGPT's ability to solve algorithm problems from the CLRS benchmark suite, which was designed for graph neural networks (GNNs). The benchmark requires the use of a specified classical algorithm to solve a given problem. We find that ChatGPT, by using Python to solve these problems, outperforms specialist GNN models. This raises new points in the discussion about learning algorithms with neural networks.

Spotting LLMs With Binoculars: Zero-Shot Detection of Machine-Generated Text

Jan 22, 2024

Abhimanyu Hans, Avi Schwarzschild, Valeriia Cherepanova, Hamid Kazemi, Aniruddha Saha, Micah Goldblum, Jonas Geiping, Tom Goldstein

Abstract: Detecting text generated by modern large language models is thought to be hard, as both LLMs and humans can exhibit a wide range of complex behaviors. However, we find that a score based on contrasting two closely related language models is highly accurate at separating human-generated and machine-generated text. Based on this mechanism, we propose a novel LLM detector that only requires simple calculations using a pair of pre-trained LLMs. The method, called Binoculars, achieves state-of-the-art accuracy without any training data. It is capable of spotting machine text from a range of modern LLMs without any model-specific modifications. We comprehensively evaluate Binoculars on a number of text sources and in varied situations. Over a wide range of document types, Binoculars detects over 90% of generated samples from ChatGPT (and other LLMs) at a false positive rate of 0.01%, despite not being trained on any ChatGPT data.

* 20 pages, code available at https://github.com/ahans30/Binoculars
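
The score described above contrasts two related models. The Python sketch below illustrates the general shape of such a ratio under stated assumptions: it divides the log-perplexity of the text under one model (`logits_a`) by the cross-entropy between the second model's next-token distribution (`logits_b`) and the first model's. Which concrete model plays each role, and the detection threshold, should be taken from the paper and the linked repository; the function names and toy logits are hypothetical. As we understand the paper, lower scores point toward machine-generated text.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def binoculars_style_score(token_ids, logits_a, logits_b):
    """Ratio of log-perplexity under model A to the B-vs-A cross-entropy.

    token_ids[i] is the observed token at position i; logits_a[i] and
    logits_b[i] are each model's next-token logits for that position.
    """
    if not token_ids or not (len(token_ids) == len(logits_a) == len(logits_b)):
        raise ValueError("inputs must be non-empty and aligned")
    nll = 0.0            # sum over positions of -log p_A(observed token)
    cross_entropy = 0.0  # sum over positions of -sum_j p_B(j) * log p_A(j)
    for tok, la, lb in zip(token_ids, logits_a, logits_b):
        logp_a = log_softmax(la)
        p_b = [math.exp(v) for v in log_softmax(lb)]
        nll -= logp_a[tok]
        cross_entropy -= sum(pb * lpa for pb, lpa in zip(p_b, logp_a))
    # The 1/L averaging factors of numerator and denominator cancel in the ratio.
    return nll / cross_entropy
```

The normalization matters: a raw perplexity threshold penalizes prompts that are inherently surprising, whereas dividing by how surprising one model finds the other's predictions calibrates for that.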

TOFU: A Task of Fictitious Unlearning for LLMs

Jan 11, 2024

Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary C. Lipton, J. Zico Kolter

Abstract: Large language models trained on massive corpora of data from the web can memorize and reproduce sensitive or private data, raising both legal and ethical concerns. Unlearning, or tuning models to forget information present in their training data, provides us with a way to protect private data after training. Although several methods exist for such unlearning, it is unclear to what extent they result in models equivalent to those where the data to be forgotten was never learned in the first place. To address this challenge, we present TOFU, a Task of Fictitious Unlearning, as a benchmark aimed at helping deepen our understanding of unlearning. We offer a dataset of 200 diverse synthetic author profiles, each consisting of 20 question-answer pairs, and a subset of these profiles called the forget set that serves as the target for unlearning. We compile a suite of metrics that work together to provide a holistic picture of unlearning efficacy. Finally, we provide a set of baseline results from existing unlearning algorithms. Importantly, none of the baselines we consider show effective unlearning, motivating continued efforts to develop approaches for unlearning that effectively tune models so that they truly behave as if they were never trained on the forget data at all.

* https://locuslab.github.io/tofu/
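
The forget set is a subset of the synthetic author profiles, so a natural way to build one is to split at the author level, keeping all 20 question-answer pairs of an author on the same side. The Python sketch below assumes that author-level split; the 10% forget fraction, the seed, the function name, and the `(author_id, question_index)` placeholders are hypothetical stand-ins rather than the released TOFU data or splits.

```python
import random


def split_forget_retain(num_authors=200, qa_per_author=20, forget_fraction=0.1, seed=0):
    """Split QA pairs into forget/retain sets at the author level (hypothetical helper).

    Each QA pair is represented by an (author_id, question_index) placeholder tuple.
    """
    authors = list(range(num_authors))
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    num_forget = round(num_authors * forget_fraction)
    forget_authors = set(rng.sample(authors, num_forget))
    forget, retain = [], []
    for author in authors:
        pairs = [(author, q) for q in range(qa_per_author)]
        # All pairs of one author land on the same side of the split.
        (forget if author in forget_authors else retain).extend(pairs)
    return forget, retain
```

Splitting by author rather than by individual question prevents facts about a forgotten author from leaking into the retain set.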

Effective Backdoor Mitigation Depends on the Pre-training Objective

Dec 05, 2023

Sahil Verma, Gantavya Bhatt, Avi Schwarzschild, Soumye Singhal, Arnav Mohanty Das, Chirag Shah, John P Dickerson, Jeff Bilmes

Abstract: Despite the advanced capabilities of contemporary machine learning (ML) models, they remain vulnerable to adversarial and backdoor attacks. This vulnerability is particularly concerning in real-world deployments, where compromised models may exhibit unpredictable behavior in critical scenarios. Such risks are heightened by the prevalent practice of collecting massive, internet-sourced datasets for pre-training multimodal models, as these datasets may harbor backdoors. Various techniques have been proposed to mitigate the effects of backdooring in these models, such as CleanCLIP, the current state-of-the-art approach. In this work, we demonstrate that the efficacy of CleanCLIP in mitigating backdoors is highly dependent on the particular objective used during model pre-training. We observe that stronger pre-training objectives correlate with backdoor behaviors that are harder to remove. We show this by training multimodal models on two large datasets consisting of 3 million (CC3M) and 6 million (CC6M) datapoints under various pre-training objectives, followed by poison removal using CleanCLIP. We find that CleanCLIP is ineffective when stronger pre-training objectives are used, even with extensive hyperparameter tuning. Our findings underscore critical considerations for ML practitioners who pre-train models using large-scale web-curated data and are concerned about potential backdoor threats. Notably, our results suggest that simpler pre-training objectives are more amenable to effective backdoor removal. This insight is pivotal for practitioners seeking to balance the trade-offs between using stronger pre-training objectives and security against backdoor attacks.

* Accepted for oral presentation at BUGS workshop @ NeurIPS 2023 (https://neurips2023-bugs.github.io/)

NEFTune: Noisy Embeddings Improve Instruction Finetuning

Oct 10, 2023

Neel Jain, Ping-yeh Chiang, Yuxin Wen, John Kirchenbauer, Hong-Min Chu, Gowthami Somepalli, Brian R. Bartoldson, Bhavya Kailkhura, Avi Schwarzschild, Aniruddha Saha (+3 more)

Abstract: We show that language model finetuning can be improved, sometimes dramatically, with a simple augmentation. NEFTune adds noise to the embedding vectors during training. Standard finetuning of LLaMA-2-7B using Alpaca achieves 29.79% on AlpacaEval, which rises to 64.69% using noisy embeddings. NEFTune also improves over strong baselines on modern instruction datasets. Models trained with Evol-Instruct see a 10% improvement, with ShareGPT an 8% improvement, and with OpenPlatypus an 8% improvement. Even powerful models further refined with RLHF such as LLaMA-2-Chat benefit from additional training with NEFTune.

* 25 pages, code is available on GitHub: https://github.com/neelsjain/NEFTune
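
The abstract only says that noise is added to the embedding vectors during training. The Python sketch below assumes the scaling we recall from the NEFTune paper: i.i.d. noise drawn from Uniform(-1, 1), multiplied by alpha / sqrt(L * d) for a sequence of L tokens with d-dimensional embeddings. Verify the exact form and the choice of `alpha` against the linked repository; the function name and the default `alpha=5.0` are illustrative.

```python
import math
import random


def neftune_noise(embeddings, alpha=5.0, rng=None):
    """Add uniform noise scaled by alpha / sqrt(L * d) to an L x d embedding matrix.

    Assumed to be applied only during training; inference uses the clean embeddings.
    """
    if rng is None:
        rng = random.Random()
    seq_len = len(embeddings)
    dim = len(embeddings[0])
    scale = alpha / math.sqrt(seq_len * dim)
    # Each element gets independent Uniform(-1, 1) noise times the shared scale.
    return [[x + scale * rng.uniform(-1.0, 1.0) for x in row] for row in embeddings]
```

Because the scale shrinks with sqrt(L * d), the per-element perturbation stays small for long sequences and wide embeddings, so the total noise magnitude is controlled mainly by `alpha`.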

Baseline Defenses for Adversarial Attacks Against Aligned Language Models

Sep 04, 2023

Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, Tom Goldstein

Abstract: As large language models quickly become ubiquitous, it becomes critical to understand their security vulnerabilities. Recent work shows that text optimizers can produce jailbreaking prompts that bypass moderation and alignment. Drawing from the rich body of work on adversarial machine learning, we approach these attacks with three questions: What threat models are practically useful in this domain? How do baseline defense techniques perform in this new domain? How does LLM security differ from computer vision? We evaluate several baseline defense strategies against leading adversarial attacks on LLMs, discussing the various settings in which each is feasible and effective. In particular, we look at three types of defenses: detection (perplexity-based), input preprocessing (paraphrase and retokenization), and adversarial training. We discuss white-box and gray-box settings as well as the robustness-performance trade-off for each of the defenses considered. We find that the weakness of existing discrete optimizers for text, combined with the relatively high costs of optimization, makes standard adaptive attacks more challenging for LLMs. Future research will be needed to uncover whether more powerful optimizers can be developed, or whether the strength of filtering and preprocessing defenses is greater in the LLM domain than it has been in computer vision.

* 12 pages
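
A perplexity-based detector rests on the assumption that optimizer-generated adversarial text looks less natural to a language model than ordinary prompts. The Python sketch below flags a prompt whose perplexity, computed from per-token natural-log probabilities supplied by some language model, exceeds a threshold; the optional windowed variant scores the worst contiguous window so that a short suffix cannot hide inside a long benign prompt. The threshold, window size, and function names are hypothetical choices, not values from the paper.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from the natural-log probabilities of each token in a sequence."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def flag_by_perplexity(token_logprobs, threshold, window=None):
    """Return True if the prompt would be rejected by a perplexity filter.

    With window=None the whole prompt is scored; otherwise the maximum
    perplexity over contiguous windows of `window` tokens is used.
    """
    if window is None or window >= len(token_logprobs):
        return perplexity(token_logprobs) > threshold
    worst = max(
        perplexity(token_logprobs[i:i + window])
        for i in range(len(token_logprobs) - window + 1)
    )
    return worst > threshold
```

With a long benign prefix, averaging over the whole prompt can dilute a short high-perplexity suffix below the threshold, which is the case the windowed variant is meant to catch; the cost is a higher false-positive rate on benign prompts that contain unusual spans.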

A Cookbook of Self-Supervised Learning

Apr 24, 2023

Randall Balestriero, Mark Ibrahim, Vlad Sobal, Ari Morcos, Shashank Shekhar, Tom Goldstein, Florian Bordes, Adrien Bardes, Gregoire Mialon, Yuandong Tian (+9 more)

Abstract: Self-supervised learning, dubbed the dark matter of intelligence, is a promising path to advance machine learning. Yet, much like cooking, training SSL methods is a delicate art with a high barrier to entry. While many components are familiar, successfully training an SSL method involves a dizzying set of choices, from the pretext tasks to training hyperparameters. Our goal is to lower the barrier to entry into SSL research by laying out the foundations and latest SSL recipes in the style of a cookbook. We hope to empower the curious researcher to navigate the terrain of methods, understand the role of the various knobs, and gain the know-how required to explore how delicious SSL can be.

Reckoning with the Disagreement Problem: Explanation Consensus as a Training Objective

Mar 23, 2023

Avi Schwarzschild, Max Cembalest, Karthik Rao, Keegan Hines, John Dickerson

Abstract: As neural networks increasingly make critical decisions in high-stakes settings, monitoring and explaining their behavior in an understandable and trustworthy manner is a necessity. One commonly used type of explainer is post hoc feature attribution, a family of methods for giving each feature in an input a score corresponding to its influence on a model's output. A major limitation of this family of explainers in practice is that they can disagree on which features are more important than others. Our contribution in this paper is a method of training models with this disagreement problem in mind. We do this by introducing a Post hoc Explainer Agreement Regularization (PEAR) loss term: alongside the standard term corresponding to accuracy, we add a term that measures the difference in feature attribution between a pair of explainers. We observe on three datasets that we can train a model with this loss term to improve explanation consensus on unseen data, and we see improved consensus between explainers other than those used in the loss term. We examine the trade-off between improved consensus and model performance. Finally, we study the influence our method has on feature attribution explanations.
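
The objective described above has two parts: an accuracy term and a penalty on explainer disagreement. As a minimal illustration, the Python sketch below evaluates such a combined loss for a logistic-regression model, using plain gradient and gradient-times-input as the two explainers and 1 minus the cosine similarity of their absolute attributions as the disagreement measure. These choices, the weight `lam`, and the function names are assumptions made for brevity; the actual PEAR term is defined in the paper and may differ.

```python
import math


def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def _cosine(u, v):
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0  # treat an all-zero attribution as uninformative
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def consensus_regularized_loss(w, b, X, y, lam=0.5, eps=1e-12):
    """Mean binary cross-entropy plus lam * mean disagreement between two explainers,
    for a logistic-regression model with logit z = w . x + b (hypothetical stand-in)."""
    total_ce = 0.0
    total_disagreement = 0.0
    for x, label in zip(X, y):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        p = _sigmoid(z)
        total_ce -= label * math.log(p + eps) + (1 - label) * math.log(1.0 - p + eps)
        # Explainer 1: gradient of the logit w.r.t. the input (equal to w for this model).
        saliency = [abs(wi) for wi in w]
        # Explainer 2: gradient times input.
        grad_times_input = [abs(wi * xi) for wi, xi in zip(w, x)]
        total_disagreement += 1.0 - _cosine(saliency, grad_times_input)
    n = len(X)
    return total_ce / n + lam * total_disagreement / n
```

This version only evaluates the objective; training would minimize it with an automatic-differentiation framework, and raising `lam` trades task accuracy for agreement between the two explainers.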

Neural Auctions Compromise Bidder Information

Feb 28, 2023

Alex Stein, Avi Schwarzschild, Michael Curry, Tom Goldstein, John Dickerson

Abstract: Single-shot auctions are commonly used as a means to sell goods, for example when selling ad space or allocating radio frequencies. However, devising mechanisms for auctions with multiple bidders and multiple items can be complicated. It has been shown that neural networks can be used to approximate optimal mechanisms while satisfying the constraints that an auction be strategyproof and individually rational. We show that although such auctions maximize revenue, they do so at the cost of revealing private bidder information. While randomness is often used to build in privacy, in this context it comes with complications if done without care. Specifically, it can violate rationality and feasibility constraints, fundamentally change the incentive structure of the mechanism, and/or harm top-level metrics such as revenue and social welfare. We propose a method that employs stochasticity to improve privacy while meeting the requirements for auction mechanisms with only a modest sacrifice in revenue. We analyze the cost to the auction house that comes with introducing varying degrees of privacy in common auction settings. Our results show that despite current neural auctions' ability to approximate optimal mechanisms, the resulting vulnerability that comes with relying on neural networks must be accounted for.
