Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

David Wagner

Can LLMs Follow Simple Rules?

Nov 06, 2023

Norman Mu, Sarah Chen, Zifan Wang, Sizhe Chen, David Karamardian, Lulwa Aljeraisy, Dan Hendrycks, David Wagner

Figure 1 for Can LLMs Follow Simple Rules?

Figure 2 for Can LLMs Follow Simple Rules?

Figure 3 for Can LLMs Follow Simple Rules?

Figure 4 for Can LLMs Follow Simple Rules?

Abstract:As Large Language Models (LLMs) are deployed with increasing real-world responsibilities, it is important to be able to specify and constrain the behavior of these systems in a reliable manner. Model developers may wish to set explicit rules for the model, such as "do not generate abusive content", but these may be circumvented by jailbreaking techniques. Evaluating how well LLMs follow developer-provided rules in the face of adversarial inputs typically requires manual review, which slows down monitoring and methods development. To address this issue, we propose Rule-following Language Evaluation Scenarios (RuLES), a programmatic framework for measuring rule-following ability in LLMs. RuLES consists of 15 simple text scenarios in which the model is instructed to obey a set of rules in natural language while interacting with the human user. Each scenario has a concise evaluation program to determine whether the model has broken any rules in a conversation. Through manual exploration of model behavior in our scenarios, we identify 6 categories of attack strategies and collect two suites of test cases: one consisting of unique conversations from manual testing and one that systematically implements strategies from the 6 categories. Across various popular proprietary and open models such as GPT-4 and Llama 2, we find that all models are susceptible to a wide variety of adversarial hand-crafted user inputs, though GPT-4 is the best-performing model. Additionally, we evaluate open models under gradient-based attacks and find significant vulnerabilities. We propose RuLES as a challenging new setting for research into exploring and defending against both manual and automatic attacks on LLMs.

* Project website: https://eecs.berkeley.edu/~normanmu/llm_rules

Via

Access Paper or Ask Questions

Defending Against Transfer Attacks From Public Models

Oct 26, 2023

Chawin Sitawarin, Jaewon Chang, David Huang, Wesson Altoyan, David Wagner

Abstract:Adversarial attacks have been a looming and unaddressed threat in the industry. However, through a decade-long history of the robustness evaluation literature, we have learned that mounting a strong or optimal attack is challenging. It requires both machine learning and domain expertise. In other words, the white-box threat model, religiously assumed by a large majority of the past literature, is unrealistic. In this paper, we propose a new practical threat model where the adversary relies on transfer attacks through publicly available surrogate models. We argue that this setting will become the most prevalent for security-sensitive applications in the future. We evaluate the transfer attacks in this setting and propose a specialized defense method based on a game-theoretic perspective. The defenses are evaluated under 24 public models and 11 attack algorithms across three datasets (CIFAR-10, CIFAR-100, and ImageNet). Under this threat model, our defense, PubDef, outperforms the state-of-the-art white-box adversarial training by a large margin with almost no loss in the normal accuracy. For instance, on ImageNet, our defense achieves 62% accuracy under the strongest transfer attack vs only 36% of the best adversarially trained model. Its accuracy when not under attack is only 2% lower than that of an undefended model (78% vs 80%). We release our code at https://github.com/wagner-group/pubdef.

* Under submission. Code available at https://github.com/wagner-group/pubdef

Via

Access Paper or Ask Questions

DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection

Apr 01, 2023

Yizheng Chen, Zhoujie Ding, Xinyun Chen, David Wagner

Figure 1 for DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection

Figure 2 for DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection

Figure 3 for DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection

Figure 4 for DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection

Abstract:We propose and release a new vulnerable source code dataset. We curate the dataset by crawling security issue websites, extracting vulnerability-fixing commits and source codes from the corresponding projects. Our new dataset contains 150 CWEs, 26,635 vulnerable functions, and 352,606 non-vulnerable functions extracted from 7,861 commits. Our dataset covers 305 more projects than all previous datasets combined. We show that increasing the diversity and volume of training data improves the performance of deep learning models for vulnerability detection. Combining our new dataset with previous datasets, we present an analysis of the challenges and promising research directions of using deep learning for detecting software vulnerabilities. We study 11 model architectures belonging to 4 families. Our results show that deep learning is still not ready for vulnerability detection, due to high false positive rate, low F1 score, and difficulty of detecting hard CWEs. In particular, we demonstrate an important generalization challenge for the deployment of deep learning-based models. However, we also identify hopeful future research directions. We demonstrate that large language models (LLMs) are the future for vulnerability detection, outperforming Graph Neural Networks (GNNs) with manual feature engineering. Moreover, developing source code specific pre-training objectives is a promising research direction to improve the vulnerability detection performance.

Via

Access Paper or Ask Questions

Continuous Learning for Android Malware Detection

Feb 08, 2023

Yizheng Chen, Zhoujie Ding, David Wagner

Figure 1 for Continuous Learning for Android Malware Detection

Figure 2 for Continuous Learning for Android Malware Detection

Figure 3 for Continuous Learning for Android Malware Detection

Figure 4 for Continuous Learning for Android Malware Detection

Abstract:Machine learning methods can detect Android malware with very high accuracy. However, these classifiers have an Achilles heel, concept drift: they rapidly become out of date and ineffective, due to the evolution of malware apps and benign apps. Our research finds that, after training an Android malware classifier on one year's worth of data, the F1 score quickly dropped from 0.99 to 0.76 after 6 months of deployment on new test samples. In this paper, we propose new methods to combat the concept drift problem of Android malware classifiers. Since machine learning technique needs to be continuously deployed, we use active learning: we select new samples for analysts to label, and then add the labeled samples to the training set to retrain the classifier. Our key idea is, similarity-based uncertainty is more robust against concept drift. Therefore, we combine contrastive learning with active learning. We propose a new hierarchical contrastive learning scheme, and a new sample selection technique to continuously train the Android malware classifier. Our evaluation shows that this leads to significant improvements, compared to previously published methods for active learning. Our approach reduces the false negative rate from 16% (for the best baseline) to 10%, while maintaining the same false positive rate (0.6%). Also, our approach maintains more consistent performance across a seven-year time period than past methods.

Via

Access Paper or Ask Questions

REAP: A Large-Scale Realistic Adversarial Patch Benchmark

Dec 12, 2022

Nabeel Hingun, Chawin Sitawarin, Jerry Li, David Wagner

Abstract:Machine learning models are known to be susceptible to adversarial perturbation. One famous attack is the adversarial patch, a sticker with a particularly crafted pattern that makes the model incorrectly predict the object it is placed on. This attack presents a critical threat to cyber-physical systems that rely on cameras such as autonomous cars. Despite the significance of the problem, conducting research in this setting has been difficult; evaluating attacks and defenses in the real world is exceptionally costly while synthetic data are unrealistic. In this work, we propose the REAP (REalistic Adversarial Patch) benchmark, a digital benchmark that allows the user to evaluate patch attacks on real images, and under real-world conditions. Built on top of the Mapillary Vistas dataset, our benchmark contains over 14,000 traffic signs. Each sign is augmented with a pair of geometric and lighting transformations, which can be used to apply a digitally generated patch realistically onto the sign. Using our benchmark, we perform the first large-scale assessments of adversarial patch attacks under realistic conditions. Our experiments suggest that adversarial patch attacks may present a smaller threat than previously believed and that the success rate of an attack on simpler digital simulations is not predictive of its actual effectiveness in practice. We release our benchmark publicly at https://github.com/wagner-group/reap-benchmark.

* Code and benchmark can be found at https://github.com/wagner-group/reap-benchmark

Via

Access Paper or Ask Questions

Part-Based Models Improve Adversarial Robustness

Sep 15, 2022

Chawin Sitawarin, Kornrapat Pongmala, Yizheng Chen, Nicholas Carlini, David Wagner

Figure 1 for Part-Based Models Improve Adversarial Robustness

Figure 2 for Part-Based Models Improve Adversarial Robustness

Figure 3 for Part-Based Models Improve Adversarial Robustness

Figure 4 for Part-Based Models Improve Adversarial Robustness

Abstract:We show that combining human prior knowledge with end-to-end learning can improve the robustness of deep neural networks by introducing a part-based model for object classification. We believe that the richer form of annotation helps guide neural networks to learn more robust features without requiring more samples or larger models. Our model combines a part segmentation model with a tiny classifier and is trained end-to-end to simultaneously segment objects into parts and then classify the segmented object. Empirically, our part-based models achieve both higher accuracy and higher adversarial robustness than a ResNet-50 baseline on all three datasets. For instance, the clean accuracy of our part models is up to 15 percentage points higher than the baseline's, given the same level of robustness. Our experiments indicate that these models also reduce texture bias and yield better robustness against common corruptions and spurious correlations. The code is publicly available at https://github.com/chawins/adv-part-model.

* Code can be found at https://github.com/chawins/adv-part-model

Via

Access Paper or Ask Questions

SLIP: Self-supervision meets Language-Image Pre-training

Dec 23, 2021

Norman Mu, Alexander Kirillov, David Wagner, Saining Xie

Figure 1 for SLIP: Self-supervision meets Language-Image Pre-training

Figure 2 for SLIP: Self-supervision meets Language-Image Pre-training

Figure 3 for SLIP: Self-supervision meets Language-Image Pre-training

Figure 4 for SLIP: Self-supervision meets Language-Image Pre-training

Abstract:Recent work has shown that self-supervised pre-training leads to improvements over supervised learning on challenging visual recognition tasks. CLIP, an exciting new approach to learning with language supervision, demonstrates promising performance on a wide variety of benchmarks. In this work, we explore whether self-supervised learning can aid in the use of language supervision for visual representation learning. We introduce SLIP, a multi-task learning framework for combining self-supervised learning and CLIP pre-training. After pre-training with Vision Transformers, we thoroughly evaluate representation quality and compare performance to both CLIP and self-supervised learning under three distinct settings: zero-shot transfer, linear classification, and end-to-end finetuning. Across ImageNet and a battery of additional datasets, we find that SLIP improves accuracy by a large margin. We validate our results further with experiments on different model sizes, training schedules, and pre-training datasets. Our findings show that SLIP enjoys the best of both worlds: better performance than self-supervision (+8.1% linear accuracy) and language supervision (+5.2% zero-shot accuracy).

* Code: https://github.com/facebookresearch/SLIP

Via

Access Paper or Ask Questions

Learning Security Classifiers with Verified Global Robustness Properties

May 24, 2021

Yizheng Chen, Shiqi Wang, Yue Qin, Xiaojing Liao, Suman Jana, David Wagner

Figure 1 for Learning Security Classifiers with Verified Global Robustness Properties

Figure 2 for Learning Security Classifiers with Verified Global Robustness Properties

Figure 3 for Learning Security Classifiers with Verified Global Robustness Properties

Figure 4 for Learning Security Classifiers with Verified Global Robustness Properties

Abstract:Recent works have proposed methods to train classifiers with local robustness properties, which can provably eliminate classes of evasion attacks for most inputs, but not all inputs. Since data distribution shift is very common in security applications, e.g., often observed for malware detection, local robustness cannot guarantee that the property holds for unseen inputs at the time of deploying the classifier. Therefore, it is more desirable to enforce global robustness properties that hold for all inputs, which is strictly stronger than local robustness. In this paper, we present a framework and tools for training classifiers that satisfy global robustness properties. We define new notions of global robustness that are more suitable for security classifiers. We design a novel booster-fixer training framework to enforce global robustness properties. We structure our classifier as an ensemble of logic rules and design a new verifier to verify the properties. In our training algorithm, the booster increases the classifier's capacity, and the fixer enforces verified global robustness properties following counterexample guided inductive synthesis. To the best of our knowledge, the only global robustness property that has been previously achieved is monotonicity. Several previous works have defined global robustness properties, but their training techniques failed to achieve verified global robustness. In comparison, we show that we can train classifiers to satisfy different global robustness properties for three security datasets, and even multiple properties at the same time, with modest impact on the classifier's performance. For example, we train a Twitter spam account classifier to satisfy five global robustness properties, with 5.4% decrease in true positive rate, and 0.1% increase in false positive rate, compared to a baseline XGBoost model that doesn't satisfy any property.

Via

Access Paper or Ask Questions

Fighting Gradients with Gradients: Dynamic Defenses against Adversarial Attacks

May 18, 2021

Dequan Wang, An Ju, Evan Shelhamer, David Wagner, Trevor Darrell

Figure 1 for Fighting Gradients with Gradients: Dynamic Defenses against Adversarial Attacks

Figure 2 for Fighting Gradients with Gradients: Dynamic Defenses against Adversarial Attacks

Figure 3 for Fighting Gradients with Gradients: Dynamic Defenses against Adversarial Attacks

Figure 4 for Fighting Gradients with Gradients: Dynamic Defenses against Adversarial Attacks

Abstract:Adversarial attacks optimize against models to defeat defenses. Existing defenses are static, and stay the same once trained, even while attacks change. We argue that models should fight back, and optimize their defenses against attacks at test time. We propose dynamic defenses, to adapt the model and input during testing, by defensive entropy minimization (dent). Dent alters testing, but not training, for compatibility with existing models and train-time defenses. Dent improves the robustness of adversarially-trained defenses and nominally-trained models against white-box, black-box, and adaptive attacks on CIFAR-10/100 and ImageNet. In particular, dent boosts state-of-the-art defenses by 20+ points absolute against AutoAttack on CIFAR-10 at $\epsilon_\infty$ = 8/255.

Via

Access Paper or Ask Questions

Model-Agnostic Defense for Lane Detection against Adversarial Attack

Mar 01, 2021

Henry Xu, An Ju, David Wagner

Figure 1 for Model-Agnostic Defense for Lane Detection against Adversarial Attack

Figure 2 for Model-Agnostic Defense for Lane Detection against Adversarial Attack

Figure 3 for Model-Agnostic Defense for Lane Detection against Adversarial Attack

Figure 4 for Model-Agnostic Defense for Lane Detection against Adversarial Attack

Abstract:Susceptibility of neural networks to adversarial attack prompts serious safety concerns for lane detection efforts, a domain where such models have been widely applied. Recent work on adversarial road patches have successfully induced perception of lane lines with arbitrary form, presenting an avenue for rogue control of vehicle behavior. In this paper, we propose a modular lane verification system that can catch such threats before the autonomous driving system is misled while remaining agnostic to the particular lane detection model. Our experiments show that implementing the system with a simple convolutional neural network (CNN) can defend against a wide gamut of attacks on lane detection models. With a 10% impact to inference time, we can detect 96% of bounded non-adaptive attacks, 90% of bounded adaptive attacks, and 98% of patch attacks while preserving accurate identification at least 95% of true lanes, indicating that our proposed verification system is effective at mitigating lane detection security risks with minimal overhead.

* 6 pages, 6 figures, 3 tables. Part of AutoSec 2021 proceedings

Via

Access Paper or Ask Questions