Andy Zou

Shammie

Improving Alignment and Robustness with Circuit Breakers

Jun 10, 2024

Improving Alignment and Robustness with Short Circuiting

Jun 06, 2024

Lessons from the Trenches on Reproducible Evaluation of Language Models

May 23, 2024

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

Mar 06, 2024

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Feb 06, 2024

Representation Engineering: A Top-Down Approach to AI Transparency

Oct 10, 2023

Universal and Transferable Adversarial Attacks on Aligned Language Models

Jul 27, 2023

Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark

Apr 06, 2023

Papaya: Federated Learning, but Fully Decentralized

Mar 10, 2023

Scaling in Depth: Unlocking Robustness Certification on ImageNet

Jan 29, 2023