Dan Hendrycks

UC Berkeley

Improving Alignment and Robustness with Circuit Breakers
Jun 10, 2024

Improving Alignment and Robustness with Short Circuiting
Jun 06, 2024

Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression
Mar 18, 2024

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
Mar 06, 2024

Uncovering Latent Human Wellbeing in Language Model Embeddings
Feb 19, 2024

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Feb 06, 2024

Can LLMs Follow Simple Rules?
Nov 06, 2023

Representation Engineering: A Top-Down Approach to AI Transparency
Oct 10, 2023

Identifying and Mitigating the Security Risks of Generative AI
Aug 28, 2023

AI Deception: A Survey of Examples, Risks, and Potential Solutions
Aug 28, 2023