Dan Hendrycks

Uncovering Latent Human Wellbeing in Language Model Embeddings

Feb 19, 2024
Pedro Freire, ChengCheng Tan, Adam Gleave, Dan Hendrycks, Scott Emmons

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Feb 06, 2024
Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, Dan Hendrycks

Can LLMs Follow Simple Rules?

Nov 06, 2023
Norman Mu, Sarah Chen, Zifan Wang, Sizhe Chen, David Karamardian, Lulwa Aljeraisy, Dan Hendrycks, David Wagner

Representation Engineering: A Top-Down Approach to AI Transparency

Oct 10, 2023
Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, Dan Hendrycks

Identifying and Mitigating the Security Risks of Generative AI

Aug 28, 2023
Clark Barrett, Brad Boyd, Elie Bursztein, Nicholas Carlini, Brad Chen, Jihye Choi, Amrita Roy Chowdhury, Mihai Christodorescu, Anupam Datta, Soheil Feizi, Kathleen Fisher, Tatsunori Hashimoto, Dan Hendrycks, Somesh Jha, Daniel Kang, Florian Kerschbaum, Eric Mitchell, John Mitchell, Zulfikar Ramzan, Khawaja Shams, Dawn Song, Ankur Taly, Diyi Yang

AI Deception: A Survey of Examples, Risks, and Potential Solutions

Aug 28, 2023
Peter S. Park, Simon Goldstein, Aidan O'Gara, Michael Chen, Dan Hendrycks

An Overview of Catastrophic AI Risks

Jul 11, 2023
Dan Hendrycks, Mantas Mazeika, Thomas Woodside

DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

Jun 20, 2023
Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, Sang T. Truong, Simran Arora, Mantas Mazeika, Dan Hendrycks, Zinan Lin, Yu Cheng, Sanmi Koyejo, Dawn Song, Bo Li
