Alexander Pan

Foundational Challenges in Assuring Alignment and Safety of Large Language Models

Apr 15, 2024
Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh Lubana, Erik Jenner, Stephen Casper, Oliver Sourbut, Benjamin L. Edelman, Zhaowei Zhang, Mario Günther, Anton Korinek, Jose Hernandez-Orallo, Lewis Hammond, Eric Bigelow, Alexander Pan, Lauro Langosco, Tomasz Korbak, Heidi Zhang, Ruiqi Zhong, Seán Ó hÉigeartaigh, Gabriel Recchia, Giulio Corsi, Alan Chan, Markus Anderljung, Lilian Edwards, Yoshua Bengio, Danqi Chen, Samuel Albanie, Tegan Maharaj, Jakob Foerster, Florian Tramer, He He, Atoosa Kasirzadeh, Yejin Choi, David Krueger

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

Mar 06, 2024
Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Adam Khoja, Zhenqi Zhao, Ariel Herbert-Voss, Cort B. Breuer, Andy Zou, Mantas Mazeika, Zifan Wang, Palash Oswal, Weiran Liu, Adam A. Hunt, Justin Tienken-Harder, Kevin Y. Shih, Kemper Talley, John Guan, Russell Kaplan, Ian Steneker, David Campbell, Brad Jokubaitis, Alex Levinson, Jean Wang, William Qian, Kallol Krishna Karmakar, Steven Basart, Stephen Fitz, Mindy Levine, Ponnurangam Kumaraguru, Uday Tupakula, Vijay Varadharajan, Yan Shoshitaishvili, Jimmy Ba, Kevin M. Esvelt, Alexandr Wang, Dan Hendrycks

Feedback Loops With Language Models Drive In-Context Reward Hacking

Feb 09, 2024
Alexander Pan, Erik Jones, Meena Jagadeesan, Jacob Steinhardt

Representation Engineering: A Top-Down Approach to AI Transparency

Oct 10, 2023
Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, Dan Hendrycks

Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark

Apr 06, 2023
Alexander Pan, Chan Jun Shern, Andy Zou, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, Dan Hendrycks

The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models

Jan 10, 2022
Alexander Pan, Kush Bhatia, Jacob Steinhardt

Improving Robustness of Reinforcement Learning for Power System Control with Adversarial Training

Oct 19, 2021
Alexander Pan, Yongkyun Lee, Huan Zhang, Yize Chen, Yuanyuan Shi
