Jeffrey Ladish

BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B

Oct 31, 2023
Pranav Gade, Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish

LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B

Oct 31, 2023
Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish

Constitutional AI: Harmlessness from AI Feedback

Dec 15, 2022
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamilė Lukošiūtė, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemí Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, Jared Kaplan

Measuring Progress on Scalable Oversight for Large Language Models

Nov 11, 2022
Samuel R. Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, Scott Heiner, Kamilė Lukošiūtė, Amanda Askell, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Christopher Olah, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Jackson Kernion, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Liane Lovitt, Nelson Elhage, Nicholas Schiefer, Nicholas Joseph, Noemí Mercado, Nova DasSarma, Robin Larson, Sam McCandlish, Sandipan Kundu, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Timothy Telleen-Lawton, Tom Brown, Tom Henighan, Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds, Ben Mann, Jared Kaplan