Alert button
Picture for Nelson Elhage

Nelson Elhage

Alert button

Language Models (Mostly) Know What They Know

Jul 16, 2022
Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec, Liane Lovitt, Kamal Ndousse, Catherine Olsson, Sam Ringer, Dario Amodei, Tom Brown, Jack Clark, Nicholas Joseph, Ben Mann, Sam McCandlish, Chris Olah, Jared Kaplan

Figure 1 for Language Models (Mostly) Know What They Know
Figure 2 for Language Models (Mostly) Know What They Know
Figure 3 for Language Models (Mostly) Know What They Know
Figure 4 for Language Models (Mostly) Know What They Know
Viaarxiv icon

Scaling Laws and Interpretability of Learning from Repeated Data

May 21, 2022
Danny Hernandez, Tom Brown, Tom Conerly, Nova DasSarma, Dawn Drain, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Tom Henighan, Tristan Hume, Scott Johnston, Ben Mann, Chris Olah, Catherine Olsson, Dario Amodei, Nicholas Joseph, Jared Kaplan, Sam McCandlish

Figure 1 for Scaling Laws and Interpretability of Learning from Repeated Data
Figure 2 for Scaling Laws and Interpretability of Learning from Repeated Data
Figure 3 for Scaling Laws and Interpretability of Learning from Repeated Data
Figure 4 for Scaling Laws and Interpretability of Learning from Repeated Data
Viaarxiv icon

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Apr 12, 2022
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, Jared Kaplan

Figure 1 for Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Figure 2 for Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Figure 3 for Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Figure 4 for Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Viaarxiv icon

A General Language Assistant as a Laboratory for Alignment

Dec 09, 2021
Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Jackson Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Jared Kaplan

Figure 1 for A General Language Assistant as a Laboratory for Alignment
Figure 2 for A General Language Assistant as a Laboratory for Alignment
Figure 3 for A General Language Assistant as a Laboratory for Alignment
Figure 4 for A General Language Assistant as a Laboratory for Alignment
Viaarxiv icon