Daniel M. Ziegler

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Jan 17, 2024

Adversarial Training for High-Stakes Reliability

May 04, 2022

Recursively Summarizing Books with Human Feedback

Sep 27, 2021

Scaling Laws for Autoregressive Generative Modeling

Nov 06, 2020

Learning to summarize from human feedback

Sep 02, 2020

Language Models are Few-Shot Learners

Jun 05, 2020

Fine-Tuning Language Models from Human Preferences

Sep 18, 2019