Paul Christiano

Towards a Law of Iterated Expectations for Heuristic Estimators

Oct 02, 2024

Backdoor defense, learnability and obfuscation

Sep 04, 2024

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Jan 17, 2024

Evaluating Language-Model Agents on Realistic Autonomous Tasks

Jan 04, 2024

Model evaluation for extreme risks

May 24, 2023

Formalizing the presumption of independence

Nov 12, 2022

Training language models to follow instructions with human feedback

Mar 04, 2022

Recursively Summarizing Books with Human Feedback

Sep 27, 2021

Learning to summarize from human feedback

Sep 02, 2020

Fine-Tuning Language Models from Human Preferences

Sep 18, 2019