Carson Denison

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

Jun 17, 2024

Gradient-Based Language Model Red Teaming

Jan 30, 2024

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Jan 17, 2024

Question Decomposition Improves the Faithfulness of Model-Generated Reasoning

Jul 25, 2023

Measuring Faithfulness in Chain-of-Thought Reasoning

Jul 17, 2023

How to DP-fy ML: A Practical Guide to Machine Learning with Differential Privacy

Mar 02, 2023

Private Ad Modeling with DP-SGD

Nov 21, 2022