
Samuel R. Bowman

Spontaneous Reward Hacking in Iterative Self-Refinement

Jul 05, 2024

Steering Without Side Effects: Improving Post-Deployment Control of Language Models

Jun 21, 2024

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

Jun 17, 2024

Let's Think Dot by Dot: Hidden Computation in Transformer Language Models

Apr 24, 2024

LLM Evaluators Recognize and Favor Their Own Generations

Apr 15, 2024

Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought

Mar 08, 2024

Debating with More Persuasive LLMs Leads to More Truthful Answers

Feb 15, 2024

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Jan 17, 2024

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

Nov 20, 2023

Debate Helps Supervise Unreliable Experts

Nov 15, 2023