David Lindner

On scalable oversight with weak LLMs judging strong LLMs

Jul 05, 2024

Evaluating Frontier Models for Dangerous Capabilities

Mar 20, 2024

Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning

Oct 19, 2023

RLHF-Blender: A Configurable Interactive Interface for Learning from Diverse Human Feedback

Aug 08, 2023

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Jul 27, 2023

Learning Safety Constraints from Demonstrations with Unknown Rewards

May 25, 2023

Tracr: Compiled Transformers as a Laboratory for Interpretability

Jan 12, 2023

Red-Teaming the Stable Diffusion Safety Filter

Oct 11, 2022

Active Exploration for Inverse Reinforcement Learning

Jul 18, 2022

Humans are not Boltzmann Distributions: Challenges and Opportunities for Modelling Human Feedback and Interaction in Reinforcement Learning

Jun 27, 2022