
Nevan Wichers

Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment

Oct 06, 2025
Via arXiv

Visualizing Neural Network Imagination

May 10, 2024
Via arXiv

Gradient-Based Language Model Red Teaming

Jan 30, 2024
Via arXiv

DRLC: Reinforcement Learning with Dense Rewards from LLM Critic

Jan 14, 2024
Via arXiv

Fusion-Eval: Integrating Evaluators with LLMs

Nov 15, 2023
Via arXiv

SiRA: Sparse Mixture of Low Rank Adaptation

Nov 15, 2023
Via arXiv

SAFER: Data-Efficient and Safe Reinforcement Learning via Skill Acquisition

Feb 10, 2022
Via arXiv

ActionBert: Leveraging User Actions for Semantic Understanding of User Interfaces

Jan 25, 2021
Via arXiv

RL agents Implicitly Learning Human Preferences

Feb 14, 2020
Via arXiv

Resolving Spurious Correlations in Causal Models of Environments via Interventions

Feb 12, 2020
Via arXiv