Teun van der Weij

The Elicitation Game: Evaluating Capability Elicitation Techniques

Feb 04, 2025
Via arXiv

Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models

Dec 02, 2024
Via arXiv

AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Jun 12, 2024
Via arXiv

Extending Activation Steering to Broad Skills and Multiple Behaviours

Mar 09, 2024
Via arXiv

Evaluating Shutdown Avoidance of Language Models in Textual Scenarios

Jul 03, 2023
Via arXiv