Iván Arcuschin

Automatically Finding Reward Model Biases

Feb 16, 2026
Via arXiv

Biases in the Blind Spot: Detecting What LLMs Fail to Mention

Feb 11, 2026
Via arXiv

Mind the Performance Gap: Capability-Behavior Trade-offs in Feature Steering

Feb 03, 2026
Via arXiv

Measuring Chain-of-Thought Monitorability Through Faithfulness and Verbosity

Oct 31, 2025
Via arXiv

MIB: A Mechanistic Interpretability Benchmark

Apr 17, 2025
Via arXiv

Chain-of-Thought Reasoning In The Wild Is Not Always Faithful

Mar 13, 2025
Via arXiv

InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques

Jul 19, 2024
Via arXiv