Yonatan Belinkov

Structured RAG for Answering Aggregative Questions

Nov 11, 2025
Via arXiv

ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs

Oct 01, 2025
Via arXiv

Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs

Jul 09, 2025
Via arXiv

Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs

Jun 11, 2025
Via arXiv

SAEs Are Good for Steering -- If You Select the Right Features

May 26, 2025
Via arXiv

Language Models use Lookbacks to Track Beliefs

May 20, 2025
Via arXiv

MIB: A Mechanistic Interpretability Benchmark

Apr 17, 2025
Via arXiv

Follow the Flow: On Information Flow Across Textual Tokens in Text-to-Image Models

Apr 01, 2025
Via arXiv

How Generative IR Retrieves Documents Mechanistically

Mar 25, 2025
Via arXiv

Inside-Out: Hidden Factual Knowledge in LLMs

Mar 19, 2025
Via arXiv