Yonatan Belinkov

Differentiable Faithfulness Alignment for Cross-Model Circuit Transfer

Apr 27, 2026
Via arXiv

Reasoning Models Know What's Important, and Encode It in Their Activations

Apr 20, 2026
Via arXiv

From Feelings to Metrics: Understanding and Formalizing How Users Vibe-Test LLMs

Apr 16, 2026
Via arXiv

Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness

Apr 14, 2026
Via arXiv

Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism

Apr 10, 2026
Via arXiv

Pitfalls in Evaluating Interpretability Agents

Mar 20, 2026
Via arXiv

Induction Meets Biology: Mechanisms of Repeat Detection in Protein Language Models

Feb 26, 2026
Via arXiv

Mechanisms of AI Protein Folding in ESMFold

Feb 05, 2026
Via arXiv

Decomposing Query-Key Feature Interactions Using Contrastive Covariances

Feb 04, 2026
Via arXiv

Investigating the Development of Task-Oriented Communication in Vision-Language Models

Jan 28, 2026
Via arXiv