Leshem Choshen

Mediocrity is the key for LLM as a Judge Anchor Selection

Mar 17, 2026

CUBE: A Standard for Unifying Agent Benchmarks

Mar 16, 2026

Resolving Interference (RI): Disentangling Models for Improved Model Merging

Mar 13, 2026

Do LLMs Benefit From Their Own Words?

Feb 27, 2026

General Agent Evaluation

Feb 26, 2026

BabyLM Turns 4 and Goes Multilingual: Call for Papers for the 2026 BabyLM Workshop

Feb 24, 2026

When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation

Feb 18, 2026

Robustness as an Emergent Property of Task Performance

Feb 03, 2026

CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data

Jan 25, 2026

ErrorMap and ErrorAtlas: Charting the Failure Landscape of Large Language Models

Jan 22, 2026