Evals


RefAlign: Representation Alignment for Reference-to-Video Generation

Mar 26, 2026

Agent Factories for High Level Synthesis: How Far Can General-Purpose Coding Agents Go in Hardware Optimization?

Mar 26, 2026

See, Remember, Explore: A Benchmark and Baselines for Streaming Spatial Reasoning

Mar 25, 2026

MuQ-Eval: An Open-Source Per-Sample Quality Metric for AI Music Generation Evaluation

Mar 24, 2026

A Foundation Model for Instruction-Conditioned In-Context Time Series Tasks

Mar 23, 2026

MOSS-TTSD: Text to Spoken Dialogue Generation

Mar 20, 2026

Physion-Eval: Evaluating Physical Realism in Generated Video via Human Reasoning

Mar 20, 2026

The $\mathbf{Y}$-Combinator for LLMs: Solving Long-Context Rot with $λ$-Calculus

Mar 20, 2026

From Words to Worlds: Benchmarking Cross-Cultural Cultural Understanding in Machine Translation

Mar 18, 2026

When the Specification Emerges: Benchmarking Faithfulness Loss in Long-Horizon Coding Agents

Mar 17, 2026