LLM evals


One-Eval: An Agentic System for Automated and Traceable LLM Evaluation

Mar 10, 2026
Via arXiv

Building a Strong Instruction Language Model for a Less-Resourced Language

Mar 02, 2026
Via arXiv

Qwen-BIM: developing large language model for BIM-based design with domain-specific benchmark and dataset

Feb 24, 2026
Via arXiv

Elo-Evolve: A Co-evolutionary Framework for Language Model Alignment

Feb 14, 2026
Via arXiv

When Tables Go Crazy: Evaluating Multimodal Models on French Financial Documents

Feb 12, 2026
Via arXiv

Learning to Judge: LLMs Designing and Applying Evaluation Rubrics

Feb 09, 2026
Via arXiv

Wiki Live Challenge: Challenging Deep Research Agents with Expert-Level Wikipedia Articles

Feb 03, 2026
Via arXiv

PEARL: Plan Exploration and Adaptive Reinforcement Learning for Multihop Tool Use

Jan 28, 2026
Via arXiv

TerraFormer: Automated Infrastructure-as-Code with LLMs Fine-Tuned via Policy-Guided Verifier Feedback

Jan 13, 2026
Via arXiv

Measuring all the noises of LLM Evals

Dec 24, 2025
Via arXiv