Picture for Tao Gui

Tao Gui

SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents

Add code
Feb 13, 2026
Viaarxiv icon

Advancing Block Diffusion Language Models for Test-Time Scaling

Add code
Feb 11, 2026
Viaarxiv icon

DFPO: Scaling Value Modeling via Distributional Flow towards Robust and Generalizable LLM Post-Training

Add code
Feb 05, 2026
Viaarxiv icon

Steering LLMs via Scalable Interactive Oversight

Add code
Feb 04, 2026
Viaarxiv icon

Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models

Add code
Feb 04, 2026
Viaarxiv icon

CL-bench: A Benchmark for Context Learning

Add code
Feb 03, 2026
Viaarxiv icon

Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation

Add code
Feb 03, 2026
Viaarxiv icon

ChartE$^{3}$: A Comprehensive Benchmark for End-to-End Chart Editing

Add code
Jan 29, 2026
Viaarxiv icon

Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment

Add code
Jan 20, 2026
Viaarxiv icon

Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models

Add code
Jan 20, 2026
Viaarxiv icon