Jiashuo Liu

TERMS-Bench: Diagnosing LLM Negotiation Agents Beyond Deal Rate

May 13, 2026

Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies

May 05, 2026

Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters

Feb 11, 2026

WorldTravel: A Realistic Multimodal Travel-Planning Benchmark with Tightly Coupled Constraints

Feb 09, 2026

TabularMath: Evaluating Computational Extrapolation in Tabular Learning via Program-Verified Synthesis

Jan 25, 2026

FutureX-Pro: Extending Future Prediction to High-Value Vertical Domains

Jan 18, 2026

AInsteinBench: Benchmarking Coding Agents on Scientific Repositories

Dec 24, 2025

LLM Swiss Round: Aggregating Multi-Benchmark Performance via Competitive Swiss-System Dynamics

Dec 24, 2025

Mitigating LLM Hallucination via Behaviorally Calibrated Reinforcement Learning

Dec 22, 2025

DiscoX: Benchmarking Discourse-Level Translation task in Expert Domains

Nov 14, 2025