Jiashuo Liu

LLM Swiss Round: Aggregating Multi-Benchmark Performance via Competitive Swiss-System Dynamics

Dec 24, 2025
Via arXiv

AInsteinBench: Benchmarking Coding Agents on Scientific Repositories

Dec 24, 2025
Via arXiv

Mitigating LLM Hallucination via Behaviorally Calibrated Reinforcement Learning

Dec 22, 2025
Via arXiv

DiscoX: Benchmarking Discourse-Level Translation task in Expert Domains

Nov 14, 2025
Via arXiv

LPFQA: A Long-Tail Professional Forum-based Benchmark for LLM Evaluation

Nov 09, 2025
Via arXiv

RLoop: An Self-Improving Framework for Reinforcement Learning with Iterative Policy Initialization

Nov 06, 2025
Via arXiv

FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning

Sep 16, 2025
Via arXiv

LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence

Sep 03, 2025
Via arXiv

DRO: A Python Library for Distributionally Robust Optimization in Machine Learning

May 29, 2025
Via arXiv

Error Slice Discovery via Manifold Compactness

Jan 31, 2025
Via arXiv