Yichang Zhang

RMTBench: Benchmarking LLMs Through Multi-Turn User-Centric Role-Playing

Jul 27, 2025

MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation

May 26, 2025

Qwen3 Technical Report

May 14, 2025

PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts

Apr 30, 2025

HellaSwag-Pro: A Large-Scale Bilingual Benchmark for Evaluating the Robustness of LLMs in Commonsense Reasoning

Feb 17, 2025

CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings

Jan 03, 2025

Confidence v.s. Critique: A Decomposition of Self-Correction Capability for LLMs

Dec 27, 2024

Qwen2.5 Technical Report

Dec 19, 2024

Evaluating and Aligning CodeLLMs on Human Preference

Dec 06, 2024

Language Models can Self-Lengthen to Generate Long Texts

Oct 31, 2024