Xuezhi Cao

AMO-Bench: Large Language Models Still Struggle in High School Math Competitions

Oct 30, 2025
Via arXiv

CATArena: Evaluation of LLM Agents through Iterative Tournament Competitions

Oct 30, 2025
Via arXiv

Making Mathematical Reasoning Adaptive

Oct 06, 2025
Via arXiv

MUSE: MCTS-Driven Red Teaming Framework for Enhanced Multi-Turn Dialogue Safety in Large Language Models

Sep 18, 2025
Via arXiv

Instance-level Randomization: Toward More Stable LLM Evaluations

Sep 16, 2025
Via arXiv

HKD4VLM: A Progressive Hybrid Knowledge Distillation Framework for Robust Multimodal Hallucination and Factuality Detection in VLMs

Jun 16, 2025
Via arXiv

OIBench: Benchmarking Strong Reasoning Models with Olympiad in Informatics

Jun 12, 2025
Via arXiv

NTIRE 2025 challenge on Text to Image Generation Model Quality Assessment

May 22, 2025
Via arXiv

ViC-Bench: Benchmarking Visual-Interleaved Chain-of-Thought Capability in MLLMs with Free-Style Intermediate State Representations

May 20, 2025
Via arXiv

Why Not Act on What You Know? Unleashing Safety Potential of LLMs via Self-Aware Guard Enhancement

May 17, 2025
Via arXiv