Yunze Xiao

Sentipolis: Emotion-Aware Agents for Social Simulations

Jan 25, 2026
Via arXiv

The Confidence Dichotomy: Analyzing and Mitigating Miscalibration in Tool-Use Agents

Jan 12, 2026
Via arXiv

Toward Global Large Language Models in Medicine

Jan 05, 2026
Via arXiv

Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction

Dec 21, 2025
Via arXiv

Humanizing Machines: Rethinking LLM Anthropomorphism Through a Multi-Level Framework of Design

Aug 25, 2025
Via arXiv

Synthetic Socratic Debates: Examining Persona Effects on Moral Decision and Persuasion Dynamics

Jun 14, 2025
Via arXiv

Embracing Contradiction: Theoretical Inconsistency Will Not Impede the Road of Building Responsible AI Systems

May 23, 2025
Via arXiv

JiraiBench: A Bilingual Benchmark for Evaluating Large Language Models' Detection of Human Self-Destructive Behavior Content in Jirai Community

Mar 27, 2025
Via arXiv

MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation

Mar 13, 2025
Via arXiv

ToxiCloakCN: Evaluating Robustness of Offensive Language Detection in Chinese with Cloaking Perturbations

Jun 18, 2024
Via arXiv