Chinese Question


FinMTM: A Multi-Turn Multimodal Benchmark for Financial Reasoning and Agent Evaluation

Feb 03, 2026
Via arXiv

The Mask of Civility: Benchmarking Chinese Mock Politeness Comprehension in Large Language Models

Feb 03, 2026
Via arXiv

Bias in the Ear of the Listener: Assessing Sensitivity in Audio Language Models Across Linguistic, Demographic, and Positional Variations

Feb 01, 2026
Via arXiv

JobResQA: A Benchmark for LLM Machine Reading Comprehension on Multilingual Résumés and JDs

Jan 30, 2026
Via arXiv

MRAG: Benchmarking Retrieval-Augmented Generation for Bio-medicine

Jan 23, 2026
Via arXiv

Mitigating Cultural Bias in LLMs via Multi-Agent Cultural Debate

Jan 17, 2026
Via arXiv

Chinese Labor Law Large Language Model Benchmark

Jan 15, 2026
Via arXiv

MCGA: A Multi-task Classical Chinese Literary Genre Audio Corpus

Jan 14, 2026
Via arXiv

The performances of the Chinese and U.S. Large Language Models on the Topic of Chinese Culture

Jan 07, 2026
Via arXiv

Knowing But Not Doing: Convergent Morality and Divergent Action in LLMs

Jan 12, 2026
Via arXiv