Bing Liu

Jack

Audio MultiChallenge: A Multi-Turn Evaluation of Spoken Dialogue Systems on Natural Human Interaction

Dec 16, 2025 (via arXiv)

AnaCP: Toward Upper-Bound Continual Learning via Analytic Contrastive Projection

Nov 17, 2025 (via arXiv)

PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning

Nov 14, 2025 (via arXiv)

Negative Entity Suppression for Zero-Shot Captioning with Synthetic Images

Nov 12, 2025 (via arXiv)

ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents

Nov 10, 2025 (via arXiv)

Remote Labor Index: Measuring AI Automation of Remote Work

Oct 30, 2025 (via arXiv)

Beyond Seeing: Evaluating Multimodal LLMs on Tool-Enabled Image Perception, Transformation, and Reasoning

Oct 14, 2025 (via arXiv)

MultiNRC: A Challenging and Native Multilingual Reasoning Evaluation Benchmark for LLMs

Jul 23, 2025 (via arXiv)

Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

Jul 23, 2025 (via arXiv)

ImprovDML: Improved Trade-off in Private Byzantine-Resilient Distributed Machine Learning

Jun 18, 2025 (via arXiv)