Wenyue Hua

MAGPIE: A dataset for Multi-AGent contextual PrIvacy Evaluation

Jun 25, 2025

Semantic Scheduling for LLM Inference

Jun 13, 2025

THOUGHTTERMINATOR: Benchmarking, Calibrating, and Mitigating Overthinking in Reasoning Models

Apr 17, 2025

REALM: A Dataset of Real-World LLM Use Cases

Mar 24, 2025

AgentOrca: A Dual-System Framework to Evaluate Language Agents on Operational Routine and Constraint Adherence

Mar 11, 2025

InductionBench: LLMs Fail in the Simplest Complexity Class

Feb 26, 2025

Towards a Design Guideline for RPA Evaluation: A Survey of Large Language Model-Based Role-Playing Agents

Feb 18, 2025

ADO: Automatic Data Optimization for Inputs in LLM Prompts

Feb 17, 2025

Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense

Jan 5, 2025

RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios

Dec 12, 2024