Yu Su

Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge

Jun 26, 2025
Via arXiv

OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation

Jun 05, 2025
Via arXiv

BioCLIP 2: Emergent Properties from Scaling Hierarchical Contrastive Learning

May 29, 2025
Via arXiv

RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments

May 28, 2025
Via arXiv

ARM: Adaptive Reasoning Model

May 26, 2025
Via arXiv

Confidence in Large Language Model Evaluation: A Bayesian Approach to Limited-Sample Challenges

Apr 30, 2025
Via arXiv

MICE for CATs: Model-Internal Confidence Estimation for Calibrating Agents with Tools

Apr 28, 2025
Via arXiv

Completing A Systematic Review in Hours instead of Months with Interactive AI Agents

Apr 21, 2025
Via arXiv

SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills

Apr 09, 2025
Via arXiv

An Illusion of Progress? Assessing the Current State of Web Agents

Apr 02, 2025
Via arXiv