Yu Su

OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation

Jun 05, 2025
Via arXiv

BioCLIP 2: Emergent Properties from Scaling Hierarchical Contrastive Learning

May 29, 2025
Via arXiv

RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments

May 28, 2025
Via arXiv

ARM: Adaptive Reasoning Model

May 26, 2025
Via arXiv

Confidence in Large Language Model Evaluation: A Bayesian Approach to Limited-Sample Challenges

Apr 30, 2025
Via arXiv

MICE for CATs: Model-Internal Confidence Estimation for Calibrating Agents with Tools

Apr 28, 2025
Via arXiv

Completing A Systematic Review in Hours instead of Months with Interactive AI Agents

Apr 21, 2025
Via arXiv

SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills

Apr 09, 2025
Via arXiv

An Illusion of Progress? Assessing the Current State of Web Agents

Apr 02, 2025
Via arXiv

Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems

Mar 31, 2025
Via arXiv