Tianneng Shi

AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility

Jun 11, 2026
Via arXiv

Agents' Last Exam

Jun 03, 2026
Via arXiv

CyberGym-E2E: Scalable Real-World Benchmark for AI Agents' End-to-End Cybersecurity Capabilities

Jun 03, 2026
Via arXiv

DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents

May 06, 2026
Via arXiv

Autonomous Continual Learning of Computer-Use Agents for Environment Adaptation

Feb 10, 2026
Via arXiv

DevOps-Gym: Benchmarking AI Agents in Software DevOps Cycle

Jan 27, 2026
Via arXiv

FrontierCS: Evolving Challenges for Evolving Intelligence

Dec 17, 2025
Via arXiv

AgentXploit: End-to-End Redteaming of Black-Box AI Agents

May 09, 2025
Via arXiv

Progent: Programmable Privilege Control for LLM Agents

Apr 16, 2025
Via arXiv

Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs

Apr 07, 2025
Via arXiv