Percy Liang

Shammie

MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks

May 26, 2025

BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems

May 21, 2025

Extracting memorized pieces of (copyrighted) books from open-weight language models

May 18, 2025

MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering

May 12, 2025

Reliable and Efficient Amortized Model-based Evaluation

Mar 17, 2025

Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

Feb 27, 2025

The Mighty ToRR: A Benchmark for Table Reasoning and Robustness

Feb 26, 2025

Independence Tests for Language Models

Feb 17, 2025

Auditing Prompt Caching in Language Model APIs

Feb 11, 2025

Language Models Prefer What They Know: Relative Confidence Estimation via Confidence Preferences

Feb 03, 2025