Valerie Chen

Comparing Developer and LLM Biases in Code Evaluation

Mar 25, 2026

RCTs & Human Uplift Studies: Methodological Challenges and Practical Solutions for Frontier AI Evaluation

Mar 11, 2026

A Rubric-Supervised Critic from Sparse Real-World Outcomes

Mar 04, 2026

How Well Does Agent Development Reflect Real-World Work?

Mar 01, 2026

GameDevBench: Evaluating Agentic Capabilities Through Game Development

Feb 11, 2026

SWE-Tester: Training Open-Source LLMs for Issue Reproduction in Real-World Repositories

Jan 20, 2026

The OpenHands Software Agent SDK: A Composable and Extensible Foundation for Production Agents

Nov 05, 2025

Completion $\neq$ Collaboration: Scaling Collaborative Effort with Agents

Oct 30, 2025

Beyond Memorization: Mapping the Originality-Quality Frontier of Language Models

Apr 13, 2025

The RealHumanEval: Evaluating Large Language Models' Abilities to Support Programmers

Apr 03, 2024