Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Andrew Schwartz

Open-World Evaluations for Measuring Frontier AI Capabilities

May 19, 2026

Sayash Kapoor, Peter Kirgis, Andrew Schwartz, Stephan Rabanser, J. J. Allaire, Rishi Bommasani, Harry Coppock, Magda Dubois, Gillian K Hadfield, Andrew B. Hall(+8 more)

Abstract:Benchmark-based evaluation remains important for tracking frontier AI progress. But it can both overstate and understate deployed capability because it privileges tasks that can be precisely specified, automatically graded, easy to optimize for, and run with low budgets and short time horizons. We advocate for a complementary class of evaluations, which we term open-world evaluations: long-horizon, messy, real-world tasks assessed through small-sample qualitative analysis rather than benchmark-scale automation. In this paper we survey recent open-world evaluations, identify their strengths and limitations, and introduce CRUX (Collaborative Research for Updating AI eXpectations), a project for conducting such evaluations regularly. As a first instance, we task an AI agent with developing and publishing a simple iOS application to the Apple App Store. The agent completed the task with only a single avoidable manual intervention, suggesting that open-world evaluations can provide early warning of capabilities that may soon become widespread. We conclude with recommendations for designing and reporting open-world evals.

Via

Access Paper or Ask Questions

FareShare: A Tool for Labor Organizers to Estimate Lost Wages and Contest Arbitrary AI and Algorithmic Deactivations

May 13, 2025

Varun Nagaraj Rao, Samantha Dalal, Andrew Schwartz, Amna Liaqat, Dana Calacci, Andrés Monroy-Hernández

Figure 1 for FareShare: A Tool for Labor Organizers to Estimate Lost Wages and Contest Arbitrary AI and Algorithmic Deactivations

Figure 2 for FareShare: A Tool for Labor Organizers to Estimate Lost Wages and Contest Arbitrary AI and Algorithmic Deactivations

Figure 3 for FareShare: A Tool for Labor Organizers to Estimate Lost Wages and Contest Arbitrary AI and Algorithmic Deactivations

Figure 4 for FareShare: A Tool for Labor Organizers to Estimate Lost Wages and Contest Arbitrary AI and Algorithmic Deactivations

Abstract:What happens when a rideshare driver is suddenly locked out of the platform connecting them to riders, wages, and daily work? Deactivation-the abrupt removal of gig workers' platform access-typically occurs through arbitrary AI and algorithmic decisions with little explanation or recourse. This represents one of the most severe forms of algorithmic control and often devastates workers' financial stability. Recent U.S. state policies now mandate appeals processes and recovering compensation during the period of wrongful deactivation based on past earnings. Yet, labor organizers still lack effective tools to support these complex, error-prone workflows. We designed FareShare, a computational tool automating lost wage estimation for deactivated drivers, through a 6 month partnership with the State of Washington's largest rideshare labor union. Over the following 3 months, our field deployment of FareShare registered 178 account signups. We observed that the tool could reduce lost wage calculation time by over 95%, eliminate manual data entry errors, and enable legal teams to generate arbitration-ready reports more efficiently. Beyond these gains, the deployment also surfaced important socio-technical challenges around trust, consent, and tool adoption in high-stakes labor contexts.

Via

Access Paper or Ask Questions