Derek Chong

Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace

May 11, 2026

Simon Yu, Derek Chong, Ananjan Nandi, Dilara Soylu, Jiuding Sun, Christopher D Manning, Weiyan Shi

Abstract: We introduce Shepherd, a functional programming model that formalizes meta-agent operations on target agents as functions, with core operations mechanized in Lean. Shepherd records every agent-environment interaction as a typed event in a Git-like execution trace, enabling any past state to be forked and replayed. The system forks the agent process and its filesystem $5\times$ faster than Docker, achieving $>95\%$ prompt-cache reuse on replay. We demonstrate the model through three applications. First, in runtime intervention, a live supervisor increases pair coding pass rates from 28.8% to 54.7% on CooperBench. Second, in counterfactual meta-optimization, branching exploration outperforms baselines across four benchmarks by up to 11 points while reducing wall-clock time by up to 58%. Third, in Tree-RL training, forking rollouts at selected turns improves TerminalBench-2 performance from 34.2% to 39.4%. These results establish Shepherd as an efficient infrastructure for programming meta-agents. We open-source the system to support future research.

* 56 pages, 21 figures, 14 tables
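
The "Git-like execution trace" idea in the abstract can be illustrated with a toy data structure. This is only a sketch under assumptions, not Shepherd's actual API: the `Event` class, its `kind` and `payload` fields, and the `append` and `replay` helpers are all hypothetical names chosen for illustration. Each event points to its parent, so forking from a past state amounts to appending a new event to an older node.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    """One recorded interaction (hypothetical shape, not Shepherd's types)."""
    kind: str                         # e.g. "action" or "observation"
    payload: str
    parent: Optional["Event"] = None  # link to the previous event, like a Git commit


def append(head: Optional[Event], kind: str, payload: str) -> Event:
    """Record a new event on top of `head` and return the new head."""
    return Event(kind, payload, head)


def replay(head: Optional[Event]) -> list:
    """Walk parent links back to the root and return events in original order."""
    events = []
    node = head
    while node is not None:
        events.append(node)
        node = node.parent
    events.reverse()
    return events


# Build a short trace, then fork from an earlier state.
root = append(None, "action", "ls")
e2 = append(root, "observation", "main.py")
main = append(e2, "action", "run tests")
fork = append(e2, "action", "edit main.py")  # branch off the past state e2
```

Because events are immutable and share parents, the two branches reuse the same prefix (`root`, `e2`) rather than copying it, which is the property that makes forking from any past state cheap in this toy model.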

Detecting Label Errors using Pre-Trained Language Models

May 25, 2022

Derek Chong, Jenny Hong, Christopher D. Manning

Abstract: We show that large pre-trained language models are extremely capable of identifying label errors in datasets: simply verifying data points in descending order of out-of-distribution loss significantly outperforms more complex mechanisms for detecting label errors on natural language datasets. We contribute a novel method to produce highly realistic, human-originated label noise from crowdsourced data, and demonstrate the effectiveness of this method on TweetNLP, providing an otherwise difficult-to-obtain measure of realistic recall.
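
The ranking step described in the abstract (check examples in descending order of loss) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `rank_by_loss` function and the toy probabilities are hypothetical, and it assumes the per-class probabilities come from a model that was not trained on the examples being scored.

```python
import math


def rank_by_loss(probs, labels):
    """Return (indices sorted by descending loss, per-example losses).

    probs  : list of per-class probability lists, one per example (assumed
             to come from a model that has not seen these examples)
    labels : list of given (possibly noisy) integer class labels
    """
    eps = 1e-12  # guard against log(0)
    # Cross-entropy of the given label under the model's prediction.
    losses = [-math.log(max(p[y], eps)) for p, y in zip(probs, labels)]
    order = sorted(range(len(labels)), key=lambda i: losses[i], reverse=True)
    return order, losses


# Toy data: example 2 is labeled 0, but the model strongly predicts class 1.
probs = [
    [0.90, 0.10],
    [0.20, 0.80],
    [0.05, 0.95],
    [0.60, 0.40],
]
labels = [0, 1, 0, 0]

order, losses = rank_by_loss(probs, labels)
# Examples at the front of `order` are the first candidates for manual review.
```

A reviewer would then verify labels starting from the front of `order` and stop when the review budget runs out.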
