Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Kexun Zhang

Terminal Wrench: A Dataset of 331 Reward-Hackable Environments and 3,632 Exploit Trajectories

Apr 19, 2026

Ivan Bercovich, Ivgeni Segal, Kexun Zhang, Shashwat Saxena, Aditi Raghunathan, Ziqian Zhong

Abstract:We release Terminal Wrench, a subset of 331 terminal-agent benchmark environments, copied from the popular open benchmarks that are demonstrably reward-hackable. The data set includes 3,632 hack trajectories and 2,352 legitimate baseline trajectories across three frontier models (Claude Opus 4.6, Gemini 3.1 Pro, GPT-5.4). Each entry preserves the original task definition alongside full attack trajectories that show how the verifier was bypassed. It also includes cases where the task was not solved as intended. The tasks span system administration, machine learning, software engineering, and security challenges; the exploits range from simple output spoofing to stack-frame introspection, standard-library patching, and rootkit-style binary hijacking. Crucially, these exploits are specific to each task, rather than the evaluation harness, making them harder to patch. We also present a monitorability study in which hack trajectories are sanitized or stripped of reasoning traces and then scored by an LLM judge, showing that detection degrades meaningfully when chain-of-thought is removed (AUC drops from 0.97 to 0.92). The data set is publicly available at https://github.com/few-sh/terminal-wrench.

Via

Access Paper or Ask Questions

HardTests: Synthesizing High-Quality Test Cases for LLM Coding

May 30, 2025

Zhongmou He, Yee Man Choi, Kexun Zhang, Jiabao Ji, Junting Zhou, Dejia Xu, Ivan Bercovich, Aidan Zhang, Lei Li

Abstract:Verifiers play a crucial role in large language model (LLM) reasoning, needed by post-training techniques such as reinforcement learning. However, reliable verifiers are hard to get for difficult coding problems, because a well-disguised wrong solution may only be detected by carefully human-written edge cases that are difficult to synthesize. To address this issue, we propose HARDTESTGEN, a pipeline for high-quality test synthesis using LLMs. With this pipeline, we curate a comprehensive competitive programming dataset HARDTESTS with 47k problems and synthetic high-quality tests. Compared with existing tests, HARDTESTGEN tests demonstrate precision that is 11.3 percentage points higher and recall that is 17.5 percentage points higher when evaluating LLM-generated code. For harder problems, the improvement in precision can be as large as 40 points. HARDTESTS also proves to be more effective for model training, measured by downstream code generation performance. We will open-source our dataset and synthesis pipeline at https://leililab.github.io/HardTests/.

Via

Access Paper or Ask Questions

Programming by Examples Meets Historical Linguistics: A Large Language Model Based Approach to Sound Law Induction

Jan 27, 2025

Atharva Naik, Darsh Agrawal, Hong Sng, Clayton Marr, Kexun Zhang, Nathaniel R Robinson, Kalvin Chang, Rebecca Byrnes, Aravind Mysore, Carolyn Rose(+1 more)

Figure 1 for Programming by Examples Meets Historical Linguistics: A Large Language Model Based Approach to Sound Law Induction

Figure 2 for Programming by Examples Meets Historical Linguistics: A Large Language Model Based Approach to Sound Law Induction

Figure 3 for Programming by Examples Meets Historical Linguistics: A Large Language Model Based Approach to Sound Law Induction

Figure 4 for Programming by Examples Meets Historical Linguistics: A Large Language Model Based Approach to Sound Law Induction

Abstract:Historical linguists have long written "programs" that convert reconstructed words in an ancestor language into their attested descendants via ordered string rewrite functions (called sound laws) However, writing these programs is time-consuming, motivating the development of automated Sound Law Induction (SLI) which we formulate as Programming by Examples (PBE) with Large Language Models (LLMs) in this paper. While LLMs have been effective for code generation, recent work has shown that PBE is challenging but improvable by fine-tuning, especially with training data drawn from the same distribution as evaluation data. In this paper, we create a conceptual framework of what constitutes a "similar distribution" for SLI and propose four kinds of synthetic data generation methods with varying amounts of inductive bias to investigate what leads to the best performance. Based on the results we create a SOTA open-source model for SLI as PBE (+6% pass rate with a third of the parameters of the second-best LLM) and also highlight exciting future directions for PBE research.

Via

Access Paper or Ask Questions

Embracing AI in Education: Understanding the Surge in Large Language Model Use by Secondary Students

Nov 27, 2024

Tiffany Zhu, Kexun Zhang, William Yang Wang

Figure 1 for Embracing AI in Education: Understanding the Surge in Large Language Model Use by Secondary Students

Figure 2 for Embracing AI in Education: Understanding the Surge in Large Language Model Use by Secondary Students

Figure 3 for Embracing AI in Education: Understanding the Surge in Large Language Model Use by Secondary Students

Figure 4 for Embracing AI in Education: Understanding the Surge in Large Language Model Use by Secondary Students

Abstract:The impressive essay writing and problem-solving capabilities of large language models (LLMs) like OpenAI's ChatGPT have opened up new avenues in education. Our goal is to gain insights into the widespread use of LLMs among secondary students to inform their future development. Despite school restrictions, our survey of over 300 middle and high school students revealed that a remarkable 70% of students have utilized LLMs, higher than the usage percentage among young adults, and this percentage remains consistent across 7th to 12th grade. Students also reported using LLMs for multiple subjects, including language arts, history, and math assignments, but expressed mixed thoughts on their effectiveness due to occasional hallucinations in historical contexts and incorrect answers for lack of rigorous reasoning. The survey feedback called for LLMs better adapted for students, and also raised questions to developers and educators on how to help students from underserved communities leverage LLMs' capabilities for equal access to advanced education resources. We propose a few ideas to address such issues, including subject-specific models, personalized learning, and AI classrooms.

* 6 main pages, 5 figures

Via

Access Paper or Ask Questions

SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement

Oct 29, 2024

Antonis Antoniades, Albert Örwall, Kexun Zhang, Yuxi Xie, Anirudh Goyal, William Wang

Figure 1 for SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement

Figure 2 for SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement

Figure 3 for SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement

Figure 4 for SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement

Abstract:Software engineers operating in complex and dynamic environments must continuously adapt to evolving requirements, learn iteratively from experience, and reconsider their approaches based on new insights. However, current large language model (LLM)-based software agents often rely on rigid processes and tend to repeat ineffective actions without the capacity to evaluate their performance or adapt their strategies over time. To address these challenges, we propose SWE-Search, a multi-agent framework that integrates Monte Carlo Tree Search (MCTS) with a self-improvement mechanism to enhance software agents' performance on repository-level software tasks. SWE-Search extends traditional MCTS by incorporating a hybrid value function that leverages LLMs for both numerical value estimation and qualitative evaluation. This enables self-feedback loops where agents iteratively refine their strategies based on both quantitative numerical evaluations and qualitative natural language assessments of pursued trajectories. The framework includes a SWE-Agent for adaptive exploration, a Value Agent for iterative feedback, and a Discriminator Agent that facilitates multi-agent debate for collaborative decision-making. Applied to the SWE-bench benchmark, our approach demonstrates a 23% relative improvement in performance across five models compared to standard open-source agents without MCTS. Our analysis reveals how performance scales with increased search depth and identifies key factors that facilitate effective self-evaluation in software agents. This work highlights the potential of self-evaluation driven search techniques to enhance agent reasoning and planning in complex, dynamic software engineering environments.

* Main body: 10 pages, 5 figures. Appendix: 5 pages, 4 figures. Open-source codebase

Via

Access Paper or Ask Questions

Scaling LLM Inference with Optimized Sample Compute Allocation

Oct 29, 2024

Kexun Zhang, Shang Zhou, Danqing Wang, William Yang Wang, Lei Li

Figure 1 for Scaling LLM Inference with Optimized Sample Compute Allocation

Figure 2 for Scaling LLM Inference with Optimized Sample Compute Allocation

Figure 3 for Scaling LLM Inference with Optimized Sample Compute Allocation

Figure 4 for Scaling LLM Inference with Optimized Sample Compute Allocation

Abstract:Sampling is a basic operation in many inference-time algorithms of large language models (LLMs). To scale up inference efficiently with a limited compute, it is crucial to find an optimal allocation for sample compute budgets: Which sampling configurations (model, temperature, language, etc.) do we use? How many samples do we generate in each configuration? We formulate these choices as a learning problem and propose OSCA, an algorithm that Optimizes Sample Compute Allocation by finding an optimal mix of different inference configurations. Our experiments show that with our learned mixed allocation, we can achieve accuracy better than the best single configuration with 128x less compute on code generation and 25x less compute on 4 reasoning tasks. OSCA is also shown to be effective in agentic workflows beyond single-turn tasks, achieving a better accuracy on SWE-Bench with 3x less compute than the default configuration. Our code and generations are released at https://github.com/LeiLiLab/OSCA.

Via

Access Paper or Ask Questions

Revealing the Barriers of Language Agents in Planning

Oct 16, 2024

Jian Xie, Kexun Zhang, Jiangjie Chen, Siyu Yuan, Kai Zhang, Yikai Zhang, Lei Li, Yanghua Xiao

Figure 1 for Revealing the Barriers of Language Agents in Planning

Figure 2 for Revealing the Barriers of Language Agents in Planning

Figure 3 for Revealing the Barriers of Language Agents in Planning

Figure 4 for Revealing the Barriers of Language Agents in Planning

Abstract:Autonomous planning has been an ongoing pursuit since the inception of artificial intelligence. Based on curated problem solvers, early planning agents could deliver precise solutions for specific tasks but lacked generalization. The emergence of large language models (LLMs) and their powerful reasoning capabilities has reignited interest in autonomous planning by automatically generating reasonable solutions for given tasks. However, prior research and our experiments show that current language agents still lack human-level planning abilities. Even the state-of-the-art reasoning model, OpenAI o1, achieves only 15.6% on one of the complex real-world planning benchmarks. This highlights a critical question: What hinders language agents from achieving human-level planning? Although existing studies have highlighted weak performance in agent planning, the deeper underlying issues and the mechanisms and limitations of the strategies proposed to address them remain insufficiently understood. In this work, we apply the feature attribution study and identify two key factors that hinder agent planning: the limited role of constraints and the diminishing influence of questions. We also find that although current strategies help mitigate these challenges, they do not fully resolve them, indicating that agents still have a long way to go before reaching human-level intelligence.

* Work in Progress

Via

Access Paper or Ask Questions

Diversity Empowers Intelligence: Integrating Expertise of Software Engineering Agents

Aug 13, 2024

Kexun Zhang, Weiran Yao, Zuxin Liu, Yihao Feng, Zhiwei Liu, Rithesh Murthy, Tian Lan, Lei Li, Renze Lou, Jiacheng Xu(+6 more)

Abstract:Large language model (LLM) agents have shown great potential in solving real-world software engineering (SWE) problems. The most advanced open-source SWE agent can resolve over 27% of real GitHub issues in SWE-Bench Lite. However, these sophisticated agent frameworks exhibit varying strengths, excelling in certain tasks while underperforming in others. To fully harness the diversity of these agents, we propose DEI (Diversity Empowered Intelligence), a framework that leverages their unique expertise. DEI functions as a meta-module atop existing SWE agent frameworks, managing agent collectives for enhanced problem-solving. Experimental results show that a DEI-guided committee of agents is able to surpass the best individual agent's performance by a large margin. For instance, a group of open-source SWE agents, with a maximum individual resolve rate of 27.3% on SWE-Bench Lite, can achieve a 34.3% resolve rate with DEI, making a 25% improvement and beating most closed-source solutions. Our best-performing group excels with a 55% resolve rate, securing the highest ranking on SWE-Bench Lite. Our findings contribute to the growing body of research on collaborative AI systems and their potential to solve complex software engineering challenges.

Via

Access Paper or Ask Questions

Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data

Jul 20, 2024

Antonis Antoniades, Xinyi Wang, Yanai Elazar, Alfonso Amayuelas, Alon Albalak, Kexun Zhang, William Yang Wang

Figure 1 for Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data

Figure 2 for Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data

Figure 3 for Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data

Figure 4 for Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data

Abstract:Despite the proven utility of large language models (LLMs) in real-world applications, there remains a lack of understanding regarding how they leverage their large-scale pretraining text corpora to achieve such capabilities. In this work, we investigate the interplay between generalization and memorization in pretrained LLMs at scale, through a comprehensive $n$-gram analysis of their training data. Our experiments focus on three general task types: translation, question-answering, and multiple-choice reasoning. With various sizes of open-source LLMs and their pretraining corpora, we observe that as the model size increases, the task-relevant $n$-gram pair data becomes increasingly important, leading to improved task performance, decreased memorization, stronger generalization, and emergent abilities. Our results support the hypothesis that LLMs' capabilities emerge from a delicate balance of memorization and generalization with sufficient task-related pretraining data, and point the way to larger-scale analyses that could further improve our understanding of these models.

* ICML FM-Wild workshop version

Via

Access Paper or Ask Questions

Can Large Language Models Code Like a Linguist?: A Case Study in Low Resource Sound Law Induction

Jun 18, 2024

Atharva Naik, Kexun Zhang, Nathaniel Robinson, Aravind Mysore, Clayton Marr, Hong Sng Rebecca Byrnes, Anna Cai, Kalvin Chang, David Mortensen

Figure 1 for Can Large Language Models Code Like a Linguist?: A Case Study in Low Resource Sound Law Induction

Figure 2 for Can Large Language Models Code Like a Linguist?: A Case Study in Low Resource Sound Law Induction

Figure 3 for Can Large Language Models Code Like a Linguist?: A Case Study in Low Resource Sound Law Induction

Figure 4 for Can Large Language Models Code Like a Linguist?: A Case Study in Low Resource Sound Law Induction

Abstract:Historical linguists have long written a kind of incompletely formalized ''program'' that converts reconstructed words in an ancestor language into words in one of its attested descendants that consist of a series of ordered string rewrite functions (called sound laws). They do this by observing pairs of words in the reconstructed language (protoforms) and the descendent language (reflexes) and constructing a program that transforms protoforms into reflexes. However, writing these programs is error-prone and time-consuming. Prior work has successfully scaffolded this process computationally, but fewer researchers have tackled Sound Law Induction (SLI), which we approach in this paper by casting it as Programming by Examples. We propose a language-agnostic solution that utilizes the programming ability of Large Language Models (LLMs) by generating Python sound law programs from sound change examples. We evaluate the effectiveness of our approach for various LLMs, propose effective methods to generate additional language-agnostic synthetic data to fine-tune LLMs for SLI, and compare our method with existing automated SLI methods showing that while LLMs lag behind them they can complement some of their weaknesses.

Via

Access Paper or Ask Questions