Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Kyoung Whan Choe

ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces

Apr 06, 2026

Xiangyi Li, Kyoung Whan Choe, Yimin Liu, Xiaokun Chen, Chujun Tao, Bingran You, Wenbo Chen, Zonglin Di, Jiankai Sun, Shenghan Zheng(+5 more)

Abstract:Large language model (LLM) agents are increasingly deployed to automate productivity tasks (e.g., email, scheduling, document management), but evaluating them on live services is risky due to potentially irreversible changes. Existing benchmarks rely on simplified environments and fail to capture realistic, stateful, multi-service workflows. We introduce ClawsBench, a benchmark for evaluating and improving LLM agents in realistic productivity settings. It includes five high-fidelity mock services (Gmail, Slack, Google Calendar, Google Docs, Google Drive) with full state management and deterministic snapshot/restore, along with 44 structured tasks covering single-service, cross-service, and safety-critical scenarios. We decompose agent scaffolding into two independent levers (domain skills that inject API knowledge via progressive disclosure, and a meta prompt that coordinates behavior across services) and vary both to measure their separate and combined effects. Experiments across 6 models, 4 agent harnesses, and 33 conditions show that with full scaffolding, agents achieve task success rates of 39-64% but exhibit unsafe action rates of 7-33%. On OpenClaw, the top five models fall within a 10 percentage-point band on task success (53-63%), with unsafe action rates from 7% to 23% and no consistent ordering between the two metrics. We identify eight recurring patterns of unsafe behavior, including multi-step sandbox escalation and silent contract modification.

* 25 pages, 5 figures

Via

Access Paper or Ask Questions

Results of the NeurIPS 2023 Neural MMO Competition on Multi-task Reinforcement Learning

Aug 17, 2025

Joseph Suárez, Kyoung Whan Choe, David Bloomin, Jianming Gao, Yunkun Li, Yao Feng, Saidinesh Pola, Kun Zhang, Yonghui Zhu, Nikhil Pinnaparaju(+15 more)

Abstract:We present the results of the NeurIPS 2023 Neural MMO Competition, which attracted over 200 participants and submissions. Participants trained goal-conditional policies that generalize to tasks, maps, and opponents never seen during training. The top solution achieved a score 4x higher than our baseline within 8 hours of training on a single 4090 GPU. We open-source everything relating to Neural MMO and the competition under the MIT license, including the policy weights and training code for our baseline and for the top submissions.

Via

Access Paper or Ask Questions

Massively Multiagent Minigames for Training Generalist Agents

Jun 07, 2024

Kyoung Whan Choe, Ryan Sullivan, Joseph Suárez

Figure 1 for Massively Multiagent Minigames for Training Generalist Agents

Figure 2 for Massively Multiagent Minigames for Training Generalist Agents

Figure 3 for Massively Multiagent Minigames for Training Generalist Agents

Figure 4 for Massively Multiagent Minigames for Training Generalist Agents

Abstract:We present Meta MMO, a collection of many-agent minigames for use as a reinforcement learning benchmark. Meta MMO is built on top of Neural MMO, a massively multiagent environment that has been the subject of two previous NeurIPS competitions. Our work expands Neural MMO with several computationally efficient minigames. We explore generalization across Meta MMO by learning to play several minigames with a single set of weights. We release the environment, baselines, and training code under the MIT license. We hope that Meta MMO will spur additional progress on Neural MMO and, more generally, will serve as a useful benchmark for many-agent generalization.

Via

Access Paper or Ask Questions

Neural MMO 2.0: A Massively Multi-task Addition to Massively Multi-agent Learning

Nov 07, 2023

Joseph Suárez, Phillip Isola, Kyoung Whan Choe, David Bloomin, Hao Xiang Li, Nikhil Pinnaparaju, Nishaanth Kanna, Daniel Scott, Ryan Sullivan, Rose S. Shuman(+8 more)

Figure 1 for Neural MMO 2.0: A Massively Multi-task Addition to Massively Multi-agent Learning

Figure 2 for Neural MMO 2.0: A Massively Multi-task Addition to Massively Multi-agent Learning

Abstract:Neural MMO 2.0 is a massively multi-agent environment for reinforcement learning research. The key feature of this new version is a flexible task system that allows users to define a broad range of objectives and reward signals. We challenge researchers to train agents capable of generalizing to tasks, maps, and opponents never seen during training. Neural MMO features procedurally generated maps with 128 agents in the standard setting and support for up to. Version 2.0 is a complete rewrite of its predecessor with three-fold improved performance and compatibility with CleanRL. We release the platform as free and open-source software with comprehensive documentation available at neuralmmo.github.io and an active community Discord. To spark initial research on this new platform, we are concurrently running a competition at NeurIPS 2023.

Via

Access Paper or Ask Questions