Abstract:AI agents may soon become capable of autonomously completing valuable, long-horizon tasks in diverse domains. Current benchmarks either do not measure real-world tasks, or are not sufficiently difficult to meaningfully measure frontier models. To this end, we present Terminal-Bench 2.0: a carefully curated hard benchmark composed of 89 tasks in computer terminal environments inspired by problems from real workflows. Each task features a unique environment, human-written solution, and comprehensive tests for verification. We show that frontier models and agents score less than 65\% on the benchmark and conduct an error analysis to identify areas for model and agent improvement. We publish the dataset and evaluation harness to assist developers and researchers in future work at https://www.tbench.ai/ .
Abstract:Current evaluation of web agents largely reduces to binary success metrics or conformity to a single reference trajectory, ignoring the structural diversity present in benchmark datasets. We present WebGraphEval, a framework that abstracts trajectories from multiple agents into a unified, weighted action graph. This representation is directly compatible with benchmarks such as WebArena, leveraging leaderboard runs and newly collected trajectories without modifying environments. The framework canonically encodes actions, merges recurring behaviors, and applies structural analyses including reward propagation and success-weighted edge statistics. Evaluations across thousands of trajectories from six web agents show that the graph abstraction captures cross-model regularities, highlights redundancy and inefficiency, and identifies critical decision points overlooked by outcome-based metrics. By framing web interaction as graph-structured data, WebGraphEval establishes a general methodology for multi-path, cross-agent, and efficiency-aware evaluation of web agents.



Abstract:Federated Learning(FL) is a privacy-preserving machine learning paradigm where a global model is trained in-situ across a large number of distributed edge devices. These systems are often comprised of millions of user devices and only a subset of available devices can be used for training in each epoch. Designing a device selection strategy is challenging, given that devices are highly heterogeneous in both their system resources and training data. This heterogeneity makes device selection very crucial for timely model convergence and sufficient model accuracy. To tackle the FL client heterogeneity problem, various client selection algorithms have been developed, showing promising performance improvement in terms of model coverage and accuracy. In this work, we study the overhead of client selection algorithms in a large scale FL environment. Then we propose an efficient data distribution summary calculation algorithm to reduce the overhead in a real-world large scale FL environment. The evaluation shows that our proposed solution could achieve up to 30x reduction in data summary time, and up to 360x reduction in clustering time.
Abstract:This study introduces a novel approach for analyzing and modifying entity relationships in GPT models, diverging from ROME's entity-focused methods. We develop a relation tracing technique to understand the influence of language model computations on relationship judgments. Using the FewRel dataset, we identify key roles of MLP modules and attention mechanisms in processing relationship information. Our method, tested against ROME on a new dataset, shows improved balance in specificity and generalization, underscoring the potential of manipulating early-layer modules for enhanced model understanding and accuracy.