Segev Shlomov

Agent Mentor: Framing Agent Knowledge through Semantic Trajectory Analysis

Apr 12, 2026

Roi Ben-Gigi, Yuval David, Fabiana Fournier, Lior Limonad, Dany Moshkovich, Hadar Mulian, Segev Shlomov

Abstract:AI agent development relies heavily on natural language prompting to define agents' tasks, knowledge, and goals. These prompts are interpreted by Large Language Models (LLMs), which govern agent behavior. Consequently, agentic performance is susceptible to variability arising from imprecise or ambiguous prompt formulations. Identifying and correcting such issues requires examining not only the agent's code, but also the internal system prompts generated throughout its execution lifecycle, as reflected in execution logs. In this work, we introduce an analytics pipeline implemented as part of the Agent Mentor open-source library that monitors and incrementally adapts the system prompts defining another agent's behavior. The pipeline improves performance by systematically injecting corrective instructions into the agent's knowledge. We describe its underlying mechanism, with particular emphasis on identifying semantic features associated with undesired behaviors and using them to derive corrective statements. We evaluate the proposed pipeline across three exemplar agent configurations and benchmark tasks using repeated execution runs to assess effectiveness. These experiments provide an initial exploration of automating such a mentoring pipeline within future agentic governance frameworks. Overall, the approach demonstrates consistent and measurable accuracy improvements across diverse configurations, particularly in settings dominated by specification ambiguity. For reproducibility, we released our code as open source under the Agent Mentor library.

* 10 pages, 5 figures

General Agent Evaluation

Feb 26, 2026

Elron Bandel, Asaf Yehudai, Lilach Eden, Yehoshua Sagron, Yotam Perlitz, Elad Venezian, Natalia Razinkov, Natan Ergas, Shlomit Shachor Ifergan, Segev Shlomov (+5 more)

Abstract:The promise of general-purpose agents - systems that perform tasks in unfamiliar environments without domain-specific engineering - remains largely unrealized. Existing agents are predominantly specialized, and while emerging implementations like OpenAI SDK Agent and Claude Code hint at broader capabilities, no systematic evaluation of their general performance has been pursued. Current agentic benchmarks assume domain-specific integration, encoding task information in ways that preclude fair evaluation of general agents. This paper frames general-agent evaluation as a first-class research objective. We propose conceptual principles for such evaluation, a Unified Protocol enabling agent-benchmark integration, and Exgentic - a practical framework for general agent evaluation. We benchmark five prominent agent implementations across six environments as the first Open General Agent Leaderboard. Our experiments show that general agents generalize across diverse environments, achieving performance comparable to domain-specific agents without any environment-specific tuning. We release our evaluation protocol, framework, and leaderboard to establish a foundation for systematic research on general-purpose agents.

TabAgent: A Framework for Replacing Agentic Generative Components with Tabular-Textual Classifiers

Feb 18, 2026

Ido Levy, Eilam Shapira, Yinon Goldshtein, Avi Yaeli, Nir Mashkif, Segev Shlomov

Abstract:Agentic systems, AI architectures that autonomously execute multi-step workflows to achieve complex goals, are often built using repeated large language model (LLM) calls for closed-set decision tasks such as routing, shortlisting, gating, and verification. While convenient, this design makes deployments slow and expensive due to cumulative latency and token usage. We propose TabAgent, a framework for replacing generative decision components in closed-set selection tasks with a compact textual-tabular classifier trained on execution traces. TabAgent (i) extracts structured schema, state, and dependency features from trajectories (TabSchema), (ii) augments coverage with schema-aligned synthetic supervision (TabSynth), and (iii) scores candidates with a lightweight classifier (TabHead). On the long-horizon AppWorld benchmark, TabAgent maintains task-level success while eliminating shortlist-time LLM calls, reducing latency by approximately 95% and inference cost by 85-91%. Beyond tool shortlisting, TabAgent generalizes to other agentic decision heads, establishing a paradigm for learned discriminative replacements of generative bottlenecks in production agent architectures.

Textual Planning with Explicit Latent Transitions

Feb 04, 2026

Eliezer Shlomi, Ido Levy, Eilam Shapira, Michael Katz, Guy Uziel, Segev Shlomov, Nir Mashkif, Roi Reichart, Sarah Keren

Abstract:Planning with LLMs is bottlenecked by token-by-token generation and repeated full forward passes, making multi-step lookahead and rollout-based search expensive in latency and compute. We propose EmbedPlan, which replaces autoregressive next-state generation with a lightweight transition model operating in a frozen language embedding space. EmbedPlan encodes natural language state and action descriptions into vectors, predicts the next-state embedding, and retrieves the next state by nearest-neighbor similarity, enabling fast planning computation without fine-tuning the encoder. We evaluate next-state prediction across nine classical planning domains using six evaluation protocols of increasing difficulty: interpolation, plan-variant, extrapolation, multi-domain, cross-domain, and leave-one-out. Results show near-perfect interpolation performance but a sharp degradation when generalization requires transfer to unseen problems or unseen domains; plan-variant evaluation indicates generalization to alternative plans rather than memorizing seen trajectories. Overall, frozen embeddings support within-domain dynamics learning after observing a domain's transitions, while transfer across domain boundaries remains a bottleneck.
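The retrieval step described in this abstract (predict a next-state embedding, then look up the closest known state) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy three-dimensional vectors, and the choice of cosine similarity are all assumptions made for illustration, and the transition model that would produce the predicted vector is omitted.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_next_state(predicted, candidates):
    """Return the candidate state whose embedding is most similar to the prediction.

    `candidates` maps a state description to its (hypothetical) frozen embedding.
    """
    return max(candidates, key=lambda state: cosine(predicted, candidates[state]))

# Toy, hand-written embeddings; a real system would obtain these from a frozen encoder.
candidates = {
    "block A on table": [1.0, 0.0, 0.0],
    "block A on block B": [0.0, 1.0, 0.0],
    "holding block A": [0.0, 0.0, 1.0],
}

# Stand-in for the output of a learned transition model (assumed, not computed here).
predicted = [0.1, 0.9, 0.2]

best = retrieve_next_state(predicted, candidates)
print(best)
```

Because the encoder stays frozen in this setup, only the transition model would need training; the lookup itself is a plain similarity search over states already seen.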

ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents

Oct 10, 2024

Ido Levy, Ben Wiesel, Sami Marreed, Alon Oved, Avi Yaeli, Segev Shlomov

Abstract:Recent advancements in LLM-based web agents have introduced novel architectures and benchmarks showcasing progress in autonomous web navigation and interaction. However, most existing benchmarks prioritize effectiveness and accuracy, overlooking crucial factors like safety and trustworthiness which are essential for deploying web agents in enterprise settings. The risks of unsafe web agent behavior, such as accidentally deleting user accounts or performing unintended actions in critical business operations, pose significant barriers to widespread adoption. In this paper, we present ST-WebAgentBench, a new online benchmark specifically designed to evaluate the safety and trustworthiness of web agents in enterprise contexts. This benchmark is grounded in a detailed framework that defines safe and trustworthy (ST) agent behavior, outlines how ST policies should be structured and introduces the Completion under Policies metric to assess agent performance. Our evaluation reveals that current SOTA agents struggle with policy adherence and cannot yet be relied upon for critical business applications. Additionally, we propose architectural principles aimed at improving policy awareness and compliance in web agents. We open-source this benchmark and invite the community to contribute, with the goal of fostering a new generation of safer, more trustworthy AI agents. All code, data, environment reproduction resources, and video demonstrations are available at https://sites.google.com/view/st-webagentbench/home.

From Grounding to Planning: Benchmarking Bottlenecks in Web Agents

Sep 03, 2024

Segev Shlomov, Ben Wiesel, Aviad Sela, Ido Levy, Liane Galanti, Roy Abitbol

Abstract:General web-based agents are increasingly essential for interacting with complex web environments, yet their performance in real-world web applications remains poor, yielding extremely low accuracy even with state-of-the-art frontier models. We observe that these agents can be decomposed into two primary components: Planning and Grounding. Yet, most existing research treats these agents as black boxes, focusing on end-to-end evaluations which hinder meaningful improvements. We sharpen the distinction between the planning and grounding components and conduct a novel analysis by refining experiments on the Mind2Web dataset. Our work proposes a new benchmark for each of the components separately, identifying the bottlenecks and pain points that limit agent performance. Contrary to prevalent assumptions, our findings suggest that grounding is not a significant bottleneck and can be effectively addressed with current techniques. Instead, the primary challenge lies in the planning component, which is the main source of performance degradation. Through this analysis, we offer new insights and demonstrate practical suggestions for improving the capabilities of web agents, paving the way for more reliable agents.

SNAP: Semantic Stories for Next Activity Prediction

Jan 28, 2024

Alon Oved, Segev Shlomov, Sergey Zeltyn, Nir Mashkif, Avi Yaeli

Abstract:Predicting the next activity in an ongoing process is one of the most common classification tasks in the business process management (BPM) domain. It allows businesses to optimize resource allocation, enhance operational efficiency, and aid in risk mitigation and strategic decision-making. This provides a competitive edge in the rapidly evolving confluence of BPM and AI. Existing state-of-the-art AI models for business process prediction do not fully capitalize on available semantic information within process event logs. As current advanced AI-BPM systems provide semantically-richer textual data, the need for novel adequate models grows. To address this gap, we propose the novel SNAP method that leverages language foundation models by constructing semantic contextual stories from the process historical event logs and using them for the next activity prediction. We compared the SNAP algorithm with nine state-of-the-art models on six benchmark datasets and show that SNAP significantly outperforms them, especially for datasets with high levels of semantic content.

Mimicking the Maestro: Exploring the Efficacy of a Virtual AI Teacher in Fine Motor Skill Acquisition

Oct 16, 2023

Hadar Mulian, Segev Shlomov, Lior Limonad

Abstract:Motor skills, especially fine motor skills like handwriting, play an essential role in academic pursuits and everyday life. Traditional methods to teach these skills, although effective, can be time-consuming and inconsistent. With the rise of advanced technologies like robotics and artificial intelligence, there is increasing interest in automating such teaching processes using these technologies, via human-robot and human-computer interactions. In this study, we examine the potential of a virtual AI teacher in emulating the techniques of human educators for motor skill acquisition. We introduce an AI teacher model that captures the distinct characteristics of human instructors. Using a Reinforcement Learning environment tailored to mimic teacher-learner interactions, we tested our AI model against four guiding hypotheses, emphasizing improved learner performance, enhanced rate of skill acquisition, and reduced variability in learning outcomes. Our findings, validated on synthetic learners, revealed significant improvements across all tested hypotheses. Notably, our model showcased robustness across different learners and settings and demonstrated adaptability to handwriting. This research underscores the potential of integrating Reinforcement Learning and Imitation Learning models with robotics in revolutionizing the teaching of critical motor skills.

* 17 pages, 3 figures

Enhancing Trust in LLM-Based AI Automation Agents: New Considerations and Future Challenges

Aug 10, 2023

Sivan Schwartz, Avi Yaeli, Segev Shlomov

Abstract:Trust in AI agents has been extensively studied in the literature, resulting in significant advancements in our understanding of this field. However, the rapid advancements in Large Language Models (LLMs) and the emergence of LLM-based AI agent frameworks pose new challenges and opportunities for further research. In the field of process automation, a new generation of AI-based agents has emerged, enabling the execution of complex tasks. At the same time, the process of building automation has become more accessible to business users via user-friendly no-code tools and training mechanisms. This paper explores these new challenges and opportunities, analyzes the main aspects of trust in AI agents discussed in existing literature, and identifies specific considerations and challenges relevant to this new generation of automation agents. We also evaluate how nascent products in this category address these considerations. Finally, we highlight several challenges that the research community should address in this evolving landscape.

* Accepted to the First International Workshop on the Future of No-Code Digital Apprentices

Prescriptive Process Monitoring in Intelligent Process Automation with Chatbot Orchestration

Dec 13, 2022

Sergey Zeltyn, Segev Shlomov, Avi Yaeli, Alon Oved

Abstract:Business processes that involve AI-powered automation have been gaining importance and market share in recent years. These business processes combine the characteristics of classical business process management, goal-driven chatbots, conversational recommendation systems, and robotic process automation. In the new context, prescriptive process monitoring demands innovative approaches. Unfortunately, data logs from these new processes are still not available in the public domain. We describe the main challenges in this new domain and introduce a synthesized dataset that is based on an actual use case of intelligent process automation with chatbot orchestration. Using this dataset, we demonstrate crowd-wisdom and goal-driven approaches to prescriptive process monitoring.

* IJCAI 2022 Workshop on Process Management in the AI era (PMAI)
