I2M, FRESNEL, TCLS, AMU
Abstract: Traditional customer support systems, such as Interactive Voice Response (IVR), rely on rigid scripts and lack the flexibility required for handling complex, policy-driven tasks. While large language model (LLM) agents offer a promising alternative, evaluating their ability to act in accordance with business rules and real-world support workflows remains an open challenge. Existing benchmarks primarily focus on tool usage or task completion, overlooking an agent's capacity to adhere to multi-step policies, navigate task dependencies, and remain robust to unpredictable user or environment behavior. In this work, we introduce JourneyBench, a benchmark designed to assess policy-aware agents in customer support. JourneyBench leverages graph representations to generate diverse, realistic support scenarios and proposes the User Journey Coverage Score, a novel metric to measure policy adherence. We evaluate multiple state-of-the-art LLMs using two agent designs: a Static-Prompt Agent (SPA) and a Dynamic-Prompt Agent (DPA) that explicitly models policy control. Across 703 conversations in three domains, we show that DPA significantly boosts policy adherence, even allowing smaller models like GPT-4o-mini to outperform more capable ones like GPT-4o. Our findings demonstrate the importance of structured orchestration and establish JourneyBench as a critical resource to advance AI-driven customer support beyond IVR-era limitations.
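As a rough illustration of how a journey-coverage style metric could be computed, the following sketch scores a conversation by the fraction of required policy transitions the agent actually traversed. The actual User Journey Coverage Score is defined in the paper; the step names, the path-based policy encoding, and the scoring rule below are assumptions made only for this sketch.

```python
from typing import List


def journey_coverage(expected_path: List[str], agent_steps: List[str]) -> float:
    """Toy stand-in for a journey-coverage style metric: the fraction of
    required policy transitions (consecutive pairs in the reference journey)
    that also appear as consecutive steps in the agent's conversation."""
    expected = list(zip(expected_path, expected_path[1:]))
    taken = set(zip(agent_steps, agent_steps[1:]))
    if not expected:
        return 1.0
    hit = sum(1 for edge in expected if edge in taken)
    return hit / len(expected)


# Hypothetical refund workflow the policy requires the agent to follow.
expected = ["greet", "verify_identity", "check_eligibility", "issue_refund"]

# A simulated conversation in which the agent skipped the eligibility check.
agent = ["greet", "verify_identity", "issue_refund"]

print(journey_coverage(expected, agent))  # ~0.33: only 1 of 3 required transitions covered
```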
Abstract: Tracking multiple particles in noisy and cluttered scenes remains challenging due to a combinatorial explosion of trajectory hypotheses, which scales super-exponentially with the number of particles and frames. The transformer architecture has shown significantly improved robustness to this high combinatorial load. However, its performance still falls short of conventional Bayesian filtering approaches in scenarios with a reduced set of trajectory hypotheses. This suggests that while transformers excel at narrowing down possible associations, they may not reach the optimality of the Bayesian approach in locally sparse scenarios. Hence, we introduce a hybrid tracking framework that combines the ability of self-attention to learn the underlying representation of particle behavior with the reliability and interpretability of Bayesian filtering. We perform trajectory-to-detection association by solving a label prediction problem, using a transformer encoder to infer soft associations between detections across frames. This prunes the hypothesis set, enabling efficient multiple-particle tracking within a Bayesian filtering framework. Our approach demonstrates improved tracking accuracy and robustness against spurious detections, offering a solution for high-clutter multiple-particle tracking scenarios.
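To make the soft-association step concrete, the sketch below shows one plausible realization: a transformer encoder attends jointly over detections from two consecutive frames, and a dot-product head scores every (frame t, frame t+1) detection pair; thresholding the resulting soft associations prunes the hypothesis set handed to the Bayesian filter. The feature dimension, model size, frame tagging, and pruning threshold are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class SoftAssociation(nn.Module):
    """Minimal sketch of a transformer-based soft-association head.

    Detections from frame t and frame t+1 are embedded, tagged with a learned
    frame embedding, encoded jointly with self-attention, and scored pairwise.
    """

    def __init__(self, in_dim: int = 3, d_model: int = 64, nhead: int = 4, layers: int = 2):
        super().__init__()
        self.embed = nn.Linear(in_dim, d_model)
        self.frame_tag = nn.Embedding(2, d_model)  # marks which frame a detection belongs to
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, det_t: torch.Tensor, det_t1: torch.Tensor) -> torch.Tensor:
        # det_t: (B, N, in_dim) detections at frame t; det_t1: (B, M, in_dim) at frame t+1
        n = det_t.shape[1]
        x = torch.cat([self.embed(det_t), self.embed(det_t1)], dim=1)
        tags = torch.cat([
            torch.zeros(n, dtype=torch.long, device=x.device),
            torch.ones(det_t1.shape[1], dtype=torch.long, device=x.device),
        ])
        x = self.encoder(x + self.frame_tag(tags))
        # Pairwise scores between frame-t and frame-t+1 detections,
        # normalized over the frame-t+1 candidates of each frame-t detection.
        logits = torch.einsum("bnd,bmd->bnm", x[:, :n], x[:, n:])
        return logits.softmax(dim=-1)


# Toy usage: 5 detections at frame t, 6 at frame t+1, features = (x, y, intensity).
model = SoftAssociation()
assoc = model(torch.rand(1, 5, 3), torch.rand(1, 6, 3))

# Keep only likely pairs as hypotheses for the downstream Bayesian filter
# (the 0.2 cut-off is an arbitrary illustrative threshold).
hypotheses = (assoc > 0.2).nonzero()
```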