Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yanwei Cui

Closing the Feedback Loop: From Experience Extraction to Insight Governance in Verbal Reinforcement Learning

Jun 16, 2026

Yanwei Cui, Xing Zhang, Yulong Zhang, Li Shao, Xiaofeng Shi, Guanghui Wang, Peiyang He

Abstract:Training-free verbal reinforcement learning enables LLM agents to learn from world feedback -- objective signals such as dynamic task outcomes, market returns, or demand forecasts -- by extracting verbal rules from experience and injecting them as context, updating the agent's behavior without parameter changes. However, in non-stationary environments these agents face a retention-forgetting dilemma: retaining stale insights causes negative transfer, while discarding them causes catastrophic forgetting when conditions recur. We identify four requirements for navigating this dilemma -- outcome-driven evaluation, persistent structured evidence, non-monotonic knowledge lifecycle, and compositional governance -- and show that existing methods invest heavily in experience extraction while underinvesting in insight governance. We propose a three-layer architecture -- rules, evidence, and skills -- connected by a feedback-driven curation loop that closes the governance gap. Rules capture distilled experience from world outcomes; evidence logs track each rule's reliability across episodes; skills govern which rules to apply, how to resolve conflicts, and when to abstain. On financial forecasting as a case study, where world feedback is naturally abundant, noisy, and non-stationary, we show that the same accumulated experience either degrades performance below the zero-shot baseline or dramatically improves accuracy and risk-adjusted returns, depending on whether the curation loop is present.

* Accepted to the ICML 2026 RLxF: Reinforcement Learning from World Feedback Workshop, RLxF@ICML 2026, Seoul, South Korea

Via

Access Paper or Ask Questions

Ratchet: A Minimal Hygiene Recipe for Self-Evolving LLM Agents

May 21, 2026

Xing Zhang, Yanwei Cui, Guanghui Wang, Ziyuan Li, Wei Qiu, Bing Zhu, Peiyang He

Abstract:Self-evolving skill libraries, pioneered by Voyager, let frozen LLM agents accumulate reusable knowledge without weight updates, yet recent evaluation shows that LLM-authored skills deliver $+0.0$pp over no-skill baselines while human-curated ones deliver $+16.2$pp: the bottleneck is not skill authoring but lifecycle management. We introduce \textbf{Ratchet}, a single-agent loop in which a frozen LLM writes, retrieves, curates, and retires its own natural-language skills. Ratchet integrates four candidate hygiene mechanisms: outcome-driven retirement, a bounded active-cap, meta-skill authoring guidance, and pattern canonicalisation. On MBPP+ hard-100 with Claude Opus 4.7, Ratchet lifts held-out pass@1 from a $0.258 \pm 0.047$ baseline to a late-window rolling mean of $0.584$ (peak $0.658 \pm 0.042$) across 100 rounds and 3 seeds, a $+0.328 \pm 0.018$ rolling-mean gain where the no-skill control drifts at $+0.002 \pm 0.005$; the same recipe transfers to an agentic solver on SWE-bench Verified ($+0.22$ peak lift over 20 rounds). Eight ablations (A1--A8) reveal that the minimal working recipe is smaller than our design suggests: retirement and the meta-skill authoring prior are load-bearing, while explicit deduplication (canonicalisation, cover-guard) is subsumed by the meta-skill itself. A non-divergence proposition shows that bounded cap and retirement threshold together prevent expected performance from drifting below the no-skills floor.

* 16 pages, 2 figures, 6 tables. Extends arXiv:2605.19576 with the SWE-bench Verified evaluation and a non-divergence analysis (Proposition 1)

Via

Access Paper or Ask Questions

Library Drift: Diagnosing and Fixing a Silent Failure Mode in Self-Evolving LLM Skill Libraries

May 19, 2026

Xing Zhang, Yanwei Cui, Guanghui Wang, Ziyuan Li, Wei Qiu, Bing Zhu, Peiyang He

Abstract:Self-evolving skill libraries face a silent failure mode we term \emph{library drift}: unbounded skill accumulation without outcome-driven lifecycle management causes retrieval degradation, false-positive injections, and performance stagnation. Recent evaluation confirms the symptom--LLM-authored skills deliver +0.0pp gain while human-curated ones deliver +16.2pp (SkillsBench)--yet the underlying mechanism has not been isolated. We provide (1) a reproducible trigger: ablations that isolate drift--one disables skill injection (flat floor, +0.002), one imposes premature retirement (active harm, $-$0.019); (2) trace-level diagnostics: an append-only evidence log with per-skill contribution scores, attribution verdicts, and router engagement metrics that make the failure visible before it reaches end-task scores; and (3) a verified fix: a minimal governance recipe (outcome-driven retirement + bounded active-cap + meta-skill authoring prior) that lifts held-out pass@1 from a 0.258 baseline to a late-window mean of 0.584 (rolling gain $+$0.328) on MBPP+ hard-100 over 100 rounds. Eight ablations decompose which governance mechanisms are load-bearing and which are subsumed, providing a concrete playbook for diagnosing library drift in any self-evolving agent.

Via

Access Paper or Ask Questions

Hindsight Preference Optimization for Financial Time Series Advisory

Apr 27, 2026

Yanwei Cui, Guanghui Wang, Xing Zhang, Peiyang He, Ziyuan Li, Bing Zhu, Wei Qiu, Xusheng Wang, Zheng Yu, Anqi Xin

Abstract:Time series models predict numbers; decision-makers need advisory -- directional signals with reasoning, actionable suggestions, and risk management. Training language models for such predictive advisory faces a fundamental challenge: quality depends on outcomes unknown at prediction time. We bridge two ideas from reinforcement learning -- using information unavailable during execution to retrospectively generate training signal, and preference alignment -- and propose Hindsight Preference Optimization: observed outcomes let an LLM judge rank candidate advisories on dimensions that scalar metrics cannot capture, producing preference pairs for DPO without human annotation. We apply this to Vision-Language-Model-based predictive advisories on S&P 500 equity time series, demonstrated by a 4B model outperforming its 235B teacher on both accuracy and advisory quality.

* Accepted at ICLR 2026 TSALM Workshop

Via

Access Paper or Ask Questions

Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems

Apr 16, 2026

Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, Peiyang He

Abstract:Prompt optimization in compound AI systems is statistically indistinguishable from a coin flip: across 72 optimization runs on Claude Haiku (6 methods $\times$ 4 tasks $\times$ 3 repeats), 49% score below zero-shot; on Amazon Nova Lite, the failure rate is even higher. Yet on one task, all six methods improve over zero-shot by up to $+6.8$ points. What distinguishes success from failure? We investigate with 18,000 grid evaluations and 144 optimization runs, testing two assumptions behind end-to-end optimization tools like TextGrad and DSPy: (A) individual prompts are worth optimizing, and (B) agent prompts interact, requiring joint optimization. Interaction effects are never significant ($p > 0.52$, all $F < 1.0$), and optimization helps only when the task has exploitable output structure -- a format the model can produce but does not default to. We provide a two-stage diagnostic: an \$80 ANOVA pre-test for agent coupling, and a 10-minute headroom test that predicts whether optimization is worthwhile -- turning a coin flip into an informed decision.

Via

Access Paper or Ask Questions

Do Agent Rules Shape or Distort? Guardrails Beat Guidance in Coding Agents

Apr 13, 2026

Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, Peiyang He

Abstract:Developers increasingly guide AI coding agents through natural language instruction files (e.g., CLAUDE.md, .cursorrules), yet no controlled study has measured whether these rules actually improve agent performance or which properties make a rule beneficial. We scrape 679 such files (25,532 rules) from GitHub and conduct the first large-scale empirical evaluation, running over 5,000 agent runs with a state-of-the-art coding agent on SWE-bench Verified. Rules improve performance by 7--14 percentage points, but random rules help as much as expert-curated ones -- suggesting rules work through context priming rather than specific instruction. Negative constraints ("do not refactor unrelated code") are the only individually beneficial rule type, while positive directives ("follow code style") actively hurt -- a pattern we analyze through the lens of potential-based reward shaping (PBRS). Moreover, individual rules are mostly harmful in isolation yet collectively helpful, with no degradation up to 50 rules. These findings expose a hidden reliability risk -- well-intentioned rules routinely degrade agent performance -- and provide a clear principle for safe agent configuration: constrain what agents must not do, rather than prescribing what they should.

Via

Access Paper or Ask Questions

Verified Multi-Agent Orchestration: A Plan-Execute-Verify-Replan Framework for Complex Query Resolution

Mar 12, 2026

Xing Zhang, Yanwei Cui, Guanghui Wang, Qucy Wei Qiu, Ziyuan Li, Fangwei Han, Yajing Huang, Hengzhi Qiu, Bin Zhu, Peiyang He

Abstract:We present Verified Multi-Agent Orchestration (VMAO), a framework that coordinates specialized LLM-based agents through a verification-driven iterative loop. Given a complex query, our system decomposes it into a directed acyclic graph (DAG) of sub-questions, executes them through domain-specific agents in parallel, verifies result completeness via LLM-based evaluation, and adaptively replans to address gaps. The key contributions are: (1) dependency-aware parallel execution over a DAG of sub-questions with automatic context propagation, (2) verification-driven adaptive replanning that uses an LLM-based verifier as an orchestration-level coordination signal, and (3) configurable stop conditions that balance answer quality against resource usage. On 25 expert-curated market research queries, VMAO improves answer completeness from 3.1 to 4.2 and source quality from 2.6 to 4.1 (1-5 scale) compared to a single-agent baseline, demonstrating that orchestration-level verification is an effective mechanism for multi-agent quality assurance.

* ICLR 2026 Workshop on MALGAI

Via

Access Paper or Ask Questions

Understand customer reviews with less data and in short time: pretrained language representation and active learning

Oct 29, 2019

Yanwei Cui, Xavier Illy

Figure 1 for Understand customer reviews with less data and in short time: pretrained language representation and active learning

Figure 2 for Understand customer reviews with less data and in short time: pretrained language representation and active learning

Figure 3 for Understand customer reviews with less data and in short time: pretrained language representation and active learning

Figure 4 for Understand customer reviews with less data and in short time: pretrained language representation and active learning

Abstract:In this paper, we address customer review understanding problems by using supervised machine learning approaches, in order to achieve a fully automatic review aspects categorisation and sentiment analysis. In general, such supervised learning algorithms require domain-specific expert knowledge for generating high quality labeled training data, and the cost of labeling can be very high. To achieve an in-production customer review machine learning enabled analysis tool with only a limited amount of data and within a reasonable training data collection time, we propose to use pre-trained language representation to boost model performance and active learning framework for accelerating the iterative training process. The results show that with integration of both components, the fully automatic review analysis can be achieved at a much faster pace.

Via

Access Paper or Ask Questions

Modelling customer online behaviours with neural networks: applications to conversion prediction and advertising retargeting

Apr 20, 2018

Yanwei Cui, Rogatien Tobossi, Olivia Vigouroux

Figure 1 for Modelling customer online behaviours with neural networks: applications to conversion prediction and advertising retargeting

Figure 2 for Modelling customer online behaviours with neural networks: applications to conversion prediction and advertising retargeting

Figure 3 for Modelling customer online behaviours with neural networks: applications to conversion prediction and advertising retargeting

Figure 4 for Modelling customer online behaviours with neural networks: applications to conversion prediction and advertising retargeting

Abstract:In this paper, we apply neural networks into digital marketing world for the purpose of better targeting the potential customers. To do so, we model the customer online behaviours using dedicated neural network architectures. Starting from user searched keywords in a search engine to the landing page and different following pages, until the user left the site, we model the whole visited journey with a Recurrent Neural Network (RNN), together with Convolution Neural Networks (CNN) that can take into account of the semantic meaning of user searched keywords and different visited page names. With such model, we use Monte Carlo simulation to estimate the conversion rates of each potential customer in the future visiting. We believe our concept and the preliminary promising results in this paper enable the use of largely available customer online behaviours data for advanced digital marketing analysis.

Via

Access Paper or Ask Questions

Combining multiple resolutions into hierarchical representations for kernel-based image classification

Jul 12, 2016

Yanwei Cui, Sébastien Lefevre, Laetitia Chapel, Anne Puissant

Figure 1 for Combining multiple resolutions into hierarchical representations for kernel-based image classification

Figure 2 for Combining multiple resolutions into hierarchical representations for kernel-based image classification

Figure 3 for Combining multiple resolutions into hierarchical representations for kernel-based image classification

Figure 4 for Combining multiple resolutions into hierarchical representations for kernel-based image classification

Abstract:Geographic object-based image analysis (GEOBIA) framework has gained increasing interest recently. Following this popular paradigm, we propose a novel multiscale classification approach operating on a hierarchical image representation built from two images at different resolutions. They capture the same scene with different sensors and are naturally fused together through the hierarchical representation, where coarser levels are built from a Low Spatial Resolution (LSR) or Medium Spatial Resolution (MSR) image while finer levels are generated from a High Spatial Resolution (HSR) or Very High Spatial Resolution (VHSR) image. Such a representation allows one to benefit from the context information thanks to the coarser levels, and subregions spatial arrangement information thanks to the finer levels. Two dedicated structured kernels are then used to perform machine learning directly on the constructed hierarchical representation. This strategy overcomes the limits of conventional GEOBIA classification procedures that can handle only one or very few pre-selected scales. Experiments run on an urban classification task show that the proposed approach can highly improve the classification accuracy w.r.t. conventional approaches working on a single scale.

* International Conference on Geographic Object-Based Image Analysis (GEOBIA 2016), University of Twente in Enschede, The Netherlands

Via

Access Paper or Ask Questions