Abstract:Group-relative RL training (GRPO) samples a small group of parallel rollouts for every training prompt and uses their within-group reward spread to compute per-trajectory advantages. In agentic environments each rollout is a long multi-turn dialogue with one LLM call per step, so this multi-sample multiplier dominates the total training cost. When every rollout of a prompt ends with the same reward, the group has zero reward variance and contributes no gradient, so the extra rollouts add no information; such groups are common in practice (typically around 40% of all groups), so the wasted-compute fraction is substantial rather than marginal. Existing methods filter such groups at the prompt level, either after their rollouts are paid for or before any rollout begins, but both decide without using information that becomes available during the rollout itself. We instead ask whether the in-group divergence between the partial trajectories at an intermediate step can already predict that the group will be zero-variance: when the parallel rollouts have already converged on the same action prefix, the group is on track to produce a single reward, and we can stop early. We propose a one-parameter gate that stops a group when the mean pairwise prefix edit distance between its partial action sequences falls below a threshold. On a 60-iteration on-policy GRPO run on ALFWorld with Qwen2.5-7B, averaged over four random seeds, the gated arm finishes 10.7% faster in wall-clock (bootstrap 95% CI excludes 0) and shifts held-out success rate on 50 unseen tasks by +2.5 pp, with the held-out gain tracing to a measurable reduction in zero-advantage gradient-batch dilution. Code is available at https://github.com/zhiyuanZhai20/selective-rollout.
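The gate described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the use of unnormalized Levenshtein distance over action tokens, and the example threshold are all assumptions made for the sketch.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two sequences (unnormalized; an assumption)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def mean_pairwise_distance(prefixes):
    """Mean edit distance over all unordered pairs of partial action sequences."""
    pairs = list(combinations(prefixes, 2))
    if not pairs:
        return 0.0
    return sum(edit_distance(a, b) for a, b in pairs) / len(pairs)


def should_stop_group(prefixes, threshold):
    """Hypothetical gate: stop the group once its rollouts have converged."""
    return mean_pairwise_distance(prefixes) < threshold


# Three rollouts that took identical actions so far: the gate fires.
converged = [["look", "go"], ["look", "go"], ["look", "go"]]
# Three rollouts that still disagree: the group keeps running.
diverse = [["look", "go"], ["open", "take"], ["look", "take"]]
print(should_stop_group(converged, threshold=0.5))
print(should_stop_group(diverse, threshold=0.5))
```

The single threshold is the only tunable quantity here, which matches the one-parameter description in the abstract; how the distance is normalized and at which step the gate is evaluated are left open.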
Abstract:Current LLM agents operate under an implicit but universal assumption: execution is a transaction -- the user submits a request, the agent works in isolation, and only upon completion does the dialogue resume. This forces users into a binary choice: wait for a potentially incorrect output, or interrupt and lose all progress. We reject this assumption and propose the stream paradigm, in which agent execution and user intervention are concurrent, interleaved processes sharing a bidirectional channel. We formalize this paradigm through a reversibility taxonomy that classifies every agent action as Idempotent, Reversible, Compensable, or Irreversible, and arrive at a core conclusion: an agent's flexibility is bounded by its reversibility. We prove that conflicting compensable actions impose unavoidable adaptation costs and that conflicting irreversible actions make full specification satisfaction impossible -- these costs are properties of the action space, not of the algorithm. Guided by this insight, we present the Revision Absorber, a reactive algorithm based on the Earliest-Conflict Rollback rule that is structurally optimal under mild assumptions. Experiments on StreamBench with real LLM agents validate all predictions: the Absorber matches the quality of a brute-force full-restart baseline while wasting an order of magnitude fewer steps of already-completed work, turning mid-execution revisions from a dead-end into a first-class interaction.
Abstract:Test-time compute scaling, the practice of spending extra computation during inference via repeated sampling, search, or extended reasoning, has become a powerful lever for improving large language model performance. Yet deploying these techniques under finite inference budgets requires a decision that current systems largely ignore: which inputs deserve more compute, and which can be answered cheaply? We formalize this as a constrained optimization problem (maximize expected accuracy subject to an average compute budget) and solve it with a two-stage Solve-then-Learn pipeline. In the solve stage, Lagrangian relaxation decomposes the global constraint into per-instance sub-problems, each admitting a closed-form oracle action that optimally prices accuracy against cost. We prove that the induced cost is monotone in the dual variable, enabling exact budget targeting via binary search. In the learn stage, a lightweight classifier is trained to predict oracle actions from cheap input features, amortizing the allocation rule for real-time deployment. We establish that the task-level regret of the learned policy is bounded by its imitation error times the worst-case per-instance gap, yielding a clean reduction from constrained inference to supervised classification. Experiments on MATH and GSM8K with three LLMs (DeepSeek-V3, GPT-4o-mini, Qwen2.5-7B) show that our method consistently outperforms uniform and heuristic allocation baselines, achieving up to 12.8% relative accuracy improvement on MATH under matched budget constraints, while closely tracking the Lagrangian oracle upper bound with over 91% imitation accuracy.
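The solve stage can be illustrated with a small sketch. It assumes a table `acc[i][a]` of expected accuracy for instance `i` under compute action `a` and a per-action cost list `costs`; these names, the tie-breaking rule, and the search bounds are hypothetical and not taken from the paper.

```python
def oracle_actions(acc, costs, lam):
    """Per-instance oracle: maximize accuracy - lam * cost (ties go to the cheaper action)."""
    actions = []
    for row in acc:
        best = max(
            range(len(costs)),
            key=lambda a: (row[a] - lam * costs[a], -costs[a]),
        )
        actions.append(best)
    return actions


def average_cost(actions, costs):
    """Mean compute cost of the chosen actions."""
    return sum(costs[a] for a in actions) / len(actions)


def fit_dual(acc, costs, budget, lo=0.0, hi=10.0, iters=50):
    """Binary search for the smallest dual variable whose oracle meets the budget.

    Relies on the induced cost being non-increasing in lam, and assumes that
    `hi` is large enough for the cheapest action to be chosen everywhere.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2
        if average_cost(oracle_actions(acc, costs, mid), costs) <= budget:
            hi = mid
        else:
            lo = mid
    return hi


# Toy data: 3 instances, 3 compute levels costing 1, 4, and 16 units.
acc = [[0.90, 0.95, 0.96], [0.20, 0.60, 0.90], [0.50, 0.50, 0.50]]
costs = [1, 4, 16]
lam = fit_dual(acc, costs, budget=4.0)
actions = oracle_actions(acc, costs, lam)
print(lam, actions, average_cost(actions, costs))
```

In this toy case the easy instance and the insensitive instance get the cheapest action, while the instance that benefits most from extra compute gets the middle one. The learn stage would then train a classifier to predict `actions` from cheap input features.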
Abstract:Does reinforcement learning genuinely expand what LLM agents can do, or merely make them more reliable? For static reasoning, recent work answers the second: base and RL pass@k curves converge at large k. We ask whether this holds for agentic tool use, where T rounds of interaction enable compositional strategies that re-sampling cannot recover. We introduce PASS@(k,T), a two-dimensional metric that jointly varies sampling budget k and interaction depth T, separating capability expansion from efficiency improvement. Our main finding is that, contrary to the static-reasoning result, tool-use RL genuinely enlarges the capability boundary: the RL agent's pass-curve pulls above the base model's and the gap widens at large k rather than converging. The expansion is specific to compositional, sequential information gathering; on simpler tasks RL behaves as prior work predicts. Under matched training data, supervised fine-tuning regresses the boundary on the same compositional tasks, isolating self-directed exploration as the causal factor. Mechanism analysis shows RL reweights the base strategy distribution toward the subset whose downstream reasoning more often yields a correct answer, with the improvement concentrated on how the agent integrates retrieved information. These results reconcile optimistic and pessimistic readings of RL for LLMs: both are correct, on different task types.
Abstract:While distributed learning offers a new learning paradigm for distributed networks with no central coordination, it is constrained by the communication bottleneck between nodes. We develop a new event-triggered gossip framework for distributed learning to reduce inter-node communication overhead. The framework introduces an adaptive communication control mechanism that enables each node to autonomously decide, in a fully decentralized fashion, when to exchange model information with its neighbors based on local model deviations. We analyze the ergodic convergence of the proposed framework under nonconvex objectives and interpret the convergence guarantees under different triggering conditions. Simulation results show that the proposed framework achieves substantially lower communication overhead than state-of-the-art distributed learning methods, reducing cumulative point-to-point transmissions by 71.61% with only a marginal performance loss, compared with the conventional full-communication baseline.
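A deviation-based trigger of this kind might look like the sketch below. The abstract does not specify the triggering condition or the averaging weights, so the Euclidean-norm test, the fixed `threshold`, and the uniform averaging over stored neighbor copies are all assumptions chosen for illustration.

```python
import math


def deviation(model, last_sent):
    """Euclidean distance between the current model and the last broadcast copy."""
    return math.sqrt(sum((m - s) ** 2 for m, s in zip(model, last_sent)))


def gossip_round(models, last_sent, neighbors, threshold):
    """One event-triggered round (hypothetical rule, not the paper's exact one).

    A node transmits only if its model drifted more than `threshold` from the
    copy its neighbors already hold; `last_sent` is updated in place. Every
    node then averages its own model with the latest copies of its neighbors.
    """
    transmissions = 0
    for i, model in enumerate(models):
        if deviation(model, last_sent[i]) > threshold:
            last_sent[i] = list(model)
            transmissions += len(neighbors[i])  # one point-to-point send per neighbor
    new_models = []
    for i, model in enumerate(models):
        group = [model] + [last_sent[j] for j in neighbors[i]]
        new_models.append([sum(vals) / len(group) for vals in zip(*group)])
    return new_models, transmissions


# Path graph 0 - 1 - 2; only node 1 has drifted from its last broadcast.
neighbors = [[1], [0, 2], [1]]
models = [[0.0], [1.0], [2.0]]
last_sent = [[0.0], [0.0], [2.0]]
new_models, sent = gossip_round(models, last_sent, neighbors, threshold=0.5)
print(new_models, sent)
```

Raising `threshold` suppresses more transmissions at the cost of averaging over staler neighbor copies, which is the trade-off the convergence analysis in the abstract is concerned with.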
Abstract:Low-altitude economy (LAE) is rapidly emerging as a key driver of innovation, encompassing economic activities taking place in airspace below 500 meters. Unmanned aerial vehicles (UAVs) provide valuable tools for logistics collection within LAE systems, offering the ability to navigate through complex environments, avoid obstacles, and improve operational efficiency. However, logistics collection tasks involve UAVs flying through complex three-dimensional (3D) environments while avoiding obstacles, where traditional UAV trajectory design methods, typically developed under free-space conditions without explicitly accounting for obstacles, are not applicable. In this paper, we propose a novel algorithm that combines the Lin-Kernighan-Helsgaun (LKH) and Deep Deterministic Policy Gradient (DDPG) methods to minimize the total collection time. Specifically, the LKH algorithm determines the optimal order of item collection, while the DDPG algorithm designs the flight trajectory between collection points. Simulations demonstrate that the proposed LKH-DDPG algorithm reduces collection time by approximately 49 percent compared to baseline approaches, thereby highlighting its effectiveness in optimizing UAV trajectories and enhancing operational efficiency for logistics collection tasks in the LAE paradigm.
Abstract:Edge AI, which brings artificial intelligence to the edge of the network for real-time processing and decision-making, has emerged as a transformative technology across various applications. However, the deployment of Edge AI systems faces significant challenges due to high energy consumption and extended operation time. In this paper, we consider an Edge AI system that integrates the data acquisition, computation, and communication processes, and we focus on improving the learning performance of this system. We model the time and energy consumption of different processes and perform a rigorous convergence analysis to quantify the impact of key system parameters, such as the amount of collected data and the number of training rounds, on the learning performance. Based on this analysis, we formulate a system-wide optimization problem that seeks to maximize learning performance under given time and energy constraints. We explore both homogeneous and heterogeneous device scenarios, developing low-complexity algorithms based on one-dimensional search and alternating optimization to jointly optimize data collection time and training rounds. Simulation results validate the accuracy of our convergence analysis and demonstrate the effectiveness of the proposed algorithms, providing valuable insights into designing energy-efficient Edge AI systems under real-world conditions.
Abstract:To exploit unprecedented data generation in mobile edge networks, federated learning (FL) has emerged as a promising alternative to conventional centralized machine learning (ML). However, there are some critical challenges for FL deployment. One major challenge, the straggler issue, severely limits FL's coverage, as the device with the weakest channel condition becomes the bottleneck of the model aggregation performance. Besides, the huge uplink communication overhead compromises the effectiveness of FL, which is particularly pronounced in large-scale systems. To address the straggler issue, we propose the integration of an unmanned aerial vehicle (UAV) as the parameter server (UAV-PS) to coordinate the FL implementation. We further employ the over-the-air computation technique, which leverages the superposition property of wireless channels for efficient uplink communication. Specifically, in this paper, we develop a novel UAV-enabled over-the-air asynchronous FL (UAV-AFL) framework which supports the UAV-PS in updating the model continuously to enhance the learning performance. Moreover, we conduct a convergence analysis to quantitatively capture the impact of model asynchrony, device selection and communication errors on the UAV-AFL learning performance. Based on this, a unified communication-learning problem is formulated to maximize asymptotic learning performance by optimizing the UAV-PS trajectory, device selection and over-the-air transceiver design. Simulation results demonstrate that the proposed scheme achieves substantial learning efficiency improvement compared with state-of-the-art approaches.
Abstract:Decentralized federated learning (DFL), inherited from distributed optimization, is an emerging paradigm to leverage the explosively growing data from wireless devices in a fully distributed manner. DFL enables joint training of a machine learning model via device-to-device (D2D) communication without the coordination of a parameter server. However, the deployment of wireless DFL faces some pivotal challenges. Communication is a critical bottleneck due to the extensive message exchange required between neighboring devices to share the learned model. Besides, consensus becomes increasingly difficult as the number of devices grows because there is no central server available to perform coordination. To overcome these difficulties, this paper proposes employing over-the-air computation (Aircomp) to improve communication efficiency by exploiting the superposition property of analog waveforms in multi-access channels, and introduces the mixing matrix mechanism to promote consensus using the spectral properties of symmetric doubly stochastic matrices. Specifically, we develop a novel multiple-input multiple-output over-the-air DFL (MIMO OA-DFL) framework to study the over-the-air DFL problem over MIMO multiple access channels. We conduct a general convergence analysis to quantitatively capture the influence of aggregation weight and communication error on the MIMO OA-DFL performance in ad hoc networks. The result shows that the communication error together with the spectral gap of the mixing matrix has a significant impact on the learning performance. Based on this, a joint communication-learning optimization problem is formulated to optimize the transceiver beamformers and the mixing matrix. Extensive numerical experiments are performed to reveal the characteristics of different topologies and demonstrate the substantial learning performance enhancement of our proposed algorithm.
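To make the role of a symmetric doubly stochastic mixing matrix concrete, the sketch below builds one with Metropolis-Hastings weights and applies it repeatedly. Metropolis-Hastings weighting is a standard construction used here as a stand-in; the abstract does not say which mixing matrix the paper optimizes, and the error-free scalar averaging ignores the communication error that the paper analyzes.

```python
def metropolis_weights(adjacency):
    """Symmetric doubly stochastic mixing matrix from a 0/1 adjacency matrix."""
    n = len(adjacency)
    degree = [sum(row) for row in adjacency]
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and adjacency[i][j]:
                W[i][j] = 1.0 / (1 + max(degree[i], degree[j]))
        W[i][i] = 1.0 - sum(W[i])  # self-weight makes each row sum to 1
    return W


def mix(W, x):
    """One consensus step: every node takes a weighted average of its neighbors."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


# Path graph 0 - 1 - 2 with one scalar parameter per device.
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
W = metropolis_weights(adjacency)
x = [0.0, 3.0, 6.0]
for _ in range(50):
    x = mix(W, x)
print(W)
print(x)  # every entry approaches the network average, 3.0
```

For this path graph the eigenvalues of `W` are 1, 2/3, and 0, so the disagreement shrinks by a factor of 2/3 per step. A larger spectral gap (here 1/3) means faster consensus, which is why the gap appears in the convergence result.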


Abstract:Unmanned aerial vehicles (UAVs) and reconfigurable intelligent surfaces (RISs) have recently been applied in the field of mobile edge computing (MEC) to improve the data exchange environment by proactively changing the wireless channels through maneuverable location deployment and intelligent signal reflection, respectively. Nevertheless, they may suffer from inherent limitations in practical scenarios. UAV-mounted RIS (U-RIS), as a promising integrated approach, can combine the advantages of UAV and RIS to overcome these limitations. Inspired by this, we consider a novel U-RIS assisted MEC system, where a U-RIS is deployed to assist the communication between the ground users and an MEC server. The joint UAV trajectory, RIS passive beamforming and MEC resource allocation design is developed to maximize the energy efficiency (EE) of the system. To tackle the intractable non-convex problem, we divide it into two subproblems and solve them iteratively based on successive convex approximation (SCA) and the Dinkelbach method. Finally, we obtain a high-performance suboptimal solution. Simulation results show that the proposed algorithm significantly improves the energy efficiency of the MEC system.
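The Dinkelbach method mentioned above turns a ratio objective such as energy efficiency into a sequence of subtractive problems. The sketch below shows the iteration on a toy single-variable problem, with a grid search standing in for the inner solver; the rate and power models, the grid, and all names are assumptions for illustration, and the paper's inner step uses SCA instead.

```python
import math


def dinkelbach(candidates, numerator, denominator, tol=1e-9, max_iter=100):
    """Maximize numerator(x) / denominator(x) over a finite candidate set.

    Assumes denominator(x) > 0 for every candidate. Each iteration solves
    max_x numerator(x) - q * denominator(x) and then updates q to the ratio
    attained at the maximizer; the loop stops when that maximum is near zero.
    """
    q = 0.0
    best = candidates[0]
    for _ in range(max_iter):
        best = max(candidates, key=lambda x: numerator(x) - q * denominator(x))
        gap = numerator(best) - q * denominator(best)
        q = numerator(best) / denominator(best)
        if gap < tol:
            break
    return best, q


def rate(p):
    """Toy achievable rate for transmit power p with a channel gain of 4."""
    return math.log2(1.0 + 4.0 * p)


def total_power(p):
    """Toy power model: transmit power plus 1.0 unit of circuit power."""
    return p + 1.0


powers = [i * 0.1 for i in range(51)]  # candidate transmit powers 0.0 .. 5.0
best_power, best_ee = dinkelbach(powers, rate, total_power)
print(best_power, best_ee)
```

On a finite candidate set the ratio `q` increases at every step until no candidate improves it, so the result matches a brute-force search over `powers` while only ever solving the easier subtractive problem.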