Department of Automation, Shanghai Jiao Tong University, Shanghai, China, Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China, Shanghai Engineering Research Center of Intelligent Control and Management, Shanghai, China
Abstract:Robotic Cellular Warehousing Systems (RCWS) give rise to multi-agent pickup and delivery (MAPD) processes in which robots sequentially collect multiple stock-keeping units (SKUs) for each order. Unlike classical MAPD formulations that assume static tasks, real warehouse operations often involve dynamic order evolution, where new SKUs may be appended to an order while it is being executed. Motivated by this practical requirement, this letter formulates the Dynamic Multi-Agent Pickup and Delivery problem considering internal order evolution for the first time. Building on the token passing paradigm, we propose two event-triggered online replanning algorithms. The first, Dynamic Token Passing, performs localized replanning upon order updates through add-order decomposition and priority-based token scheduling while preserving collision-free execution. The second, Cooperative Token Passing, further enables idle robots to opportunistically assist newly added pickups, improving system-level efficiency. Simulation results in RCWS environments demonstrate that the proposed methods significantly reduce order flowtime compared with static and non-cooperative baselines.
Abstract:We study the sample complexity of policy gradient for log-growth control -- the problem of learning, from observed state transitions, a feedback gain that optimally stabilizes a scalar linear system driven through a multiplicative-noise actuation channel. The objective $J(K) = \mathbb{E}[\log|1+BK|]$ is the top Lyapunov exponent of the closed loop. This problem carries a structural difficulty we call the cusp obstruction: the optimal gain $K^*$ always places the noise singularity $b_{\rm sing}(K) = -1/K$ in the interior of the support. At this singular optimum the policy gradient exists only as a Cauchy principal value, not as a Lebesgue integral, and the natural single-sample gradient estimator has infinite variance. Standard first-order stochastic-optimization analysis is thus inapplicable at the optimum, and merely smoothing the objective does not resolve the difficulty. The obstruction, however, has an exploitable symmetry: the Cauchy kernel is an odd function of the displacement from the moving pole, so pairing each observation with its reflection through the pole cancels the divergent part. This one cancellation simultaneously controls the population curvature, the gradient-estimator variance, and the bias incurred when the noise density is estimated. Combining these bounds with a closed-form single-transition gradient oracle, we prove that projected mini-batch policy gradient, initialized in any compact subset of the stabilizing region, attains total sample complexity $\tilde{O}(1/η)$ when the noise density is known and $\tilde{O}(η^{-(2s+1)/(2s)})$ when it must be estimated, for $C^s$ noise densities with $s \geq 2$.
Abstract:We study multi-robot persistent monitoring on weighted graphs, where node weights encode monitoring priorities and edge weights encode travel distances. The goal is to design joint robot trajectories that minimize the worst-case weighted latency across all nodes over an infinite time horizon. The widely adopted worst-case latency objective evaluates team performance over the entire time horizon and therefore may fail to distinguish strategies with poor transient behavior but strong asymptotic performance. To address this limitation, we propose a family of tail-performance objectives that generalize the standard objective and study the resulting functional optimization problems. We establish several key theoretical properties, including the existence of optimal strategies, relationships among the proposed objectives and their corresponding optimization problems, approximation by periodic solutions to arbitrary accuracy, and reductions to event-driven decision models with discretized waiting times. Building on these results, we construct an equivalent event-driven Markov decision process (MDP), called the Tail Worst-case Latency-Optimizing Markov Decision Process (TWLO-MDP), which reformulates the tail-performance objective as a standard average-reward criterion. We then develop reinforcement-learning-based solution methods for the TWLO-MDP and introduce the multi-robot monitoring benchmark (M2Bench), a unified platform that supports the evaluation and comparison of heuristic and learning-based monitoring algorithms. Experiments on synthetic and realistic monitoring scenarios show that our methods effectively reduce the worst-case weighted latency and outperform representative baselines.
Abstract:Mixture-of-Agents (MoA) improves LLM performance through layered collaboration, but its dense topology raises costs and latency. Existing methods employ LLM judges to filter responses, yet still require all models to perform inference before judging, failing to cut costs effectively. They also lack model selection criteria and struggle with large model pools, where full inference is costly and can exceed context limits. To address this, we propose RouteMoA, an efficient mixture-of-agents framework with dynamic routing. It employs a lightweight scorer to perform initial screening by predicting coarse-grained performance from the query, narrowing candidates to a high-potential subset without inference. A mixture of judges then refines these scores through lightweight self- and cross-assessment based on existing model outputs, providing posterior correction without additional inference. Finally, a model ranking mechanism selects models by balancing performance, cost, and latency. RouteMoA outperforms MoA across varying tasks and model pool sizes, reducing cost by 89.8% and latency by 63.6% in the large-scale model pool.
Abstract:Vision-Language-Action (VLA) models are experiencing rapid development and demonstrating promising capabilities in robotic manipulation tasks. However, scaling up VLA models presents several critical challenges: (1) Training new VLA models from scratch demands substantial computational resources and extensive datasets. Given the current scarcity of robot data, it becomes particularly valuable to fully leverage well-pretrained VLA model weights during the scaling process. (2) Real-time control requires carefully balancing model capacity with computational efficiency. To address these challenges, We propose AdaMoE, a Mixture-of-Experts (MoE) architecture that inherits pretrained weights from dense VLA models, and scales up the action expert by substituting the feedforward layers into sparsely activated MoE layers. AdaMoE employs a decoupling technique that decouples expert selection from expert weighting through an independent scale adapter working alongside the traditional router. This enables experts to be selected based on task relevance while contributing with independently controlled weights, allowing collaborative expert utilization rather than winner-takes-all dynamics. Our approach demonstrates that expertise need not monopolize. Instead, through collaborative expert utilization, we can achieve superior performance while maintaining computational efficiency. AdaMoE consistently outperforms the baseline model across key benchmarks, delivering performance gains of 1.8% on LIBERO and 9.3% on RoboTwin. Most importantly, a substantial 21.5% improvement in real-world experiments validates its practical effectiveness for robotic manipulation tasks.




Abstract:The safety of training task policies and their subsequent application using reinforcement learning (RL) methods has become a focal point in the field of safe RL. A central challenge in this area remains the establishment of theoretical guarantees for safety during both the learning and deployment processes. Given the successful implementation of Control Barrier Function (CBF)-based safety strategies in a range of control-affine robotic systems, CBF-based safe RL demonstrates significant promise for practical applications in real-world scenarios. However, integrating these two approaches presents several challenges. First, embedding safety optimization within the RL training pipeline requires that the optimization outputs be differentiable with respect to the input parameters, a condition commonly referred to as differentiable optimization, which is non-trivial to solve. Second, the differentiable optimization framework confronts significant efficiency issues, especially when dealing with multi-constraint problems. To address these challenges, this paper presents a CBF-based safe RL architecture that effectively mitigates the issues outlined above. The proposed approach constructs a continuous AND logic approximation for the multiple constraints using a single composite CBF. By leveraging this approximation, a close-form solution of the quadratic programming is derived for the policy network in RL, thereby circumventing the need for differentiable optimization within the end-to-end safe RL pipeline. This strategy significantly reduces computational complexity because of the closed-form solution while maintaining safety guarantees. Simulation results demonstrate that, in comparison to existing approaches relying on differentiable optimization, the proposed method significantly reduces training computational costs while ensuring provable safety throughout the training process.
Abstract:Blending green hydrogen into natural gas presents a promising approach for renewable energy integration and fuel decarbonization. Accurate estimation of hydrogen fraction in hydrogen-enriched natural gas (HENG) pipeline networks is crucial for operational safety and efficiency, yet it remains challenging due to complex dynamics. While existing data-driven approaches adopt end-to-end architectures for HENG flow state estimation, their limited adaptability to varying operational conditions hinders practical applications. To this end, this study proposes a graph-enhanced DeepONet framework for the real-time estimation of HENG flow, especially hydrogen fractions. First, a dual-network architecture, called branch network and trunk network, is employed to characterize operational conditions and sparse sensor measurements to estimate the HENG state at targeted locations and time points. Second, a graph-enhance branch network is proposed to incorporate pipeline topology, improving the estimation accuracy in large-scale pipeline networks. Experimental results demonstrate that the proposed method achieves superior estimation accuracy for HCNG flow under varying operational conditions compared to conventional approaches.
Abstract:Intelligent reflecting surface (IRS)-assisted mobile edge computing (MEC) systems have shown notable improvements in efficiency, such as reduced latency, higher data rates, and better energy efficiency. However, the resource competition among users will lead to uneven allocation, increased latency, and lower throughput. Fortunately, the rate-splitting multiple access (RSMA) technique has emerged as a promising solution for managing interference and optimizing resource allocation in MEC systems. This paper studies an IRS-assisted MEC system with RSMA, aiming to jointly optimize the passive beamforming of the IRS, the active beamforming of the base station, the task offloading allocation, the transmit power of users, the ratios of public and private information allocation, and the decoding order of the RSMA to minimize the average delay from a novel uplink transmission perspective. Since the formulated problem is non-convex and the optimization variables are highly coupled, we propose a hierarchical deep reinforcement learning-based algorithm to optimize both continuous and discrete variables of the problem. Additionally, to better extract channel features, we design a novel network architecture within the policy and evaluation networks of the proposed algorithm, combining convolutional neural networks and densely connected convolutional network for feature extraction. Simulation results indicate that the proposed algorithm not only exhibits excellent convergence performance but also outperforms various benchmarks.



Abstract:Unmanned aerial vehicles (UAVs) assisted Internet of things (IoT) systems have become an important part of future wireless communications. To achieve higher communication rate, the joint design of UAV trajectory and resource allocation is crucial. This letter considers a scenario where a multi-antenna UAV is dispatched to simultaneously collect data from multiple ground IoT nodes (GNs) within a time interval. To improve the sum data collection (SDC) volume, i.e., the total data volume transmitted by the GNs, the UAV trajectory, the UAV receive beamforming, the scheduling of the GNs, and the transmit power of the GNs are jointly optimized. Since the problem is non-convex and the optimization variables are highly coupled, it is hard to solve using traditional optimization methods. To find a near-optimal solution, a double-loop structured optimization-driven deep reinforcement learning (DRL) algorithm and a fully DRL-based algorithm are proposed to solve the problem effectively. Simulation results verify that the proposed algorithms outperform two benchmarks with significant improvement in SDC volumes.




Abstract:Intelligent reflecting surface (IRS) and rate-splitting multiple access (RSMA) technologies are at the forefront of enhancing spectrum and energy efficiency in the next generation multi-antenna communication systems. This paper explores a RSMA system with multiple IRSs, and proposes two purpose-driven scheduling schemes, i.e., the exhaustive IRS-aided (EIA) and opportunistic IRS-aided (OIA) schemes. The aim is to optimize the system weighted energy efficiency (EE) under the above two schemes, respectively. Specifically, the Dinkelbach, branch and bound, successive convex approximation, and the semidefinite relaxation methods are exploited within the alternating optimization framework to obtain effective solutions to the considered problems. The numerical findings indicate that the EIA scheme exhibits better performance compared to the OIA scheme in diverse scenarios when considering the weighted EE, and the proposed algorithm demonstrates superior performance in comparison to the baseline algorithms.