Abstract: Reinforcement learning (RL) has demonstrated significant promise in enhancing the reasoning capabilities of Text2SQL LLMs, especially with advanced algorithms such as GRPO and DAPO. However, the performance of these methods is highly sensitive to the design of reward functions. Inappropriate rewards can lead to reward hacking, where models exploit loopholes in the reward structure to achieve high scores without genuinely solving the task. This work considers a constrained RL framework for Text2SQL that incorporates natural and interpretable reward and constraint signals, while dynamically balancing the trade-offs among them during training. We establish theoretical guarantees for our constrained RL framework, and our numerical experiments on well-known Text2SQL datasets substantiate the improvement of our approach over state-of-the-art RL-trained LLMs.
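
As a rough sketch of the kind of dynamic trade-off balancing this abstract describes, the snippet below implements a generic Lagrangian relaxation with projected dual ascent; the constraint names ("syntax", "schema"), budgets, and step sizes are illustrative assumptions, not the paper's actual formulation.

```python
# Hedged sketch: Lagrangian reward shaping with dual ascent. The constraint
# signals and budgets below are hypothetical examples of "interpretable"
# Text2SQL checks, not the paper's exact reward design.

def shaped_reward(task_reward: float,
                  constraint_costs: dict[str, float],
                  duals: dict[str, float]) -> float:
    """Lagrangian: task reward minus dual-weighted constraint violations."""
    return task_reward - sum(duals[k] * c for k, c in constraint_costs.items())

def update_duals(duals: dict[str, float],
                 avg_costs: dict[str, float],
                 budgets: dict[str, float],
                 lr: float = 0.05) -> dict[str, float]:
    """Projected dual ascent: raise a multiplier while its constraint's
    average cost exceeds its budget, shrink it otherwise (clipped at 0)."""
    return {k: max(0.0, duals[k] + lr * (avg_costs[k] - budgets[k])) for k in duals}

# Toy usage: two constraints with per-batch average violation rates.
duals = {"syntax": 1.0, "schema": 1.0}
budgets = {"syntax": 0.05, "schema": 0.10}       # tolerated violation rates
avg_costs = {"syntax": 0.20, "schema": 0.08}     # measured on a rollout batch
duals = update_duals(duals, avg_costs, budgets)
r = shaped_reward(task_reward=1.0,
                  constraint_costs={"syntax": 1.0, "schema": 0.0},
                  duals=duals)
```
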
Abstract: Recently, diffusion models have gained considerable attention in trajectory optimization due to their capability of modeling multi-modal probability distributions. However, addressing nonlinear equality constraints, i.e., dynamic feasibility, remains a major challenge in diffusion-based trajectory optimization. Recent diffusion-based trajectory optimization frameworks rely on a single-shooting-style approach in which the denoised control sequence is applied to forward-propagate the dynamical system; this cannot explicitly enforce constraints on the states and frequently leads to sub-optimal solutions. In this work, we propose a novel direct trajectory optimization approach via model-based diffusion, which directly generates a sequence of states. To ensure dynamic feasibility, we propose a gradient-free projection mechanism that is incorporated into the reverse diffusion process. Our results show that, compared to a recent state-of-the-art baseline, our approach achieves zero dynamic-feasibility error and an approximately 4x higher success rate in a quadrotor waypoint-navigation scenario involving dense static obstacles.
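
To illustrate the general shape of interleaving a feasibility projection with reverse diffusion over state sequences, here is a toy sketch on a 1-D double integrator; the placeholder denoiser and the tracking-based projection are our own stand-ins, not the paper's gradient-free mechanism.

```python
import numpy as np

# Hedged sketch: reverse diffusion over a (pos, vel) state sequence with a
# feasibility projection applied at every denoising step. Dynamics, schedule,
# and projection are illustrative assumptions.

DT, T = 0.1, 30

def project_feasible(states: np.ndarray, u_max: float = 2.0) -> np.ndarray:
    """Map a denoised state sequence onto one reachable under the dynamics."""
    out = np.zeros_like(states)
    out[0] = states[0]                       # pin the initial condition
    for t in range(T - 1):
        pos, vel = out[t]
        # control that best matches the target position one step ahead
        u = (states[t + 1, 0] - pos - vel * DT) / (0.5 * DT ** 2)
        u = np.clip(u, -u_max, u_max)
        out[t + 1] = [pos + vel * DT + 0.5 * u * DT ** 2, vel + u * DT]
    return out

def denoise(x: np.ndarray) -> np.ndarray:
    """Stand-in for a learned/model-based score step: pull toward a smooth path."""
    x_s = x.copy()
    x_s[1:-1] = 0.5 * (x[:-2] + x[2:])       # simple smoothing as a proxy
    return x_s

def reverse_diffusion(n_steps: int = 50) -> np.ndarray:
    x = np.random.randn(T, 2)                # noisy (pos, vel) trajectory
    for k in reversed(range(n_steps)):
        x = 0.9 * x + 0.1 * denoise(x)       # placeholder denoiser step
        x = project_feasible(x)              # enforce dynamics each step
        x += (k / n_steps) * 0.01 * np.random.randn(*x.shape)  # zero at k=0
    return x
```
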
Abstract: In this work, we present a method for minimizing the time required for a redundant dual-arm robot to follow a desired relative Cartesian path at constant path speed by optimizing its joint trajectories, subject to position, velocity, and acceleration limits. The problem is reformulated as a bi-level optimization whose lower level is a convex, closed-form subproblem that maximizes path speed for a fixed trajectory, while the upper level updates the trajectory using a single-chain kinematic formulation and the subgradient of the lower-level value. Numerical results demonstrate the effectiveness of the proposed approach.
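
A minimal sketch of what such a closed-form lower-level subproblem can look like: at constant path speed sdot, joint velocities scale as q'(s)·sdot and accelerations as q''(s)·sdot², so the limits yield an explicit maximum speed. The finite-difference discretization below is an assumption for illustration.

```python
import numpy as np

# Hedged sketch: closed-form maximum constant path speed for a fixed joint
# trajectory q(s), sampled along the path with spacing ds. An illustrative
# reconstruction, not the paper's exact expressions.

def max_constant_path_speed(q: np.ndarray, ds: float,
                            v_max: np.ndarray, a_max: np.ndarray) -> float:
    dq = np.gradient(q, ds, axis=0)          # q'(s), shape (N, n_joints)
    ddq = np.gradient(dq, ds, axis=0)        # q''(s)
    # qdot = q' * sdot  and  qddot = q'' * sdot^2 at constant path speed,
    # so each limit gives a per-sample, per-joint bound on sdot.
    sdot_vel = np.min(v_max / (np.abs(dq) + 1e-12))
    sdot_acc = np.sqrt(np.min(a_max / (np.abs(ddq) + 1e-12)))
    return min(sdot_vel, sdot_acc)
```
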
Abstract: Transformer models (TMs) have exhibited remarkable in-context reinforcement learning (ICRL) capabilities, allowing them to generalize to and improve in previously unseen environments without re-training or fine-tuning. This is typically accomplished by imitating the complete learning histories of a source RL algorithm over a substantial number of pretraining environments, which, however, may transfer suboptimal behaviors inherited from the source algorithm/dataset. In this work, we therefore address the issue of inheriting suboptimality from the perspective of dataset preprocessing. Motivated by the success of weighted empirical risk minimization, we propose a simple yet effective approach, learning history filtering (LHF), to enhance ICRL by reweighting and filtering the learning histories based on their improvement and stability characteristics. To the best of our knowledge, LHF is the first approach to avoid source suboptimality via dataset preprocessing, and it can be combined with the current state-of-the-art (SOTA) ICRL algorithms. We substantiate the effectiveness of LHF through a series of experiments on well-known ICRL benchmarks, encompassing both discrete environments and continuous robotic manipulation tasks, with three SOTA ICRL algorithms (AD, DPT, DICP) as the backbones. LHF exhibits robust performance across a variety of suboptimal scenarios, as well as under varying hyperparameters and sampling strategies. Notably, the superior performance of LHF becomes more pronounced in the presence of noisy data, indicating the significance of filtering learning histories.
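
The reweight-and-filter idea lends itself to a short sketch: score each history by an improvement statistic (slope of episodic returns) and a stability statistic, then sample pretraining histories from a softmax over the scores with a hard quantile cutoff. The specific statistics, temperature, and cutoff below are illustrative guesses, not LHF's exact definitions.

```python
import numpy as np

# Hedged sketch: score learning histories by improvement and stability, then
# derive filtered sampling weights for pretraining.

def history_scores(returns_per_history: list[np.ndarray],
                   alpha: float = 1.0, beta: float = 0.5) -> np.ndarray:
    scores = []
    for r in returns_per_history:
        t = np.arange(len(r))
        improvement = np.polyfit(t, r, 1)[0]     # slope of returns over time
        stability = -np.std(np.diff(r))          # penalize erratic learning
        scores.append(alpha * improvement + beta * stability)
    return np.asarray(scores)

def filtered_sampling_weights(scores: np.ndarray,
                              temp: float = 1.0,
                              keep_quantile: float = 0.5) -> np.ndarray:
    w = np.exp((scores - scores.max()) / temp)   # numerically stable softmax
    w[scores < np.quantile(scores, 1 - keep_quantile)] = 0.0  # hard filter
    return w / w.sum()
```
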
Abstract: Multi-agent reinforcement learning is a challenging and active field of research due to the inherent nonstationarity and the coupling between agents. A popular approach to modeling the multi-agent interactions underlying the multi-agent RL problem is the Markov Game. A special class of Markov Games, termed Markov Potential Games, allows the game to be reduced to a single-objective optimal control problem whose objective function is a potential function. In this work, we prove that a multi-agent collaborative field coverage problem, which arises in many engineering applications, can be formulated as a Markov Potential Game, and that we can learn a parameterized closed-loop Nash equilibrium by solving an equivalent single-objective optimal control problem. As a result, our algorithm is 10x faster during training compared to a game-theoretic baseline and converges faster during policy execution.
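
A toy sketch of why the potential-game structure helps: all agents' parameters can be updated by ascending one shared potential (here a smooth coverage score) instead of running best-response loops over agents. The coverage model and zeroth-order ascent are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

# Hedged sketch: joint ascent on a single shared potential for all agents,
# using a smooth toy coverage objective and finite differences.

def coverage_potential(positions: np.ndarray, cells: np.ndarray,
                       radius: float = 1.0) -> float:
    """Soft fraction of field cells covered by the nearest agent."""
    d = np.linalg.norm(cells[:, None, :] - positions[None, :, :], axis=-1)
    return float(np.mean(np.exp(-(d.min(axis=1) / radius) ** 2)))

def joint_ascent_step(positions: np.ndarray, cells: np.ndarray,
                      lr: float = 0.1, eps: float = 1e-3) -> np.ndarray:
    """Zeroth-order ascent on the shared potential (one objective, all agents)."""
    grad = np.zeros_like(positions)
    base = coverage_potential(positions, cells)
    for i in np.ndindex(*positions.shape):
        p = positions.copy(); p[i] += eps
        grad[i] = (coverage_potential(p, cells) - base) / eps
    return positions + lr * grad
```
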
Abstract: Many inference scaling methods work by breaking a problem into smaller steps (or groups of tokens), then sampling and choosing the best next step. However, these steps and their sizes are usually predetermined based on human intuition or domain knowledge. This paper introduces dynamic decomposition, a method that automatically and adaptively splits solution and reasoning traces into steps during inference. This approach improves computational efficiency by focusing more resources on difficult steps, breaking them down further and prioritizing their sampling. Experiments on coding and math benchmarks (APPS, MATH, and LiveCodeBench) show that dynamic decomposition outperforms static methods that rely on fixed decompositions such as token-level, sentence-level, or single-step. These results suggest that dynamic decomposition can enhance many inference scaling techniques.
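
The control flow of such adaptive splitting can be sketched as a short recursion: sample candidates for the current step, and when agreement among them is low, split the step and recurse with a larger budget. All helper functions below are simplified stand-ins for model-specific components, not the paper's procedures.

```python
import random

# Hedged sketch: recursive dynamic decomposition with budget reallocation.

def sample_step(step: str) -> str:
    """Stand-in for sampling one candidate continuation from an LLM."""
    return step + random.choice([" A", " B"])

def agreement(cands: list[str]) -> float:
    """Fraction of candidates matching the most common one."""
    return max(cands.count(c) for c in set(cands)) / len(cands)

def split(step: str) -> list[str]:
    """Stand-in for splitting a hard step into two finer sub-steps."""
    mid = len(step) // 2
    return [step[:mid], step[mid:]]

def best_of(cands: list[str]) -> str:
    return max(set(cands), key=cands.count)

def solve(step: str, budget: int, depth: int = 0, max_depth: int = 3):
    candidates = [sample_step(step) for _ in range(budget)]
    if agreement(candidates) > 0.7 or depth == max_depth:
        return best_of(candidates)           # easy step: pick and move on
    # hard step: decompose further and spend more budget on each piece
    return [solve(sub, budget * 2, depth + 1, max_depth) for sub in split(step)]
```
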
Abstract: In this paper, we present a control design methodology, stability criteria, and performance bounds for autonomous helicopter aerial refueling. Autonomous aerial refueling is particularly difficult due to the aerodynamic interaction with the tanker's wake, the contact-sensitive nature of the maneuver, and the uncertainty in drogue motion. Since the probe tip is located significantly away from the helicopter's center of gravity, its position (and velocity) is strongly sensitive to the helicopter's attitude (and angular rates). In addition, the fact that the helicopter must operate at high speed to match the velocity of the tanker forces it to maintain a particular orientation, making the docking maneuver especially challenging. We propose a novel outer-loop position controller that incorporates the probe's position and velocity, which depend on both the translational and rotational states of the aircraft, into the feedback loop. We derive analytical guarantees for docking performance in terms of the uncertainty of the drogue motion and the angular acceleration of the helicopter, using the ultimate boundedness property of the closed-loop error dynamics. Simulations are performed on a high-fidelity UH60 helicopter model with realistic drogue motion under wind effects to validate the proposed approach for realistic refueling scenarios. These simulations reveal that the proposed control methodology yields a 36% improvement in the 2-norm docking error compared to the existing standard controller.
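
The kinematic coupling that motivates probe-tip feedback admits a compact sketch: the tip state inherits attitude and angular-rate terms through the body-frame offset, and a PD outer loop can regulate the tip directly. The offset and gains below are illustrative values, not those of the paper's UH60 controller.

```python
import numpy as np

# Hedged sketch: probe-tip kinematics and a PD outer loop on the tip error.

R_PROBE_BODY = np.array([3.0, 0.0, -0.5])   # probe tip offset in body frame [m]

def probe_tip_state(p_cg, v_cg, R_body_to_inertial, omega_body):
    """Tip position/velocity pick up attitude and angular-rate terms."""
    p_tip = p_cg + R_body_to_inertial @ R_PROBE_BODY
    v_tip = v_cg + R_body_to_inertial @ np.cross(omega_body, R_PROBE_BODY)
    return p_tip, v_tip

def outer_loop_accel_cmd(p_tip, v_tip, p_drogue, v_drogue,
                         kp: float = 1.2, kd: float = 2.0):
    """PD law on the probe-tip error, to be passed to the inner attitude loop."""
    return kp * (p_drogue - p_tip) + kd * (v_drogue - v_tip)
```
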
Abstract: In pursuit of reinforcement learning systems that can train in physical environments, we investigate multi-task approaches as a means to alleviate the need for massive data acquisition. In a tabular scenario where the Q-functions are collected across tasks, we model our learning problem as optimizing a higher-order tensor structure. Recognizing that closely related tasks may require similar actions, our proposed method imposes a low-rank condition on this aggregated Q-tensor. The rationale behind this approach to multi-task learning is that the low-rank structure enforces the notion of similarity without the need to explicitly prescribe which tasks are similar, instead inferring this information from a reduced amount of data simultaneously with the stochastic optimization of the Q-tensor. The efficiency of our low-rank tensor approach to multi-task learning is demonstrated in two numerical experiments: first in a benchmark environment formed by a collection of inverted pendulums, and then in a practical scenario involving multiple wireless communication devices.
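
One concrete way to realize such a low-rank Q-tensor, shown below as a hedged sketch, is a rank-r CP factorization over (task, state, action) with stochastic TD updates on the factor rows; the dimensions, rank, and step sizes are illustrative assumptions rather than the paper's setup.

```python
import numpy as np

# Hedged sketch: Q[task, state, action] as a rank-r CP factorization; a TD
# error at one entry updates only the corresponding factor rows, so data from
# one task shapes the shared factors used by all tasks.

rng = np.random.default_rng(0)
n_tasks, n_states, n_actions, rank = 4, 50, 5, 3
A = rng.normal(0, 0.1, (n_tasks, rank))     # task factors
B = rng.normal(0, 0.1, (n_states, rank))    # state factors
C = rng.normal(0, 0.1, (n_actions, rank))   # action factors

def q_value(m: int, s: int, a: int) -> float:
    return float(np.sum(A[m] * B[s] * C[a]))

def td_update(m: int, s: int, a: int, r: float, s_next: int,
              gamma: float = 0.99, lr: float = 0.05) -> None:
    target = r + gamma * max(q_value(m, s_next, b) for b in range(n_actions))
    delta = target - q_value(m, s, a)
    # gradients of the CP entry w.r.t. each factor row,
    # applied sequentially in place (Gauss-Seidel style)
    A[m] += lr * delta * B[s] * C[a]
    B[s] += lr * delta * A[m] * C[a]
    C[a] += lr * delta * A[m] * B[s]
```
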
Abstract: In this work, we present an approach to minimizing the time necessary for the end-effector of a redundant robot manipulator to traverse a Cartesian path by optimizing the trajectory of its joints. Each joint is subject to position, velocity, and acceleration limits, with the latter making abrupt changes (jerks) in joint space undesirable. The proposed approach takes this nonlinear optimization problem, whose variables are the path speed and the joint trajectory, and reformulates it as a bi-level problem. The lower level is a convex subproblem that, for a fixed joint trajectory, maximizes path speed subject to all joint velocity and acceleration constraints; under particular conditions, this subproblem admits a closed-form solution. We then solve the higher-level subproblem by leveraging the directional derivative of the lower-level value with respect to the joint-trajectory parameters. In particular, we use this direction to implement a primal-dual method that accounts for path accuracy and joint position constraints. We show the efficacy of the proposed approach with simulations and experimental results.
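
A sketch of the upper-level primal-dual iteration under these assumptions: move the trajectory parameters along an ascent direction of the lower-level value (approximated here by finite differences in place of the directional derivative), while dual variables penalize path-accuracy and joint-limit violations. The callables and step sizes are illustrative stand-ins for the paper's formulation.

```python
import numpy as np

# Hedged sketch: one primal-dual step on trajectory parameters theta, where
# V(theta) is the lower-level optimal path speed and constraints(theta) <= 0
# encodes path accuracy and joint position limits.

def primal_dual_step(theta: np.ndarray, lam: np.ndarray,
                     V, constraints,          # callables supplied by the user
                     lr_p: float = 1e-2, lr_d: float = 1e-1,
                     eps: float = 1e-4) -> tuple[np.ndarray, np.ndarray]:
    g = np.asarray(constraints(theta))
    # finite-difference direction of the Lagrangian L = V - lam^T g
    grad = np.zeros_like(theta)
    base = V(theta) - lam @ g
    for i in range(theta.size):
        t = theta.copy(); t[i] += eps
        grad[i] = (V(t) - lam @ np.asarray(constraints(t)) - base) / eps
    theta = theta + lr_p * grad               # primal ascent on path speed
    lam = np.maximum(0.0, lam + lr_d * g)     # dual ascent on violations
    return theta, lam
```
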
Abstract: The electricity grid's resiliency and climate change strongly influence one another through an array of technical and policy-related decisions that affect both. This paper introduces a physics-informed machine learning framework to enhance grid resiliency. Specifically, it designs remedial control actions that prevent blackouts when disruptive events are encountered. The proposed Physics-Guided Reinforcement Learning (PG-RL) framework determines effective real-time remedial line-switching actions, considering their impact on power balance, system security, and grid reliability. To identify an effective blackout mitigation policy, PG-RL leverages power-flow sensitivity factors to guide the RL exploration during agent training. Comprehensive evaluations using the Grid2Op platform demonstrate that incorporating physical signals into RL significantly improves resource utilization within electric grids and yields better blackout mitigation policies, both of which are critical in addressing climate change.
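
The physics-guided exploration idea can be sketched as a biased epsilon-greedy rule: explore line-switching actions in proportion to a sensitivity-based score rather than uniformly. The scoring input below is a placeholder; Grid2Op's actual observation and action interfaces are richer than shown.

```python
import numpy as np

# Hedged sketch: epsilon-greedy action selection where exploration is biased
# toward actions with high power-flow sensitivity scores (e.g., predicted
# overload relief from distribution factors), instead of uniform sampling.

def guided_exploration_action(q_values: np.ndarray,
                              sensitivity_scores: np.ndarray,
                              epsilon: float = 0.2,
                              rng=np.random.default_rng()) -> int:
    if rng.random() > epsilon:
        return int(np.argmax(q_values))          # exploit the learned policy
    # explore, but weighted toward physically promising switches
    s = np.clip(sensitivity_scores, 1e-6, None)
    return int(rng.choice(len(q_values), p=s / s.sum()))
```
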