Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Abstract:Two-time-scale optimization is a framework introduced in Zeng et al. (2024) that abstracts a range of policy evaluation and policy optimization problems in reinforcement learning (RL). Akin to bi-level optimization under a particular type of stochastic oracle, the two-time-scale optimization framework has an upper level objective whose gradient evaluation depends on the solution of a lower level problem, which is to find the root of a strongly monotone operator. In this work, we propose a new method for solving two-time-scale optimization that achieves significantly faster convergence than the prior arts. The key idea of our approach is to leverage an averaging step to improve the estimates of the operators in both lower and upper levels before using them to update the decision variables. These additional averaging steps eliminate the direct coupling between the main variables, enabling the accelerated performance of our algorithm. We characterize the finite-time convergence rates of the proposed algorithm under various conditions of the underlying objective function, including strong convexity, convexity, Polyak-Lojasiewicz condition, and general non-convexity. These rates significantly improve over the best-known complexity of the standard two-time-scale stochastic approximation algorithm. When applied to RL, we show how the proposed algorithm specializes to novel online sample-based methods that surpass or match the performance of the existing state of the art. Finally, we support our theoretical results with numerical simulations in RL.

Via

Figures and Tables:

Abstract:Multi-task reinforcement learning (RL) aims to find a single policy that effectively solves multiple tasks at the same time. This paper presents a constrained formulation for multi-task RL where the goal is to maximize the average performance of the policy across tasks subject to bounds on the performance in each task. We consider solving this problem both in the centralized setting, where information for all tasks is accessible to a single server, and in the decentralized setting, where a network of agents, each given one task and observing local information, cooperate to find the solution of the globally constrained objective using local communication. We first propose a primal-dual algorithm that provably converges to the globally optimal solution of this constrained formulation under exact gradient evaluations. When the gradient is unknown, we further develop a sampled-based actor-critic algorithm that finds the optimal policy using online samples of state, action, and reward. Finally, we study the extension of the algorithm to the linear function approximation setting.

Via

Abstract:At the heart of power system operations, alternating current optimal power flow (ACOPF) studies the generation of electric power in the most economical way under network-wide load requirement, and can be formulated as a highly structured non-convex quadratically constrained quadratic program (QCQP). Optimization-based solutions to ACOPF (such as ADMM or interior-point method), as the classic approach, require large amount of computation and cannot meet the need to repeatedly solve the problem as load requirement frequently changes. On the other hand, learning-based methods that directly predict the ACOPF solution given the load input incur little computational cost but often generates infeasible solutions (i.e. violate the constraints of ACOPF). In this work, we combine the best of both worlds -- we propose an innovated framework for learning ACOPF, where the input load is mapped to the ACOPF solution through a neural network in a computationally efficient and reliable manner. Key to our innovation is a specific-purpose "activation function" defined implicitly by a QCQP and a novel loss, which enforce constraint satisfaction. We show through numerical simulations that our proposed method achieves superior feasibility rate and generation cost in situations where the existing learning-based approaches fail.

Via

Abstract:We study learning-based design of fair allocation mechanisms for divisible resources, using proportional fairness (PF) as a benchmark. The learning setting is a significant departure from the classic mechanism design literature, in that, we need to learn fair mechanisms solely from data. In particular, we consider the challenging problem of learning one-shot allocation mechanisms -- without the use of money -- that incentivize strategic agents to be truthful when reporting their valuations. It is well-known that the mechanism that directly seeks to optimize PF is not incentive compatible, meaning that the agents can potentially misreport their preferences to gain increased allocations. We introduce the notion of "exploitability" of a mechanism to measure the relative gain in utility from misreport, and make the following important contributions in the paper: (i) Using sophisticated techniques inspired by differentiable convex programming literature, we design a numerically efficient approach for computing the exploitability of the PF mechanism. This novel contribution enables us to quantify the gap that needs to be bridged to approximate PF via incentive compatible mechanisms. (ii) Next, we modify the PF mechanism to introduce a trade-off between fairness and exploitability. By properly controlling this trade-off using data, we show that our proposed mechanism, ExPF-Net, provides a strong approximation to the PF mechanism while maintaining low exploitability. This mechanism, however, comes with a high computational cost. (iii) To address the computational challenges, we propose another mechanism ExS-Net, which is end-to-end parameterized by a neural network. ExS-Net enjoys similar (slightly inferior) performance and significantly accelerated training and inference time performance. (iv) Extensive numerical simulations demonstrate the robustness and efficacy of the proposed mechanisms.

Via

Figures and Tables:

Abstract:The aim of this paper is to improve the understanding of the optimization landscape for policy optimization problems in reinforcement learning. Specifically, we show that the superlevel set of the objective function with respect to the policy parameter is always a connected set both in the tabular setting and under policies represented by a class of neural networks. In addition, we show that the optimization objective as a function of the policy parameter and reward satisfies a stronger "equiconnectedness" property. To our best knowledge, these are novel and previously unknown discoveries. We present an application of the connectedness of these superlevel sets to the derivation of minimax theorems for robust reinforcement learning. We show that any minimax optimization program which is convex on one side and is equiconnected on the other side observes the minimax equality (i.e. has a Nash equilibrium). We find that this exact structure is exhibited by an interesting robust reinforcement learning problem under an adversarial reward attack, and the validity of its minimax equality immediately follows. This is the first time such a result is established in the literature.

Via

Figures and Tables:

Abstract:We study the sequential decision-making problem of allocating a limited resource to agents that reveal their stochastic demands on arrival over a finite horizon. Our goal is to design fair allocation algorithms that exhaust the available resource budget. This is challenging in sequential settings where information on future demands is not available at the time of decision-making. We formulate the problem as a discrete time Markov decision process (MDP). We propose a new algorithm, SAFFE, that makes fair allocations with respect to the entire demands revealed over the horizon by accounting for expected future demands at each arrival time. The algorithm introduces regularization which enables the prioritization of current revealed demands over future potential demands depending on the uncertainty in agents' future demands. Using the MDP formulation, we show that SAFFE optimizes allocations based on an upper bound on the Nash Social Welfare fairness objective, and we bound its gap to optimality with the use of concentration bounds on total future demands. Using synthetic and real data, we compare the performance of SAFFE against existing approaches and a reinforcement learning policy trained on the MDP. We show that SAFFE leads to more fair and efficient allocations and achieves close-to-optimal performance in settings with dense arrivals.

Via

Figures and Tables:

Abstract:We study the problem of finding the Nash equilibrium in a two-player zero-sum Markov game. Due to its formulation as a minimax optimization program, a natural approach to solve the problem is to perform gradient descent/ascent with respect to each player in an alternating fashion. However, due to the non-convexity/non-concavity of the underlying objective function, theoretical understandings of this method are limited. In our paper, we consider solving an entropy-regularized variant of the Markov game. The regularization introduces structure into the optimization landscape that make the solutions more identifiable and allow the problem to be solved more efficiently. Our main contribution is to show that under proper choices of the regularization parameter, the gradient descent ascent algorithm converges to the Nash equilibrium of the original unregularized problem. We explicitly characterize the finite-time performance of the last iterate of our algorithm, which vastly improves over the existing convergence bound of the gradient descent ascent algorithm without regularization. Finally, we complement the analysis with numerical simulations that illustrate the accelerated convergence of the algorithm.

Via

Figures and Tables:

Abstract:With the increasing penetration of distributed energy resources, distributed optimization algorithms have attracted significant attention for power systems applications due to their potential for superior scalability, privacy, and robustness to a single point-of-failure. The Alternating Direction Method of Multipliers (ADMM) is a popular distributed optimization algorithm; however, its convergence performance is highly dependent on the selection of penalty parameters, which are usually chosen heuristically. In this work, we use reinforcement learning (RL) to develop an adaptive penalty parameter selection policy for the AC optimal power flow (ACOPF) problem solved via ADMM with the goal of minimizing the number of iterations until convergence. We train our RL policy using deep Q-learning, and show that this policy can result in significantly accelerated convergence (up to a 59% reduction in the number of iterations compared to existing, curvature-informed penalty parameter selection methods). Furthermore, we show that our RL policy demonstrates promise for generalizability, performing well under unseen loading schemes as well as under unseen losses of lines and generators (up to a 50% reduction in iterations). This work thus provides a proof-of-concept for using RL for parameter selection in ADMM for power systems applications.

Via

Figures and Tables:

Abstract:We consider a discounted cost constrained Markov decision process (CMDP) policy optimization problem, in which an agent seeks to maximize a discounted cumulative reward subject to a number of constraints on discounted cumulative utilities. To solve this constrained optimization program, we study an online actor-critic variant of a classic primal-dual method where the gradients of both the primal and dual functions are estimated using samples from a single trajectory generated by the underlying time-varying Markov processes. This online primal-dual natural actor-critic algorithm maintains and iteratively updates three variables: a dual variable (or Lagrangian multiplier), a primal variable (or actor), and a critic variable used to estimate the gradients of both primal and dual variables. These variables are updated simultaneously but on different time scales (using different step sizes) and they are all intertwined with each other. Our main contribution is to derive a finite-time analysis for the convergence of this algorithm to the global optimum of a CMDP problem. Specifically, we show that with a proper choice of step sizes the optimality gap and constraint violation converge to zero in expectation at a rate $\mathcal{O}(1/K^{1/6})$, where K is the number of iterations. To our knowledge, this paper is the first to study the finite-time complexity of an online primal-dual actor-critic method for solving a CMDP problem. We also validate the effectiveness of this algorithm through numerical simulations.

Via

Figures and Tables:

Abstract:We study a novel two-time-scale stochastic gradient method for solving optimization problems where the gradient samples are generated from a time-varying Markov random process parameterized by the underlying optimization variable. These time-varying samples make the stochastic gradient biased and dependent, which can potentially lead to the divergence of the iterates. To address this issue, we consider a two-time-scale update scheme, where one scale is used to estimate the true gradient from the Markovian samples and the other scale is used to update the decision variable with the estimated gradient. While these two iterates are implemented simultaneously, the former is updated "faster" (using bigger step sizes) than the latter (using smaller step sizes). Our first contribution is to characterize the finite-time complexity of the proposed two-time-scale stochastic gradient method. In particular, we provide explicit formulas for the convergence rates of this method under different objective functions, namely, strong convexity, convexity, non-convexity under the PL condition, and general non-convexity. Our second contribution is to apply our framework to study the performance of the popular actor-critic methods in solving stochastic control and reinforcement learning problems. First, we study an online natural actor-critic algorithm for the linear-quadratic regulator and show that a convergence rate of $\mathcal{O}(k^{-2/3})$ is achieved. This is the first time such a result is known in the literature. Second, we look at the standard online actor-critic algorithm over finite state and action spaces and derive a convergence rate of $\mathcal{O}(k^{-2/5})$, which recovers the best known rate derived specifically for this problem. Finally, we support our theoretical analysis with numerical simulations where the convergence rate is visualized.

Via