
P. R. Kumar


Natural Actor-Critic for Robust Reinforcement Learning with Function Approximation

Jul 17, 2023
Ruida Zhou, Tao Liu, Min Cheng, Dileep Kalathil, P. R. Kumar, Chao Tian

Figures 1–4 for Natural Actor-Critic for Robust Reinforcement Learning with Function Approximation

We study robust reinforcement learning (RL) with the goal of determining a well-performing policy that is robust against model mismatch between the training simulator and the testing environment. Previous policy-based robust RL algorithms mainly focus on the tabular setting under uncertainty sets that facilitate robust policy evaluation, but are no longer tractable when the number of states scales up. To address this, we propose two novel uncertainty set formulations, one based on double sampling and the other on an integral probability metric. Both make large-scale robust RL tractable even when one only has access to a simulator. We propose a robust natural actor-critic (RNAC) approach that incorporates the new uncertainty sets and employs function approximation. We provide finite-time convergence guarantees for the proposed RNAC algorithm to the optimal robust policy within the function approximation error. Finally, we demonstrate the robust performance of the policy learned by our proposed RNAC approach in multiple MuJoCo environments and a real-world TurtleBot navigation task.
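As a rough illustration of the natural-policy-gradient machinery underlying actor-critic methods like the one above (this sketch omits the paper's robust uncertainty sets and critic entirely, and is not the RNAC algorithm), note that for a softmax policy on a one-state problem the natural gradient step reduces to adding the action values to the logits, since the Fisher preconditioner cancels the policy factor:

```python
import numpy as np

# One-state problem (a bandit) with known action values and a softmax policy.
# With softmax parameterization, the natural policy gradient step is simply
# theta <- theta + eta * Q: the Fisher information inverse cancels the
# pi-weighting in the vanilla gradient.
Q = np.array([1.0, 0.5, -0.2])   # hypothetical action values
theta = np.zeros(3)
eta = 0.5

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(50):
    theta = theta + eta * Q      # exact NPG step (no critic error)

pi = softmax(theta)
print(int(pi.argmax()))          # the policy concentrates on the best action
```

In the function-approximation setting the exact values Q are replaced by a learned critic, which is where the finite-time analysis above does its work.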


Finite Time Regret Bounds for Minimum Variance Control of Autoregressive Systems with Exogenous Inputs

May 26, 2023
Rahul Singh, Akshay Mete, Avik Kar, P. R. Kumar

Figures 1–4 for Finite Time Regret Bounds for Minimum Variance Control of Autoregressive Systems with Exogenous Inputs

Minimum variance controllers have been employed in a wide range of industrial applications. A key challenge experienced by many adaptive controllers is their poor empirical performance in the initial stages of learning. In this paper, we address the problem of initializing them so that they provide acceptable transients, and also provide an accompanying finite-time regret analysis, for adaptive minimum variance control of an autoregressive system with exogenous inputs (ARX). Following [3], we consider a modified version of the Certainty Equivalence (CE) adaptive controller, which we call PIECE, that utilizes probing inputs for exploration. We show that it has a $C \log T$ bound on the regret after $T$ time-steps for bounded noise, and $C\log^2 T$ in the case of sub-Gaussian noise. The simulation results demonstrate the advantage of PIECE over the algorithm proposed in [3] as well as the standard Certainty Equivalence controller, especially in the initial learning phase. To the best of our knowledge, this is the first work that provides finite-time regret bounds for an adaptive minimum variance controller.
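To make the certainty-equivalence-with-probing idea concrete, here is a hedged sketch (not the PIECE algorithm; all parameters, step counts, and the probing schedule are illustrative) for a first-order ARX system y_{t+1} = a*y_t + b*u_t + w_t: the minimum variance input is u_t = -(a/b)*y_t, computed from least-squares estimates of (a, b), with probing inputs injected for excitation:

```python
import numpy as np

rng = np.random.default_rng(0)
a_true, b_true = 0.7, 1.5       # hypothetical ARX(1,1) parameters
a_hat, b_hat = 0.0, 1.0         # initial certainty-equivalence estimates
y = 0.0
Y, U, Ynext = [], [], []

for t in range(2000):
    if t < 20:
        u = rng.normal()                  # initial pure probing for excitation
    else:
        u = -(a_hat / b_hat) * y          # CE minimum-variance input
        if t % 10 == 0:
            u += rng.normal()             # occasional probing input
    y_next = a_true * y + b_true * u + 0.1 * rng.normal()
    Y.append(y); U.append(u); Ynext.append(y_next)
    if t >= 19:                           # least-squares re-estimate of (a, b)
        X = np.column_stack([Y, U])
        a_hat, b_hat = np.linalg.lstsq(X, np.array(Ynext), rcond=None)[0]
    y = y_next

print(round(float(a_hat), 2), round(float(b_hat), 2))   # close to (0.7, 1.5)
```

The probing inputs keep the regressors persistently exciting under closed-loop control; without them, the regulated output shrinks toward pure noise and identification stalls.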


Recommender system as an exploration coordinator: a bounded O(1) regret algorithm for large platforms

Jan 29, 2023
Hyunwook Kang, P. R. Kumar

Figure 1 for Recommender system as an exploration coordinator: a bounded O(1) regret algorithm for large platforms

On typical modern platforms, users are only able to try a small fraction of the available items. This makes it difficult to model the exploration behavior of platform users as typical online learners who explore all the items. To address this issue, we propose to interpret a recommender system as a bandit exploration coordinator that provides counterfactual information updates. In particular, we introduce a novel algorithm called Counterfactual UCB (CFUCB), which guarantees coordinated user exploration with bounded regret in the presence of linear representations. Our results show that sharing information is a Subgame Perfect Nash Equilibrium for agents in terms of regret, leading to each agent achieving bounded regret. This approach has potential applications in personalized recommender systems and adaptive experimentation.
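For readers unfamiliar with the building block, the following is a generic linear UCB index computation on a fixed arm set, the standard single-agent primitive that a coordinator can share information across; this is an illustrative sketch with made-up numbers, not the CFUCB algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)
theta_star = np.array([0.3, 1.0, 0.5])   # hypothetical unknown parameter
arms = np.eye(3)                         # three feature vectors (standard basis)
A = np.eye(3)                            # regularized Gram matrix
b = np.zeros(3)
alpha = 1.0                              # confidence-width scale
picks = []

for t in range(3000):
    theta_hat = np.linalg.solve(A, b)    # ridge estimate of theta_star
    Ainv = np.linalg.inv(A)
    width = np.sqrt(np.array([x @ Ainv @ x for x in arms]))
    k = int((arms @ theta_hat + alpha * width).argmax())   # optimistic index
    picks.append(k)
    r = arms[k] @ theta_star + 0.1 * rng.normal()
    A += np.outer(arms[k], arms[k])
    b += r * arms[k]

print(picks.count(1))   # the best arm (index 1) dominates the pulls
```

Each agent running such an index alone pays for its own exploration; the coordination idea above is that counterfactual updates let agents reuse each other's information, which is what drives the regret down to a bounded quantity.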


TERRA: Beam Management for Outdoor mm-Wave Networks

Jan 10, 2023
Santosh Ganji, Jaewon Kim, Romil Sonigra, P. R. Kumar

Figures 1–4 for TERRA: Beam Management for Outdoor mm-Wave Networks

mm-Wave communication systems use narrow directional beams to overcome the band's characteristic high path and penetration losses. The mobile and the base station primarily employ beams in the line-of-sight (LoS) direction and, when needed, in non-line-of-sight (NLoS) directions. The beam management protocol adapts the base-station and mobile-side beam directions during user mobility and sustains the link during blockages. To avoid an outage during transient pedestrian blockage of the LoS path, the mobile can use a reflected (NLoS) path, which is available in indoor environments. Reflected paths can sustain time synchronization and maintain connectivity during temporary blockages. In outdoor environments, such reflections may not be available, and prior work relied on dense base station deployment or coordinated multi-point access to address the outage problem. Instead of dense and hence cost-intensive network deployments, we found experimentally that the mobile can capitalize on the ground reflection. We developed the TERRA protocol to effectively steer the mobile-side beam during transient blockage events. TERRA avoids outage during pedestrian blockages 84.5\% of the time in outdoor environments on concrete and gravel surfaces. TERRA also enables the mobile to perform a soft handover to a reserve neighbor base station in the event of a permanent blockage, without requiring any side information, unlike existing works. Evaluations show that TERRA maintains received signal strength close to the optimal solution while keeping track of the neighbor base station.
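The control flow described above can be sketched as a small state machine: fall back to a ground-reflected beam during a transient blockage, recover LoS when it returns, and hand over if the blockage persists. The thresholds, timing, and state names below are hypothetical illustrations inferred from the abstract, not TERRA's actual parameters or logic:

```python
# Hypothetical beam-management state machine: rss_trace is a sequence of
# received-signal-strength samples (dBm); los_thresh and timeout are
# illustrative values, not TERRA's.
def beam_manager(rss_trace, los_thresh=-70, timeout=5):
    state, blocked_for, log = "LOS", 0, []
    for rss in rss_trace:
        if rss >= los_thresh:                 # LoS path usable again
            state, blocked_for = "LOS", 0
        else:
            blocked_for += 1
            if blocked_for <= timeout:
                state = "GROUND_REFLECTION"   # weaker path that keeps sync
            else:
                state = "HANDOVER"            # persistent blockage: soft handover
        log.append(state)
    return log

# Transient pedestrian blockage (3 samples), then LoS recovery:
print(beam_manager([-60, -60, -80, -80, -80, -60]))
```

The key property the sketch captures is that the ground-reflected state preserves time synchronization, so the return to "LOS" is immediate rather than requiring user reacquisition.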


Energy System Digitization in the Era of AI: A Three-Layered Approach towards Carbon Neutrality

Nov 02, 2022
Le Xie, Tong Huang, Xiangtian Zheng, Yan Liu, Mengdi Wang, Vijay Vittal, P. R. Kumar, Srinivas Shakkottai, Yi Cui

Figures 1–4 for Energy System Digitization in the Era of AI: A Three-Layered Approach towards Carbon Neutrality

The transition towards carbon-neutral electricity is one of the biggest game changers in addressing climate change since it addresses the dual challenges of removing carbon emissions from the two largest sectors of emitters: electricity and transportation. The transition to a carbon-neutral electric grid poses significant challenges to conventional paradigms of modern grid planning and operation. Much of the challenge arises from the scale of the decision making and the uncertainty associated with the energy supply and demand. Artificial Intelligence (AI) could potentially have a transformative impact on accelerating the speed and scale of carbon-neutral transition, as many decision making processes in the power grid can be cast as classic, though challenging, machine learning tasks. We point out that to amplify AI's impact on carbon-neutral transition of the electric energy systems, the AI algorithms originally developed for other applications should be tailored in three layers of technology, markets, and policy.

* To be published in Patterns (Cell Press) 

Anchor-Changing Regularized Natural Policy Gradient for Multi-Objective Reinforcement Learning

Jun 10, 2022
Ruida Zhou, Tao Liu, Dileep Kalathil, P. R. Kumar, Chao Tian

Figures 1–4 for Anchor-Changing Regularized Natural Policy Gradient for Multi-Objective Reinforcement Learning

We study policy optimization for Markov decision processes (MDPs) with multiple reward value functions, which are to be jointly optimized according to given criteria such as proportional fairness (smooth concave scalarization), hard constraints (constrained MDP), and max-min trade-off. We propose an Anchor-changing Regularized Natural Policy Gradient (ARNPG) framework, which can systematically incorporate ideas from well-performing first-order methods into the design of policy optimization algorithms for multi-objective MDP problems. Theoretically, the designed algorithms based on the ARNPG framework achieve $\tilde{O}(1/T)$ global convergence with exact gradients. Empirically, the ARNPG-guided algorithms also demonstrate superior performance compared to some existing policy gradient-based approaches in both exact gradients and sample-based scenarios.
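As a minimal illustration of one scalarization named above (this is generic natural policy gradient on a proportionally fair objective, not the anchor-changing ARNPG scheme, and the problem instance is made up): on a one-state, two-action problem where objective 1 rewards action 0 and objective 2 rewards action 1, maximizing F = log V1 + log V2 with a softmax policy admits the simple natural gradient step theta_a += eta * dF/dpi_a, since the Fisher preconditioner cancels:

```python
import numpy as np

# Proportional fairness F = log pi[0] + log pi[1]: the two objectives'
# values are V1 = pi[0] and V2 = pi[1], so the fair optimum is the
# even split pi = (0.5, 0.5).
theta = np.array([1.0, 0.0])     # start from an unfair policy
eta = 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(300):
    pi = softmax(theta)
    g = 1.0 / pi                 # dF/dpi for F = log pi[0] + log pi[1]
    theta = theta + eta * g      # natural gradient step for softmax

print(np.round(softmax(theta), 3))   # fair split: [0.5 0.5]
```

The hard-constraint and max-min criteria in the abstract require more than a fixed scalarization, which is where the anchor-changing regularization comes in.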


Augmented RBMLE-UCB Approach for Adaptive Control of Linear Quadratic Systems

Jan 25, 2022
Akshay Mete, Rahul Singh, P. R. Kumar

Figures 1–4 for Augmented RBMLE-UCB Approach for Adaptive Control of Linear Quadratic Systems

We consider the problem of controlling a stochastic linear system with quadratic costs, when its system parameters are not known to the agent -- called the adaptive LQG control problem. We re-examine an approach called "Reward-Biased Maximum Likelihood Estimate" (RBMLE) that was proposed more than forty years ago, and which predates the "Upper Confidence Bound" (UCB) method as well as the definition of "regret". It simply added a term favoring parameters with larger rewards to the estimation criterion. We propose an augmented approach that combines the penalty of the RBMLE method with the constraint of the UCB method, uniting the two approaches to optimization in the face of uncertainty. We first establish that theoretically this method retains $\mathcal{O}(\sqrt{T})$ regret, the best known so far. We show through a comprehensive simulation study that this augmented RBMLE method considerably outperforms the UCB and Thompson sampling approaches, with a regret that is typically less than 50\% of the better of their regrets. The simulation study includes all examples from earlier papers as well as a large collection of randomly generated systems.
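The reward-biasing idea is easiest to see on a two-armed Gaussian bandit rather than the LQG setting of the paper: adding a bias alpha(t) times the reward of the candidate parameter to the log-likelihood shifts each arm's mean estimate up by roughly alpha(t)/n_k, and the arm with the largest biased estimate is played. The sketch below is an illustration of that principle only, with made-up means and bias schedule, not the paper's augmented RBMLE-UCB algorithm:

```python
import numpy as np

rng = np.random.default_rng(2)
means = np.array([0.2, 0.8])      # hypothetical true arm means
n = np.zeros(2)                   # play counts
s = np.zeros(2)                   # reward sums
for k in range(2):                # pull each arm once to initialize
    s[k] += means[k] + 0.1 * rng.normal()
    n[k] += 1

for t in range(1, 2001):
    alpha = np.log(t + 1)         # slowly growing reward bias
    index = s / n + alpha / n     # MLE of each mean, biased toward reward
    k = int(index.argmax())       # play the arm with the largest biased estimate
    s[k] += means[k] + 0.1 * rng.normal()
    n[k] += 1

print(n.astype(int))              # the better arm gets almost all the pulls
```

The bias alpha/n_k shrinks as an arm accumulates pulls, so under-explored arms stay optimistically inflated; this is the sense in which RBMLE anticipates the optimism of UCB.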


Overcoming Pedestrian Blockage in mm-Wave Bands using Ground Reflections

Nov 16, 2021
Santosh Ganji, Romil Sonigra, P. R. Kumar

Figures 1–4 for Overcoming Pedestrian Blockage in mm-Wave Bands using Ground Reflections

mm-Wave communication employs directional beams to overcome high path loss. High data rate communication is typically along the line-of-sight (LoS) path. In outdoor environments, such communication is susceptible to temporary blockage by pedestrians interposed between the transmitter and receiver. This results in outages in which the user is lost and has to be reacquired as a new user, severely disrupting interactive and high-throughput applications. It has been presumed that the solution is a densely deployed set of base stations that allows the mobile to perform a handover to a different, non-blocked base station every time the current base station is blocked. This is, however, a very costly solution for outdoor environments. Through extensive experiments, we show that it is possible to exploit a strong ground reflection, with a received signal strength (RSS) about 4 dB less than the LoS path, in outdoor built environments with concrete or gravel surfaces, for beams that are narrow in azimuth but wide in zenith. While such reflected paths cannot support the high data rates of LoS paths, they can support control channel communication and, importantly, sustain time synchronization between the mobile and the base station. This allows a mobile to quickly recover the LoS path upon the cessation of the temporary blockage, which typically lasts a few hundred milliseconds. We present a simple in-band protocol that quickly discovers ground-reflected radiation and uses it to recover the LoS link when the temporary blockage disappears.


Fast Global Convergence of Policy Optimization for Constrained MDPs

Oct 31, 2021
Tao Liu, Ruida Zhou, Dileep Kalathil, P. R. Kumar, Chao Tian

Figure 1 for Fast Global Convergence of Policy Optimization for Constrained MDPs

We address the issue of safety in reinforcement learning. We pose the problem in a discounted infinite-horizon constrained Markov decision process framework. Existing results have shown that gradient-based methods are able to achieve an $\mathcal{O}(1/\sqrt{T})$ global convergence rate both for the optimality gap and the constraint violation. We exhibit a natural policy gradient-based algorithm that has a faster convergence rate $\mathcal{O}(\log(T)/T)$ for both the optimality gap and the constraint violation. When Slater's condition is satisfied and known a priori, zero constraint violation can be further guaranteed for a sufficiently large $T$ while maintaining the same convergence rate.
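A primal-dual natural policy gradient scheme of the general kind analyzed here can be sketched on a one-state problem (this is a generic Lagrangian sketch with illustrative step sizes, not the paper's algorithm or its accelerated rate): maximize E[r] subject to E[c] <= 0.5 with r = c = [1, 0], whose constrained optimum plays action 0 exactly half the time:

```python
import numpy as np

r = np.array([1.0, 0.0])     # reward: action 0 pays 1
c = np.array([1.0, 0.0])     # cost: action 0 costs 1
budget = 0.5                 # constraint E[c] <= 0.5
theta = np.zeros(2)          # softmax policy logits
lam = 0.0                    # Lagrange multiplier
avg = []

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for t in range(5000):
    pi = softmax(theta)
    theta = theta + 0.05 * (r - lam * c)              # primal NPG ascent step
    lam = max(0.0, lam + 0.05 * (pi @ c - budget))    # dual ascent on violation
    if t >= 1000:                                     # average past the transient
        avg.append(pi[0])

print(round(float(np.mean(avg)), 2))   # near the constrained optimum 0.5
```

Plain primal-dual iterates oscillate around the saddle point, which is why the averaged policy is reported; the paper's contribution is precisely a scheme whose optimality gap and constraint violation shrink at the faster O(log(T)/T) rate.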
