Jingliang Duan

Smoothing Policy Iteration for Zero-sum Markov Games

Dec 03, 2022
Yangang Ren, Yao Lyu, Wenxuan Wang, Shengbo Eben Li, Zeyang Li, Jingliang Duan

Zero-sum Markov Games (MGs) have been an efficient framework for multi-agent systems and robust control, wherein a minimax problem is constructed to solve for the equilibrium policies. At present, this formulation is well studied under tabular settings, where the maximum operator is solved exactly to calculate the worst-case value function. However, it is non-trivial to extend such methods to complex tasks, as finding the maximum over large-scale action spaces is usually cumbersome. In this paper, we propose the smoothing policy iteration (SPI) algorithm to solve zero-sum MGs approximately, where the maximum operator is replaced by the weighted LogSumExp (WLSE) function to obtain nearly optimal equilibrium policies. Specifically, the adversarial policy serves as the weight function, enabling efficient sampling over action spaces. We also prove the convergence of SPI and analyze its approximation error in the $\infty$-norm based on the contraction mapping theorem. Besides, we propose a model-based algorithm called Smooth adversarial Actor-critic (SaAC) by extending SPI with function approximation. The target value related to the WLSE function is evaluated from sampled trajectories, a mean-square error is constructed to optimize the value function, and gradient-ascent-descent methods are adopted to optimize the protagonist and adversarial policies jointly. In addition, we incorporate the reparameterization technique in model-based gradient back-propagation to prevent gradient vanishing due to sampling from the stochastic policies. We verify our algorithm in both tabular and function approximation settings. Results show that SPI can approximate the worst-case value function with high accuracy, and SaAC stabilizes the training process and improves adversarial robustness by a large margin.
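As a rough illustration of the smoothing idea (not the paper's exact formulation), the hard maximum over the adversary's actions can be approximated by a weighted LogSumExp computed from actions sampled under the adversarial policy; the function name, uniform weights, and temperature below are illustrative assumptions.

```python
import numpy as np

def wlse_value(q_values, weights, alpha=10.0):
    """Weighted LogSumExp surrogate for max_a Q(s, a) (illustrative sketch).

    q_values : Q(s, a_i) evaluated at adversary actions a_i sampled from
               the adversarial policy, which acts as the weight function.
    weights  : sampling weights (e.g. uniform 1/N for on-policy samples).
    alpha    : temperature; as alpha grows, the surrogate approaches the max.
    """
    q = np.asarray(q_values, dtype=float)
    w = np.asarray(weights, dtype=float)
    m = q.max()  # subtract the max for numerical stability
    return m + np.log(np.sum(w * np.exp(alpha * (q - m)))) / alpha

# Toy check: with a large temperature the surrogate is close to the true max.
q = np.array([1.0, 2.0, 3.5, 0.5])
print(wlse_value(q, np.full(4, 0.25), alpha=50.0))  # ~3.47, close to 3.5
```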

Safe Model-Based Reinforcement Learning with an Uncertainty-Aware Reachability Certificate

Oct 14, 2022
Dongjie Yu, Wenjun Zou, Yujie Yang, Haitong Ma, Shengbo Eben Li, Jingliang Duan, Jianyu Chen

Safe reinforcement learning (RL), which solves for constraint-satisfying policies, provides a promising route to the broader safety-critical application of RL in real-world problems such as robotics. Among safe RL approaches, model-based methods further reduce training-time violations due to their high sample efficiency. However, the lack of safety robustness against model uncertainties remains an issue in safe model-based RL, especially for training-time safety. In this paper, we propose a distributional reachability certificate (DRC) and its Bellman equation to address model uncertainties and characterize robust persistently safe states. Furthermore, we build a safe RL framework to resolve the constraints required by the DRC and its corresponding shield policy. We also devise a line search method to maintain safety and attain higher returns simultaneously while leveraging the shield policy. Comprehensive experiments on classical benchmarks such as constrained tracking and navigation indicate that the proposed algorithm achieves comparable returns with much fewer constraint violations during training.
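For intuition, here is a minimal sketch of a plain (non-distributional) reachability-style Bellman backup on a toy deterministic model; the DRC in the paper extends such a certificate to a distributional form that accounts for model uncertainty. The function and variable names are assumptions, not the paper's notation.

```python
import numpy as np

def reachability_backup(V, h, f, actions, gamma=0.99):
    """One sweep of a reachability-style certificate on a tabular toy model.

    V(s) estimates the worst future constraint violation that the safest
    policy still incurs from s:  V(s) = max( h(s), gamma * min_a V(f(s, a)) ).
    Here h(s) > 0 means the constraint is violated at s, and states with
    V(s) <= 0 can be kept persistently safe under the deterministic model f.
    """
    V_new = np.empty_like(V)
    for s in range(len(V)):
        best_next = min(V[f(s, a)] for a in actions)  # pick the safest action
        V_new[s] = max(h(s), gamma * best_next)
    return V_new

# Toy 1-D chain: states 0..4, the constraint is violated only at state 0.
h = lambda s: 1.0 if s == 0 else -1.0
f = lambda s, a: int(np.clip(s + a, 0, 4))
V = np.array([h(s) for s in range(5)], dtype=float)
for _ in range(100):
    V = reachability_backup(V, h, f, actions=(-1, 0, 1))
print(V)  # state 0 stays unsafe (V > 0); the others stay certified (V <= 0)
```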

* 12 pages, 6 figures 

On the Optimization Landscape of Dynamic Output Feedback: A Case Study for Linear Quadratic Regulator

Sep 12, 2022
Jingliang Duan, Wenhan Cao, Yang Zheng, Lin Zhao

The convergence of policy gradient algorithms in reinforcement learning hinges on the optimization landscape of the underlying optimal control problem. Theoretical insights into these algorithms can often be acquired by analyzing the landscape of linear quadratic control. However, most of the existing literature only considers the optimization landscape for static full-state or output feedback policies (controllers). We investigate the more challenging case of dynamic output-feedback policies for linear quadratic regulation (abbreviated as dLQR), which is prevalent in practice but has a rather complicated optimization landscape. We first show how the dLQR cost varies with the coordinate transformation of the dynamic controller and then derive the optimal transformation for a given observable stabilizing controller. At the core of our results is the uniqueness of the stationary point of dLQR when it is observable, which takes the concise form of an observer-based controller with the optimal similarity transformation. These results shed light on designing efficient algorithms for general decision-making problems with partially observed information.
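For reference on the object being optimized, the LQR cost of a given stabilizing dynamic output-feedback controller can be evaluated in closed form through a discrete Lyapunov equation; the sketch below uses generic plant and controller matrices and is not tied to the paper's notation. Applying any similarity (coordinate) transformation to the controller leaves this cost unchanged, which is the invariance the paper's analysis exploits.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def dlqr_cost(A, B, C, Q, R, W, AK, BK, CK):
    """Steady-state LQR cost of a dynamic output-feedback controller (sketch).

    Plant:      x+ = A x + B u + w,  y = C x,  w ~ N(0, W)
    Controller: xi+ = AK xi + BK y,  u  = CK xi
    Returns E[x'Qx + u'Ru] at stationarity, assuming the closed loop is stable.
    """
    n, m = A.shape[0], AK.shape[0]
    Acl = np.block([[A,      B @ CK],
                    [BK @ C, AK    ]])
    Wcl = np.block([[W,                np.zeros((n, m))],
                    [np.zeros((m, n)), np.zeros((m, m))]])
    Qcl = np.block([[Q,                np.zeros((n, m))],
                    [np.zeros((m, n)), CK.T @ R @ CK   ]])
    Sigma = solve_discrete_lyapunov(Acl, Wcl)  # closed-loop state covariance
    return float(np.trace(Qcl @ Sigma))
```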

* arXiv admin note: substantial text overlap with arXiv:2201.09598 

Global Convergence of Two-timescale Actor-Critic for Solving Linear Quadratic Regulator

Aug 18, 2022
Xuyang Chen, Jingliang Duan, Yingbin Liang, Lin Zhao

Actor-critic (AC) reinforcement learning algorithms have been the powerhouse behind many challenging applications. Nevertheless, their convergence is fragile in general. To study this instability, existing works mostly consider the uncommon double-loop variant or basic models with finite state and action spaces. We investigate the more practical single-sample two-timescale AC for solving the canonical linear quadratic regulator (LQR) problem, where the actor and the critic each update only once with a single sample per iteration on an unbounded continuous state and action space. Existing analyses cannot establish convergence for such a challenging case. We develop a new analysis framework that establishes global convergence to an $\epsilon$-optimal solution with at most an $\tilde{\mathcal{O}}(\epsilon^{-2.5})$ sample complexity. To our knowledge, this is the first finite-time convergence analysis of single-sample two-timescale AC for solving LQR with global optimality. The sample complexity improves upon those of other variants by orders of magnitude, shedding light on the practical wisdom of single-sample algorithms. We further validate our theoretical findings via comprehensive simulation comparisons.
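The update pattern analyzed here can be pictured as one critic step and one actor step per sampled transition, with the critic on the faster timescale (larger step size); the schematic below abstracts away the LQR-specific parameterizations, and all names are illustrative.

```python
def two_timescale_ac_step(theta, w, sample, critic_grad, actor_grad,
                          beta_critic=1e-2, beta_actor=1e-3):
    """One single-sample, two-timescale actor-critic iteration (schematic).

    theta : actor parameters, e.g. the feedback gain K for LQR
    w     : critic parameters, e.g. weights of a quadratic value model
    Each iteration consumes exactly one transition; the critic step size
    exceeds the actor step size, giving the two-timescale structure that
    the finite-time analysis relies on.
    """
    w = w - beta_critic * critic_grad(w, theta, sample)        # fast timescale
    theta = theta - beta_actor * actor_grad(theta, w, sample)  # slow timescale
    return theta, w
```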

Improve Generalization of Driving Policy at Signalized Intersections with Adversarial Learning

Apr 09, 2022
Yangang Ren, Guojian Zhan, Liye Tang, Shengbo Eben Li, Jianhua Jiang, Jingliang Duan

Intersections are among the most challenging driving scenes: the interaction of traffic signals and heterogeneous traffic actors makes it difficult to learn a wise and robust driving policy. Current research rarely considers the diversity of intersections and the stochastic behaviors of traffic participants. For practical applications, this randomness can lead to devastating events, which should be a focus of autonomous driving. This paper introduces an adversarial learning paradigm to boost the intelligence and robustness of the driving policy for signalized intersections with dense traffic flow. First, we design a static path planner capable of generating trackable candidate paths for multiple intersections with diversified topology. Next, a constrained optimal control problem (COCP) is built on these candidate paths, in which the bounded uncertainty of dynamic models is considered to capture the randomness of the driving environment. We propose adversarial policy gradient (APG) to solve the COCP, where the adversarial policy provides disturbances by seeking the most severe uncertainty while the driving policy learns to handle this situation through competition. Finally, a comprehensive system is established for training and testing, in which a perception module is introduced and human experience is incorporated to solve the yellow-light dilemma. Experiments indicate that the trained policy can handle traffic signals flexibly while realizing smooth and efficient passing in a human-like manner. Besides, APG yields a large-margin improvement in resistance to abnormal behaviors and thus ensures a high safety level for the autonomous vehicle.
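The competition between the driving policy and the adversarial policy in APG can be pictured as a simultaneous gradient descent-ascent step on a shared objective; the PyTorch-style sketch below is a generic illustration, not the paper's exact update.

```python
import torch

def descent_ascent_step(protagonist, adversary, objective, opt_p, opt_a):
    """One simultaneous gradient descent-ascent step (illustrative sketch).

    objective(protagonist, adversary) returns a scalar cost that the driving
    (protagonist) policy minimizes while the adversarial policy, which injects
    bounded model uncertainty, tries to maximize it.
    """
    loss = objective(protagonist, adversary)
    opt_p.zero_grad()
    opt_a.zero_grad()
    loss.backward()

    opt_p.step()  # protagonist descends on the cost
    for p in adversary.parameters():  # adversary ascends: flip its gradients
        if p.grad is not None:
            p.grad.neg_()
    opt_a.step()
    return float(loss.item())
```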

Self-learned Intelligence for Integrated Decision and Control of Automated Vehicles at Signalized Intersections

Nov 10, 2021
Yangang Ren, Jianhua Jiang, Dongjie Yu, Shengbo Eben Li, Jingliang Duan, Chen Chen, Keqiang Li

Intersections are among the most complex and accident-prone urban scenarios for autonomous driving, where making safe and computationally efficient decisions is non-trivial. Current research mainly focuses on simplified traffic conditions while ignoring mixed traffic flows, i.e., vehicles, cyclists, and pedestrians. On urban roads, these different participants lead to highly dynamic and complex interactions, posing great difficulty for learning an intelligent policy. This paper develops a dynamic permutation state representation within the framework of integrated decision and control (IDC) to handle signalized intersections with mixed traffic flows. Specifically, this representation introduces an encoding function and a summation operator to construct driving states from environmental observations, capable of dealing with different types and a variable number of traffic participants. A constrained optimal control problem is built whose objective involves tracking performance, and constraints for the different participants and traffic signals are designed respectively to ensure safety. We solve this problem by offline optimization of the encoding function, value function, and policy function, where the encoding function provides the state representation that serves as the input of the policy and value functions. An off-policy training scheme is designed to reuse observations from the driving environment, and backpropagation through time is utilized to update the policy function and encoding function jointly. Verification results show that the dynamic permutation state representation enhances the driving performance of IDC, including comfort, decision compliance, and safety, by a large margin. The trained driving policy realizes efficient and smooth passing through complex intersections, guaranteeing driving intelligence and safety simultaneously.
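A compact picture of the joint update mentioned above: the encoding function and the policy are unrolled through a differentiable prediction model, and backpropagation through the accumulated cost trains both at once. All module names and the horizon are illustrative assumptions.

```python
import torch

def bptt_loss(encoder, policy, model, cost, obs0, horizon=20):
    """Unroll the policy through a differentiable model and sum the cost.

    encoder : maps a set of traffic-participant observations to a fixed-size
              driving state (a stand-in for the dynamic permutation
              state representation).
    model   : differentiable prediction model, (obs_t, u_t) -> obs_{t+1}.
    cost    : per-step cost (tracking error, constraint penalties, ...).
    Calling .backward() on the returned value updates the encoder and the
    policy jointly via backpropagation through time.
    """
    obs, total = obs0, torch.zeros(())
    for _ in range(horizon):
        state = encoder(obs)   # permutation-handling driving state
        u = policy(state)      # ego control command
        total = total + cost(obs, u)
        obs = model(obs, u)    # differentiable rollout step
    return total
```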

Encoding Integrated Decision and Control for Autonomous Driving with Mixed Traffic Flow

Oct 24, 2021
Yangang Ren, Jianhua Jiang, Jingliang Duan, Shengbo Eben Li, Dongjie Yu, Guojian Zhan

Reinforcement learning (RL) has been widely adopted for intelligent driving policies in autonomous driving owing to its self-evolution ability and human-like learning paradigm. Despite many elegant demonstrations of RL-enabled decision-making, current research mainly focuses on pure vehicle-driving environments while ignoring other traffic participants such as bicycles and pedestrians. On urban roads, the interaction of mixed traffic flows leads to highly dynamic and complex relationships, which poses great difficulty for learning a safe and intelligent policy. This paper proposes encoding integrated decision and control (E-IDC) to handle complicated driving tasks with mixed traffic flows; it consists of an encoding function to construct driving states, a value function to choose the optimal path, and a policy function to output the control command of the ego vehicle. Specifically, the encoding function can deal with different types and a variable number of traffic participants and extracts features from the original driving observation. Next, we design the training principle for the functions of E-IDC with RL algorithms by adding gradient-based update rules and refining the safety constraints to account for the differences among participants. Verification is conducted on an intersection scenario with mixed traffic flows, and results show that E-IDC enhances driving performance, including tracking performance and satisfaction of safety constraints, by a large margin. Online application indicates that E-IDC realizes efficient and smooth driving through the complex intersection, guaranteeing intelligence and safety simultaneously.
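One way to read the division of labor in E-IDC: the value function scores each candidate path from the current driving state, and the policy then outputs the control command for the selected path. The routine below is a toy sketch under assumed interfaces (`value_fn(state, path)` and `policy(state, path)` are not the paper's API).

```python
import torch

def select_path_and_control(encoder, value_fn, policy, obs, candidate_paths):
    """Pick the candidate path with the highest value, then compute the control.

    Illustrative only: value_fn(state, path) scores how good it is to track
    `path` from the current driving state, and policy(state, path) returns
    the ego control command for that path.
    """
    state = encoder(obs)  # fixed-size driving state from raw observation
    values = torch.stack([value_fn(state, p) for p in candidate_paths])
    best = int(torch.argmax(values))
    return best, policy(state, candidate_paths[best])
```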

Encoding Distributional Soft Actor-Critic for Autonomous Driving in Multi-lane Scenarios

Sep 12, 2021
Jingliang Duan, Yangang Ren, Fawang Zhang, Yang Guan, Dongjie Yu, Shengbo Eben Li, Bo Cheng, Lin Zhao

In this paper, we propose a new reinforcement learning (RL) algorithm, called encoding distributional soft actor-critic (E-DSAC), for decision-making in autonomous driving. Unlike existing RL-based decision-making methods, E-DSAC is suitable for situations where the number of surrounding vehicles is variable and eliminates the requirement for manually pre-designed sorting rules, resulting in higher policy performance and generality. We first develop an encoding distributional policy iteration (DPI) framework by embedding a permutation-invariant module, which employs a feature neural network (NN) to encode the indicators of each vehicle, into the distributional RL framework. The proposed DPI framework is proven to exhibit important properties in terms of convergence and global optimality. Next, based on the developed encoding DPI framework, we propose the E-DSAC algorithm by adding the gradient-based update rule of the feature NN to the policy evaluation process of the DSAC algorithm. Then, the multi-lane driving task and the corresponding reward function are designed to verify the effectiveness of the proposed algorithm. Results show that the policy learned by E-DSAC can realize efficient, smooth, and relatively safe autonomous driving in the designed scenario, and the final policy performance learned by E-DSAC is about three times that of DSAC. Furthermore, its effectiveness has also been verified in real vehicle experiments.
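The distributional ingredient can be pictured as a critic that outputs a Gaussian over returns rather than a point estimate, fitted by maximizing the likelihood of a soft TD target; the loss below is a generic sketch in that spirit, not the exact DSAC update.

```python
import torch

def gaussian_return_nll(mean, log_std, td_target):
    """Negative log-likelihood of a TD target under a Gaussian return model.

    mean, log_std : critic outputs parameterizing the return distribution
    td_target     : detached soft TD target, e.g. r + gamma * (q' - alpha * logp')
    Minimizing this fits both the expected return and its uncertainty.
    """
    var = torch.exp(2.0 * log_std)
    return 0.5 * (((td_target - mean) ** 2) / var + 2.0 * log_std).mean()
```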

Model-based Chance-Constrained Reinforcement Learning via Separated Proportional-Integral Lagrangian

Aug 26, 2021
Baiyu Peng, Jingliang Duan, Jianyu Chen, Shengbo Eben Li, Genjin Xie, Congsheng Zhang, Yang Guan, Yao Mu, Enxin Sun

Safety is essential for reinforcement learning (RL) applied in the real world. Adding chance constraints (or probabilistic constraints) is a suitable way to enhance RL safety under uncertainty. Existing chance-constrained RL methods, such as penalty methods and Lagrangian methods, either exhibit periodic oscillations or learn an over-conservative or unsafe policy. In this paper, we address these shortcomings by proposing a separated proportional-integral Lagrangian (SPIL) algorithm. We first review the constrained policy optimization process from a feedback control perspective, which regards the penalty weight as the control input and the safe probability as the control output. Based on this, the penalty method is formulated as a proportional controller and the Lagrangian method as an integral controller. We then unify them into a proportional-integral Lagrangian method that obtains the merits of both, with an integral separation technique to limit the integral value to a reasonable range. To accelerate training, the gradient of the safe probability is computed in a model-based manner. We demonstrate that our method reduces the oscillations and conservatism of RL policies in a car-following simulation. To prove its practicality, we also apply our method to a real-world mobile robot navigation task, where our robot successfully avoids a moving obstacle with highly uncertain or even aggressive behaviors.
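The penalty-weight update can be sketched as a tiny PI feedback controller acting on the constraint violation, with integral separation clipping the accumulated error; the gains, required safe probability, and clipping range below are illustrative assumptions.

```python
import numpy as np

class PILagrangianWeight:
    """Penalty weight from a proportional-integral Lagrangian (sketch)."""

    def __init__(self, kp=1.0, ki=0.1, integral_max=10.0):
        self.kp, self.ki, self.integral_max = kp, ki, integral_max
        self.integral = 0.0

    def update(self, safe_prob, required_prob=0.95):
        # Control error: how far the measured safe probability falls short.
        error = required_prob - safe_prob
        # Integral separation: keep the accumulated error in a bounded range.
        self.integral = float(np.clip(self.integral + error, 0.0, self.integral_max))
        # Proportional term (penalty method) + integral term (Lagrangian method).
        return max(0.0, self.kp * error + self.ki * self.integral)

# Usage: at each training iteration the returned weight multiplies the
# constraint-violation term in the policy objective.
```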

Fixed-Dimensional and Permutation Invariant State Representation of Autonomous Driving

May 24, 2021
Jingliang Duan, Dongjie Yu, Shengbo Eben Li, Wenxuan Wang, Yangang Ren, Ziyu Lin, Bo Cheng

In this paper, we propose a new state representation method, called encoding sum and concatenation (ESC), for the state representation of decision-making in autonomous driving. Unlike existing state representation methods, ESC is applicable to a variable number of surrounding vehicles and eliminates the need for manually pre-designed sorting rules, leading to higher representation ability and generality. The proposed ESC method introduces a representation neural network (NN) to encode each surrounding vehicle into an encoding vector and then sums these vectors to obtain the representation vector of the set of surrounding vehicles. By concatenating this set representation with other variables, such as indicators of the ego vehicle and road, we realize a fixed-dimensional and permutation-invariant state representation. We further prove that the proposed ESC method realizes an injective representation if the output dimension of the representation NN is greater than the total number of variables of all surrounding vehicles. This means that, by taking the ESC representation as policy input, we can find nearly optimal representation and policy NNs by optimizing them simultaneously with gradient-based updates. Experiments demonstrate that, compared with the fixed-permutation representation method, the proposed method improves the representation ability for surrounding vehicles, and the corresponding approximation error is reduced by 62.2%.
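A minimal PyTorch sketch of the encoding-sum-and-concatenation idea: a shared representation network encodes each surrounding vehicle, the encodings are summed (making the result permutation invariant and independent of the vehicle count), and the sum is concatenated with the ego/road indicators. Dimensions and module names are assumptions.

```python
import torch
import torch.nn as nn

class ESCState(nn.Module):
    """Encoding sum and concatenation (illustrative sketch)."""

    def __init__(self, per_vehicle_dim=8, encode_dim=64, ego_dim=10):
        super().__init__()
        # Shared representation NN applied to every surrounding vehicle.
        self.encoder = nn.Sequential(
            nn.Linear(per_vehicle_dim, 128), nn.ReLU(),
            nn.Linear(128, encode_dim),
        )

    def forward(self, vehicles, ego):
        # vehicles: (N, per_vehicle_dim) with variable N;  ego: (ego_dim,)
        set_repr = self.encoder(vehicles).sum(dim=0)  # permutation invariant
        return torch.cat([set_repr, ego])             # fixed-dimensional state

# The resulting state feeds the policy/value NNs; the encoder is trained
# jointly with them by ordinary gradient-based updates.
state = ESCState()(torch.randn(5, 8), torch.randn(10))  # five nearby vehicles
```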
