Shengbo Eben Li

Safe Reinforcement Learning for Autonomous Vehicles through Parallel Constrained Policy Optimization

Mar 03, 2020
Lu Wen, Jingliang Duan, Shengbo Eben Li, Shaobing Xu, Huei Peng

Reinforcement learning (RL) is attracting increasing interest in autonomous driving due to its potential to solve complex classification and control problems. However, existing RL algorithms are rarely applied to real vehicles for two predominant reasons: their behaviours are hard to explain, and they cannot guarantee safety in new scenarios. This paper presents a safe RL algorithm, called Parallel Constrained Policy Optimization (PCPO), for two autonomous driving tasks. PCPO extends today's common actor-critic architecture to a three-component learning framework, in which three neural networks approximate the policy function, the value function, and a newly added risk function, respectively. Meanwhile, a trust-region constraint is added to allow large update steps without breaking the monotonic improvement condition. To ensure the feasibility of the safety-constrained problem, synchronized parallel learners are employed to explore different state spaces, which accelerates learning and policy updates. Simulations of two autonomous driving scenarios confirm that safety can be ensured while achieving fast learning.
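
As a rough illustration of the three-component framework described above, the sketch below wires a policy network, a value network, and a newly added risk network to the same state input. The layer sizes, dimensions, and the reading of the risk output as an expected safety cost are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (not the authors' code) of a three-component actor-critic-risk learner.
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64):
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.Tanh(),
        nn.Linear(hidden, hidden), nn.Tanh(),
        nn.Linear(hidden, out_dim),
    )

class PCPOAgent(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.policy = mlp(state_dim, action_dim)   # actor: state -> action mean
        self.value = mlp(state_dim, 1)             # critic: expected return
        self.risk = mlp(state_dim, 1)              # risk function: assumed expected safety cost

    def forward(self, state):
        return self.policy(state), self.value(state), self.risk(state)

agent = PCPOAgent(state_dim=8, action_dim=2)       # hypothetical dimensions
s = torch.randn(4, 8)                              # a batch of 4 placeholder states
action, value, risk = agent(s)
```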

Mixed Reinforcement Learning with Additive Stochastic Uncertainty

Feb 28, 2020
Yao Mu, Shengbo Eben Li, Chang Liu, Qi Sun, Bingbing Nie, Bo Cheng, Baiyu Peng

Reinforcement learning (RL) methods often rely on massive exploration data to search for optimal policies and suffer from poor sampling efficiency. This paper presents a mixed reinforcement learning (mixed RL) algorithm that simultaneously uses dual representations of the environmental dynamics to search for the optimal policy, with the aim of improving both learning accuracy and training speed. The dual representations are the environmental model and the state-action data: the former can accelerate the learning process, but its inherent model uncertainty generally leads to worse policy accuracy than the latter, which comes from direct measurements of states and actions. In the design of the mixed RL framework, compensation of the additive stochastic model uncertainty is embedded inside the policy-iteration RL framework by using explored state-action data via an iterative Bayesian estimator (IBE). The optimal policy is then computed iteratively by alternating between policy evaluation (PEV) and policy improvement (PIM). The convergence of mixed RL is proved using Bellman's principle of optimality, and the recursive stability of the generated policy is proved via Lyapunov's direct method. The effectiveness of mixed RL is demonstrated on a typical optimal control problem for a stochastic non-affine nonlinear system (a double-lane-change task with an automated vehicle).
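
The sketch below illustrates the dual-representation idea on a toy one-dimensional system: a nominal dynamics model is corrected by an additive uncertainty term estimated recursively from explored state-action data, standing in for the paper's iterative Bayesian estimator. The dynamics, noise levels, and the simple recursive Gaussian estimator are assumptions.

```python
# Minimal sketch: nominal model + recursively estimated additive uncertainty.
import numpy as np

def f_nominal(x, u):
    return 0.9 * x + 0.5 * u          # assumed (imperfect) model of the environment

class AdditiveUncertaintyEstimator:
    """Running Gaussian estimate of the additive model error d = x_next - f_nominal(x, u)."""
    def __init__(self):
        self.mean, self.var, self.n = 0.0, 1.0, 0

    def update(self, x, u, x_next):
        residual = x_next - f_nominal(x, u)
        self.n += 1
        delta = residual - self.mean
        self.mean += delta / self.n                                # recursive mean update
        self.var += (delta * (residual - self.mean) - self.var) / self.n

    def predict(self, x, u):
        return f_nominal(x, u) + self.mean                         # compensated prediction

est = AdditiveUncertaintyEstimator()
rng = np.random.default_rng(0)
x = 0.0
for _ in range(200):                                               # explore with random actions
    u = rng.normal()
    x_next = 0.85 * x + 0.5 * u + 0.1 + 0.01 * rng.normal()        # "true" environment (assumed)
    est.update(x, u, x_next)
    x = x_next
print(est.predict(x, 0.0))
```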

Distributional Soft Actor-Critic: Off-Policy Reinforcement Learning for Addressing Value Estimation Errors

Feb 23, 2020
Jingliang Duan, Yang Guan, Shengbo Eben Li, Yangang Ren, Bo Cheng

In current reinforcement learning (RL) methods, function approximation errors are known to lead to overestimated or underestimated Q-values, resulting in suboptimal policies. We show that learning a state-action return distribution function can improve the accuracy of the Q-value estimates. We employ the return distribution function within the maximum entropy RL framework to develop what we call the Distributional Soft Actor-Critic (DSAC) algorithm, an off-policy method for the continuous control setting. Unlike traditional distributional RL algorithms, which typically only learn a discrete return distribution, DSAC directly learns a continuous return distribution by truncating the difference between the target and current distributions to prevent gradient explosion. Additionally, we propose a new Parallel Asynchronous Buffer-Actor-Learner architecture (PABAL), a generalization of current high-throughput learning architectures, to improve learning efficiency. We evaluate our method on the suite of MuJoCo continuous control tasks, achieving state-of-the-art performance.
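
The sketch below shows one way to realize the truncation idea mentioned in the abstract: a critic outputs a continuous (here Gaussian) return distribution, and the difference between the bootstrapped target and the current mean is clipped before the log-likelihood update so that gradients stay bounded. The clipping bound, network sizes, and placeholder target are assumptions, not the released DSAC code.

```python
# Minimal sketch of a continuous return-distribution critic with a truncated target.
import torch
import torch.nn as nn

class DistributionalCritic(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),                      # outputs mean and log-std of the return
        )

    def forward(self, s, a):
        mean, log_std = self.net(torch.cat([s, a], dim=-1)).chunk(2, dim=-1)
        return mean, log_std.clamp(-5, 2).exp()

critic = DistributionalCritic(state_dim=8, action_dim=2)
s, a = torch.randn(32, 8), torch.randn(32, 2)
target_return = torch.randn(32, 1) * 10.0              # placeholder bootstrapped target

mean, std = critic(s, a)
bound = 10.0                                            # assumed clipping bound
clipped_target = mean.detach() + (target_return - mean.detach()).clamp(-bound, bound)
loss = -torch.distributions.Normal(mean, std).log_prob(clipped_target).mean()
loss.backward()
```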

Improving Generalization of Reinforcement Learning with Minimax Distributional Soft Actor-Critic

Feb 13, 2020
Yangang Ren, Jingliang Duan, Yang Guan, Shengbo Eben Li

Reinforcement learning (RL) has achieved remarkable performance in a variety of sequential decision-making and control tasks. However, a common problem is that the learned nearly-optimal policy tends to overfit the training environment and may not extend to situations never encountered during training. In practical applications, the randomness of the environment can produce rare but devastating events, which should be a central concern of safety-critical systems such as autonomous driving. In this paper, we introduce a minimax formulation and a distributional framework to improve the generalization ability of RL algorithms, and develop the Minimax Distributional Soft Actor-Critic (Minimax DSAC) algorithm. The minimax formulation seeks the optimal policy under the most serious disturbances from the environment: the protagonist policy maximizes the action-value function while the adversary policy tries to minimize it. The distributional framework learns a state-action return distribution, from which the risk of different returns can be modeled explicitly, yielding a risk-averse protagonist policy and a risk-seeking adversarial policy. We apply our method to decision-making tasks for autonomous vehicles at intersections and test the trained policy in environments distinct from the training environment. Results demonstrate that our method greatly improves the generalization ability of the protagonist agent to different environmental variations.
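
A minimal sketch of the minimax formulation follows: a protagonist policy ascends a shared action-value estimate while an adversary policy, which injects disturbances, descends it. The critic, network sizes, and single alternating gradient steps are illustrative assumptions, not the paper's training loop.

```python
# Minimal sketch of alternating protagonist/adversary updates on a shared Q estimate.
import torch
import torch.nn as nn

state_dim, act_dim, dist_dim = 8, 2, 2                 # hypothetical dimensions
protagonist = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
adversary = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, dist_dim))
critic = nn.Sequential(nn.Linear(state_dim + act_dim + dist_dim, 64), nn.Tanh(), nn.Linear(64, 1))

opt_p = torch.optim.Adam(protagonist.parameters(), lr=3e-4)
opt_a = torch.optim.Adam(adversary.parameters(), lr=3e-4)

s = torch.randn(32, state_dim)                         # batch of placeholder states

# Protagonist step: maximize Q(s, a, d) over its own action.
q = critic(torch.cat([s, protagonist(s), adversary(s).detach()], dim=-1))
opt_p.zero_grad(); (-q.mean()).backward(); opt_p.step()

# Adversary step: minimize the same Q over the disturbance.
q = critic(torch.cat([s, protagonist(s).detach(), adversary(s)], dim=-1))
opt_a.zero_grad(); q.mean().backward(); opt_a.step()
```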

Interpretable End-to-end Urban Autonomous Driving with Latent Deep Reinforcement Learning

Jan 23, 2020
Jianyu Chen, Shengbo Eben Li, Masayoshi Tomizuka

Unlike the popular modularized framework, end-to-end autonomous driving seeks to solve the perception, decision, and control problems in an integrated way, which can be more adaptable to new scenarios and easier to generalize at scale. However, existing end-to-end approaches often lack interpretability and can only handle simple driving tasks such as lane keeping. In this paper, we propose an interpretable deep reinforcement learning method for end-to-end autonomous driving that can handle complex urban scenarios. A sequential latent environment model is introduced and learned jointly with the reinforcement learning process. With this latent model, a semantic bird's-eye mask can be generated, which is enforced to correspond to an intermediate property of today's modularized framework in order to explain the behavior of the learned policy. The latent space also significantly reduces the sample complexity of reinforcement learning. Comparative tests with a simulated autonomous car in CARLA show that the performance of our method in urban scenarios with crowded surrounding vehicles dominates many baselines, including DQN, DDPG, TD3 and SAC. Moreover, through the masked outputs, the learned policy can better explain how the car reasons about the driving environment.
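
The sketch below mirrors the structure described above under simplifying assumptions: a sequential latent model (here a GRU) encodes the observation history into a latent state, the policy acts on that latent, and a decoder renders a semantic bird's-eye mask from the same latent for interpretation. The architecture and dimensions are hypothetical, not the paper's model.

```python
# Minimal sketch of a latent-state driving agent with an interpretable mask decoder.
import torch
import torch.nn as nn

class LatentDrivingAgent(nn.Module):
    def __init__(self, obs_dim=64, latent_dim=32, act_dim=2, mask_hw=16):
        super().__init__()
        self.encoder = nn.GRU(obs_dim, latent_dim, batch_first=True)  # sequential latent model
        self.policy = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
        self.mask_decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                          nn.Linear(128, mask_hw * mask_hw))
        self.mask_hw = mask_hw

    def forward(self, obs_seq):
        _, h = self.encoder(obs_seq)           # h: final latent state, shape (1, batch, latent_dim)
        z = h.squeeze(0)
        action = self.policy(z)                # control output from the latent state
        mask = self.mask_decoder(z).view(-1, self.mask_hw, self.mask_hw).sigmoid()
        return action, mask

agent = LatentDrivingAgent()
obs_seq = torch.randn(4, 10, 64)               # batch of 4 placeholder observation sequences
action, birdeye_mask = agent(obs_seq)
```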

Addressing Value Estimation Errors in Reinforcement Learning with a State-Action Return Distribution Function

Jan 09, 2020
Jingliang Duan, Yang Guan, Yangang Ren, Shengbo Eben Li, Bo Cheng

In current reinforcement learning (RL) methods, function approximation errors are known to lead to overestimated or underestimated state-action values Q, which in turn lead to suboptimal policies. We show that learning a state-action return distribution function can improve the estimation accuracy of the Q-value. We combine the distributional return function with the maximum entropy RL framework to develop what we call the Distributional Soft Actor-Critic (DSAC) algorithm, an off-policy method for the continuous control setting. Unlike traditional distributional Q algorithms, which typically only learn a discrete return distribution, DSAC can directly learn a continuous return distribution by truncating the difference between the target and current return distributions to prevent gradient explosion. Additionally, we propose a new Parallel Asynchronous Buffer-Actor-Learner architecture (PABAL) to improve learning efficiency. We evaluate our method on the suite of MuJoCo continuous control tasks, achieving state-of-the-art performance.
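
The buffer-actor-learner separation behind the PABAL architecture mentioned above can be sketched with plain Python threads, as below: actor workers push transitions into a shared replay buffer while learner workers sample batches from it asynchronously. The thread counts, locking scheme, and placeholder transitions are assumptions; this is only a schematic of the data flow, not the paper's implementation.

```python
# Minimal sketch of asynchronous buffer/actor/learner workers.
import random
import threading

buffer = []                                            # shared replay buffer
lock = threading.Lock()

def actor(worker_id, steps=100):
    """Stand-in for an actor worker interacting with its own environment copy."""
    for t in range(steps):
        transition = (worker_id, t, random.random())   # placeholder (s, a, r, s') record
        with lock:
            buffer.append(transition)

def learner(updates=50, batch_size=8):
    """Stand-in for a learner worker sampling batches and computing updates."""
    done = 0
    while done < updates:
        with lock:
            if len(buffer) < batch_size:
                continue
            batch = random.sample(buffer, batch_size)
        # ...a gradient update on `batch` would go here...
        done += 1

threads = [threading.Thread(target=actor, args=(i,)) for i in range(4)]
threads += [threading.Thread(target=learner) for _ in range(2)]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(f"collected {len(buffer)} transitions")
```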

Direct and indirect reinforcement learning

Dec 23, 2019
Yang Guan, Shengbo Eben Li, Jingliang Duan, Jie Li, Yangang Ren, Bo Cheng

Reinforcement learning (RL) algorithms have been successfully applied to a range of challenging sequential decision-making and control tasks. In this paper, we classify RL into direct and indirect methods according to how they seek the optimal policy of the Markov decision process (MDP). Direct methods find the optimal policy by directly maximizing an objective function, usually the expectation of accumulated future rewards, with gradient-based optimization. Indirect methods find the optimal policy by solving the Bellman equation, the necessary and sufficient condition given by Bellman's principle of optimality. We take vanilla policy gradient and approximate policy iteration to study their internal relationship, and reveal that both direct and indirect methods can be unified within the actor-critic architecture and are equivalent if the stationary state distribution of the current policy is always chosen as the initial state distribution of the MDP. Finally, we classify current mainstream RL algorithms and compare this split with other criteria, including value-based versus policy-based and model-based versus model-free.
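
As a concrete contrast, the sketch below runs the indirect route on a tiny randomly generated tabular MDP: policy evaluation solves the Bellman equation for the current policy, and policy improvement acts greedily on the result. A direct method would instead ascend the expected-return objective with policy gradients. The MDP itself is made up for illustration.

```python
# Minimal sketch of indirect RL (policy iteration) on a toy tabular MDP.
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
P = np.random.default_rng(0).dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = np.random.default_rng(1).uniform(0, 1, size=(n_states, n_actions))

policy = np.zeros(n_states, dtype=int)
for _ in range(20):
    # Policy evaluation (PEV): solve v = r_pi + gamma * P_pi v for the current policy.
    P_pi = P[np.arange(n_states), policy]
    r_pi = R[np.arange(n_states), policy]
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    # Policy improvement (PIM): act greedily with respect to the evaluated values.
    q = R + gamma * P @ v
    policy = q.argmax(axis=1)
print("indirect (policy iteration) result:", policy, v.round(3))
```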

Centralized Conflict-free Cooperation for Connected and Automated Vehicles at Intersections by Proximal Policy Optimization

Dec 18, 2019
Yang Guan, Yangang Ren, Shengbo Eben Li, Qi Sun, Laiquan Luo, Koji Taguchi, Keqiang Li

Connected vehicles will change how future transportation is managed and organized, especially at intersections. Coordination methods at unsignalized intersections fall mainly into two categories: centralized and distributed. Centralized methods require large amounts of computation, since a single controller optimizes the trajectories of all approaching vehicles. In distributed methods, each approaching vehicle has its own controller that optimizes its trajectory using the motion information and conflict relationships of its neighboring vehicles, which avoids heavy computation but requires sophisticated manual design. In this paper, we propose a centralized conflict-free cooperation method for multiple connected vehicles at unsignalized intersections using reinforcement learning (RL), which naturally addresses the computation burden through offline training. We first incorporate a prior model into the proximal policy optimization (PPO) algorithm to accelerate the learning process. We then present the design of the state, action, and reward that formulates centralized cooperation as an RL problem. Finally, we train a coordination policy with our model-accelerated PPO (MA-PPO) in a simulated setting and analyze the results. The results show that the proposed method improves the traffic efficiency of the intersection while ensuring that no collisions occur.
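
The clipped surrogate at the core of PPO, which MA-PPO presumably also optimizes, can be sketched as below; in the model-accelerated variant, the same update would additionally consume rollouts generated by the prior model. The batch contents, advantages, and epsilon value are placeholders.

```python
# Minimal sketch of the PPO clipped surrogate objective.
import torch

def ppo_clip_loss(log_prob_new, log_prob_old, advantage, eps=0.2):
    ratio = (log_prob_new - log_prob_old).exp()        # importance ratio pi_new / pi_old
    unclipped = ratio * advantage
    clipped = ratio.clamp(1 - eps, 1 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()       # negate to maximize the surrogate

# Hypothetical batch: log-probabilities under the new and old policies, plus advantages.
log_prob_new = torch.randn(64, requires_grad=True)
log_prob_old = torch.randn(64)
advantage = torch.randn(64)
loss = ppo_clip_loss(log_prob_new, log_prob_old, advantage)
loss.backward()
```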

Deep adaptive dynamic programming for nonaffine nonlinear optimal control problem with state constraints

Nov 26, 2019
Jingliang Duan, Zhengyu Liu, Shengbo Eben Li, Qi Sun, Zhenzhong Jia, Bo Cheng

This paper presents a constrained deep adaptive dynamic programming (CDADP) algorithm for solving general nonlinear optimal control problems with known dynamics. Unlike previous ADP algorithms, it can directly handle problems with state constraints. Both the policy and the value function are approximated by deep neural networks (NNs), which map the system state directly to the action and the value, respectively, without hand-crafted basis functions. The proposed algorithm handles state constraints by transforming the policy improvement process into a constrained optimization problem; a trust-region constraint is also added to prevent excessive policy updates. We first linearize this constrained optimization problem locally into a quadratically-constrained quadratic program, and then obtain the optimal update of the policy network parameters by solving its dual problem. We also propose a series of recovery rules to update the policy when the primal problem is infeasible. In addition, parallel learners are employed to explore different state spaces, which stabilizes and accelerates learning. A vehicle control problem in a path-tracking task is used to demonstrate the effectiveness of the proposed method.
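
The local subproblem described above can be sketched as a small quadratically constrained program: a linearized objective is maximized subject to a trust-region bound and a linearized state constraint. The numbers below are made up, and a general-purpose solver is used here in place of the paper's dual-problem derivation.

```python
# Minimal sketch of a trust-region, state-constrained policy-update subproblem.
import numpy as np
from scipy.optimize import minimize

g = np.array([1.0, -0.5])     # gradient of the linearized objective w.r.t. policy parameters
b = np.array([0.3, 0.8])      # gradient of the linearized state-constraint violation
c0 = -0.1                     # current constraint margin (negative means feasible slack)
delta = 0.05                  # trust-region radius on the squared step size

objective = lambda x: -g @ x                              # ascend the linearized objective
constraints = [
    {"type": "ineq", "fun": lambda x: delta - x @ x},     # trust region: ||x||^2 <= delta
    {"type": "ineq", "fun": lambda x: -(c0 + b @ x)},     # linearized state constraint <= 0
]
step = minimize(objective, x0=np.zeros(2), constraints=constraints).x
print("policy parameter update:", step)
```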

Generalized Policy Iteration for Optimal Control in Continuous Time

Sep 11, 2019
Jingliang Duan, Shengbo Eben Li, Zhengyu Liu, Monimoy Bujarbaruah, Bo Cheng

This paper proposes the Deep Generalized Policy Iteration (DGPI) algorithm to find the infinite-horizon optimal control policy for general nonlinear continuous-time systems with known dynamics. Unlike existing adaptive dynamic programming algorithms for continuous-time systems, DGPI requires neither an admissible initial policy nor an input-affine system structure for convergence. Our algorithm employs the actor-critic architecture to approximate both the policy and the value function in order to iteratively solve the Hamilton-Jacobi-Bellman (HJB) equation; both are approximated by deep neural networks. Given an arbitrary initial policy, the proposed DGPI algorithm eventually converges to an admissible, and subsequently an optimal, policy for an arbitrary nonlinear system. We also relax the update termination conditions of both the policy evaluation and policy improvement processes, which leads to faster convergence than conventional policy iteration (PI) methods for the same function approximator architecture. We further prove the convergence and optimality of the algorithm with a thorough Lyapunov analysis, and demonstrate its generality and efficacy with two detailed numerical examples.
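
The sketch below illustrates continuous-time generalized policy iteration on an assumed toy system: the critic is trained to drive the Hamiltonian (HJB residual) r(x,u) + (dV/dx) f(x,u) toward zero under the current policy, the actor is trained to minimize the same Hamiltonian, and only a few inner steps of each are taken before alternating, reflecting the relaxed termination conditions. The dynamics, cost, and step counts are assumptions, not the paper's examples.

```python
# Minimal sketch of continuous-time generalized policy iteration with relaxed inner loops.
import torch
import torch.nn as nn

f = lambda x, u: -x + u                         # assumed known dynamics  dx/dt = f(x, u)
r = lambda x, u: x**2 + u**2                    # assumed running cost

value = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
policy = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
opt_v = torch.optim.Adam(value.parameters(), lr=1e-3)
opt_p = torch.optim.Adam(policy.parameters(), lr=1e-3)

def hamiltonian(x, u):
    x = x.requires_grad_(True)
    v = value(x)
    dv_dx = torch.autograd.grad(v.sum(), x, create_graph=True)[0]
    return r(x, u) + dv_dx * f(x, u)            # HJB residual under the current policy

for _ in range(200):                            # alternate PEV and PIM with few inner steps
    x = torch.rand(64, 1) * 4 - 2               # sample states from a region of interest
    for _ in range(3):                          # relaxed policy evaluation
        loss_v = hamiltonian(x, policy(x).detach()).pow(2).mean()
        opt_v.zero_grad(); loss_v.backward(); opt_v.step()
    for _ in range(3):                          # relaxed policy improvement
        loss_p = hamiltonian(x, policy(x)).mean()
        opt_p.zero_grad(); loss_p.backward(); opt_p.step()
```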
