With the breakthrough of AlphaGo, deep reinforcement learning becomes a recognized technique for solving sequential decision-making problems. Despite its reputation, data inefficiency caused by its trial and error learning mechanism makes deep reinforcement learning hard to be practical in a wide range of areas. Plenty of methods have been developed for sample efficient deep reinforcement learning, such as environment modeling, experience transfer, and distributed modifications, amongst which, distributed deep reinforcement learning has shown its potential in various applications, such as human-computer gaming, and intelligent transportation. In this paper, we conclude the state of this exciting field, by comparing the classical distributed deep reinforcement learning methods, and studying important components to achieve efficient distributed learning, covering single player single agent distributed deep reinforcement learning to the most complex multiple players multiple agents distributed deep reinforcement learning. Furthermore, we review recently released toolboxes that help to realize distributed deep reinforcement learning without many modifications of their non-distributed versions. By analyzing their strengths and weaknesses, a multi-player multi-agent distributed deep reinforcement learning toolbox is developed and released, which is further validated on Wargame, a complex environment, showing usability of the proposed toolbox for multiple players and multiple agents distributed deep reinforcement learning under complex games. Finally, we try to point out challenges and future trends, hoping this brief review can provide a guide or a spark for researchers who are interested in distributed deep reinforcement learning.
The unaffordable computation load of nonlinear model predictive control (NMPC) has prevented it for being used in robots with high sampling rates for decades. This paper is concerned with the policy learning problem for nonlinear MPC with system constraints, and its applications to unmanned surface vehicles (USVs), where the nonlinear MPC policy is learned offline and deployed online to resolve the computational complexity issue. A deep neural networks (DNN) based policy learning MPC (PL-MPC) method is proposed to avoid solving nonlinear optimal control problems online. The detailed policy learning method is developed and the PL-MPC algorithm is designed. The strategy to ensure the practical feasibility of policy implementation is proposed, and it is theoretically proved that the closed-loop system under the proposed method is asymptotically stable in probability. In addition, we apply the PL-MPC algorithm successfully to the motion control of USVs. It is shown that the proposed algorithm can be implemented at a sampling rate up to $5 Hz$ with high-precision motion control. The experiment video is available via:\url{https://v.youku.com/v_show/id_XNTkwMTM0NzM5Ng==.html
Humans can balance very well during walking, even when perturbed. But it seems difficult to achieve robust walking for bipedal robots. Here we describe the simplest balance controller that leads to robust walking for a linear inverted pendulum (LIP) model. The main idea is to use a linear function of the body velocity to determine the next foot placement, which we call linear foot placement control (LFPC). By using the Poincar\'e map, a balance criterion is derived, which shows that LFPC is stable when the velocity-feedback coefficient is located in a certain range. And that range is much bigger when stepping faster, which indicates "faster stepping, easier to balance". We show that various gaits can be generated by adjusting the controller parameters in LFPC. Particularly, a dead-beat controller is discovered that can lead to steady-state walking in just one step. The effectiveness of LFPC is verified through Matlab simulation as well as V-REP simulation for both 2D and 3D walking. The main feature of LFPC is its simplicity and inherent robustness, which may help us understand the essence of how to maintain balance in dynamic walking.
This paper presents an efficient and safe method to avoid static and dynamic obstacles based on LiDAR. First, point cloud is used to generate a real-time local grid map for obstacle detection. Then, obstacles are clustered by DBSCAN algorithm and enclosed with minimum bounding ellipses (MBEs). In addition, data association is conducted to match each MBE with the obstacle in the current frame. Considering MBE as an observation, Kalman filter (KF) is used to estimate and predict the motion state of the obstacle. In this way, the trajectory of each obstacle in the forward time domain can be parameterized as a set of ellipses. Due to the uncertainty of the MBE, the semi-major and semi-minor axes of the parameterized ellipse are extended to ensure safety. We extend the traditional Control Barrier Function (CBF) and propose Dynamic Control Barrier Function (D-CBF). We combine D-CBF with Model Predictive Control (MPC) to implement safety-critical dynamic obstacle avoidance. Experiments in simulated and real scenarios are conducted to verify the effectiveness of our algorithm. The source code is released for the reference of the community.
Offline reinforcement learning methods hold the promise of learning policies from pre-collected datasets without the need to query the environment for new transitions. This setting is particularly well-suited for continuous control robotic applications for which online data collection based on trial-and-error is costly and potentially unsafe. In practice, offline datasets are often heterogeneous, i.e., collected in a variety of scenarios, such as data from several human demonstrators or from policies that act with different purposes. Unfortunately, such datasets can exacerbate the distribution shift between the behavior policy underlying the data and the optimal policy to be learned, leading to poor performance. To address this challenge, we propose to leverage latent-variable policies that can represent a broader class of policy distributions, leading to better adherence to the training data distribution while maximizing reward via a policy over the latent variable. As we empirically show on a range of simulated locomotion, navigation, and manipulation tasks, our method referred to as latent-variable advantage-weighted policy optimization (LAPO), improves the average performance of the next best-performing offline reinforcement learning methods by 49% on heterogeneous datasets, and by 8% on datasets with narrow and biased distributions.
Autonomous navigation of ground robots has been widely used in indoor structured 2D environments, but there are still many challenges in outdoor 3D unstructured environments, especially in rough, uneven terrains. This paper proposed a plane-fitting based uneven terrain navigation framework (PUTN) to solve this problem. The implementation of PUTN is divided into three steps. First, based on Rapidly-exploring Random Trees (RRT), an improved sample-based algorithm called Plane Fitting RRT* (PF-RRT*) is proposed to obtain a sparse trajectory. Each sampling point corresponds to a custom traversability index and a fitted plane on the point cloud. These planes are connected in series to form a traversable strip. Second, Gaussian Process Regression is used to generate traversability of the dense trajectory interpolated from the sparse trajectory, and the sampling tree is used as the training set. Finally, local planning is performed using nonlinear model predictive control (NMPC). By adding the traversability index and uncertainty to the cost function, and adding obstacles generated by the real-time point cloud to the constraint function, a safe motion planning algorithm with smooth speed and strong robustness is available. Experiments in real scenarios are conducted to verify the effectiveness of the method.
A quadrupedal guidance robot that can guide people and avoid various obstacles, could potentially be owned by more visually impaired people at a fairly low cost. In this paper, we propose a novel guidance robot system with a comfort-based concept. We design a leash containing an elastic rope and a thin string, and use a motor to adjust the length of the string to ensure comfort. We use the force-based human motion model to plan the forces experienced by the human. Afterward, the direction and magnitude of the force are controlled by the motion of the robot, and the rotation of the motor, respectively. This allows humans to be guided safely and more comfortably to the target position in complex environments. The system has been deployed on Unitree Laikago quadrupedal platform and validated in real-world scenarios.
The integration of Reinforcement Learning (RL) and Evolutionary Algorithms (EAs) aims at simultaneously exploiting the sample efficiency as well as the diversity and robustness of the two paradigms. Recently, hybrid learning frameworks based on this principle have achieved great success in various challenging robot control tasks. However, in these methods, policies from the genetic population are evaluated via interactions with the real environments, limiting their applicability in computationally expensive problems. In this work, we propose Surrogate-assisted Controller (SC), a novel and efficient module that can be integrated into existing frameworks to alleviate the computational burden of EAs by partially replacing the expensive policy evaluation. The key challenge in applying this module is to prevent the optimization process from being misled by the possible false minima introduced by the surrogate. To address this issue, we present two strategies for SC to control the workflow of hybrid frameworks. Experiments on six continuous control tasks from the OpenAI Gym platform show that SC can not only significantly reduce the cost of fitness evaluations, but also boost the performance of the original hybrid frameworks with collaborative learning and evolutionary processes.
Imitation Learning (IL) is an effective learning paradigm exploiting the interactions between agents and environments. It does not require explicit reward signals and instead tries to recover desired policies using expert demonstrations. In general, IL methods can be categorized into Behavioral Cloning (BC) and Inverse Reinforcement Learning (IRL). In this work, a novel reward function based on probability density estimation is proposed for IRL, which can significantly reduce the complexity of existing IRL methods. Furthermore, we prove that the theoretically optimal policy derived from our reward function is identical to the expert policy as long as it is deterministic. Consequently, an IRL problem can be gracefully transformed into a probability density estimation problem. Based on the proposed reward function, we present a "watch-try-learn" style framework named Probability Density Estimation based Imitation Learning (PDEIL), which can work in both discrete and continuous action spaces. Finally, comprehensive experiments in the Gym environment show that PDEIL is much more efficient than existing algorithms in recovering rewards close to the ground truth.
With breakthrough of AlphaGo, AI in human-computer game has become a very hot topic attracting researchers all around the world, which usually serves as an effective standard for testing artificial intelligence. Various game AI systems (AIs) have been developed such as Libratus, OpenAI Five and AlphaStar, beating professional human players. In this paper, we survey recent successful game AIs, covering board game AIs, card game AIs, first-person shooting game AIs and real time strategy game AIs. Through this survey, we 1) compare the main difficulties among different kinds of games for the intelligent decision making field ; 2) illustrate the mainstream frameworks and techniques for developing professional level AIs; 3) raise the challenges or drawbacks in the current AIs for intelligent decision making; and 4) try to propose future trends in the games and intelligent decision making techniques. Finally, we hope this brief review can provide an introduction for beginners, inspire insights for researchers in the filed of AI in games.