As reinforcement learning (RL) has achieved near human-level performance in a variety of tasks, its robustness has raised great attention. While a vast body of research has explored test-time (evasion) attacks in RL and corresponding defenses, its robustness against training-time (poisoning) attacks remains largely unanswered. In this work, we focus on certifying the robustness of offline RL in the presence of poisoning attacks, where a subset of training trajectories could be arbitrarily manipulated. We propose the first certification framework, COPA, to certify the number of poisoning trajectories that can be tolerated regarding different certification criteria. Given the complex structure of RL, we propose two certification criteria: per-state action stability and cumulative reward bound. To further improve the certification, we propose new partition and aggregation protocols to train robust policies. We further prove that some of the proposed certification methods are theoretically tight and some are NP-Complete problems. We leverage COPA to certify three RL environments trained with different algorithms and conclude: (1) The proposed robust aggregation protocols such as temporal aggregation can significantly improve the certifications; (2) Our certification for both per-state action stability and cumulative reward bound are efficient and tight; (3) The certification for different training algorithms and environments are different, implying their intrinsic robustness properties. All experimental results are available at https://copa-leaderboard.github.io.
Robust Reinforcement Learning (RL) focuses on improving performances under model errors or adversarial attacks, which facilitates the real-life deployment of RL agents. Robust Adversarial Reinforcement Learning (RARL) is one of the most popular frameworks for robust RL. However, most of the existing literature models RARL as a zero-sum simultaneous game with Nash equilibrium as the solution concept, which could overlook the sequential nature of RL deployments, produce overly conservative agents, and induce training instability. In this paper, we introduce a novel hierarchical formulation of robust RL - a general-sum Stackelberg game model called RRL-Stack - to formalize the sequential nature and provide extra flexibility for robust training. We develop the Stackelberg Policy Gradient algorithm to solve RRL-Stack, leveraging the Stackelberg learning dynamics by considering the adversary's response. Our method generates challenging yet solvable adversarial environments which benefit RL agents' robust learning. Our algorithm demonstrates better training stability and robustness against different testing conditions in the single-agent robotics control and multi-agent highway merging tasks.
Autonomous driving systems have witnessed a great development during the past years thanks to the advance in sensing and decision-making. One critical obstacle for their massive deployment in the real world is the evaluation of safety. Most existing driving systems are still trained and evaluated on naturalistic scenarios that account for the vast majority of daily life or heuristically-generated adversarial ones. However, the large population of cars requires an extremely low collision rate, indicating safety-critical scenarios collected in the real world would be rare. Thus, methods to artificially generate artificial scenarios becomes critical to manage the risk and reduce the cost. In this survey, we focus on the algorithms of safety-critical scenario generation. We firstly provide a comprehensive taxonomy of existing algorithms by dividing them into three categories: data-driven generation, adversarial generation, and knowledge-based generation. Then, we discuss useful tools for scenario generation, including simulation platforms and packages. Finally, we extend our discussion to five main challenges of current works -- fidelity, efficiency, diversity, transferability, controllability -- and the research opportunities lighted up by these challenges.
Full-field traffic state information (i.e., flow, speed, and density) is critical for the successful operation of Intelligent Transportation Systems (ITS) on freeways. However, incomplete traffic information tends to be directly collected from traffic detectors that are insufficiently installed in most areas, which is a major obstacle to the popularization of ITS. To tackle this issue, this paper introduces an innovative traffic state estimation (TSE) framework that hybrid regression machine learning techniques (e.g., artificial neural network (ANN), random forest (RF), and support vector machine (SVM)) with a traffic physics model (e.g., second-order macroscopic traffic flow model) using limited information from traffic sensors as inputs to construct accurate and full-field estimated traffic state for freeway systems. To examine the effectiveness of the proposed TSE framework, this paper conducted empirical studies on a real-world data set collected from a stretch of I-15 freeway in Salt Lake City, Utah. Experimental results show that the proposed method has been proved to estimate full-field traffic information accurately. Hence, the proposed method could provide accurate and full-field traffic information, thus providing the basis for the popularization of ITS.
Safe reinforcement learning (RL) aims to learn policies that satisfy certain constraints before deploying to safety-critical applications. Primal-dual as a prevalent constrained optimization framework suffers from instability issues and lacks optimality guarantees. This paper overcomes the issues from a novel probabilistic inference perspective and proposes an Expectation-Maximization style approach to learn safe policy. We show that the safe RL problem can be decomposed to 1) a convex optimization phase with a non-parametric variational distribution and 2) a supervised learning phase. We show the unique advantages of constrained variational policy optimization by proving its optimality and policy improvement stability. A wide range of experiments on continuous robotic tasks show that the proposed method achieves significantly better performance in terms of constraint satisfaction and sample efficiency than primal-dual baselines.
In this paper, we focus on a new method of data augmentation to solve the data imbalance problem within imbalanced ECG datasets to improve the robustness and accuracy of heart disease detection. By using Optimal Transport, we augment the ECG disease data from normal ECG beats to balance the data among different categories. We build a Multi-Feature Transformer (MF-Transformer) as our classification model, where different features are extracted from both time and frequency domains to diagnose various heart conditions. Learning from 12-lead ECG signals, our model is able to distinguish five categories of cardiac conditions. Our results demonstrate 1) the classification models' ability to make competitive predictions on five ECG categories; 2) improvements in accuracy and robustness reflecting the effectiveness of our data augmentation method.
This letter studies an intelligent reflecting surface (IRS)-assisted integrated sensing and communication (ISAC) system, in which one IRS is deployed to not only assist the wireless communication from a multi-antenna base station (BS) to a single-antenna communication user (CU), but also create virtual line-of-sight (LoS) links for sensing targets at areas with LoS links blocked. We consider that the BS transmits combined information and sensing signals for ISAC. Under this setup, we jointly optimize the transmit information and sensing beamforming at the BS and the reflective beamforming at the IRS, to maximize the IRS's minimum beampattern gain towards the desired sensing angles, subject to the minimum signal-to-noise ratio (SNR) requirement at the CU and the maximum transmit power constraint at the BS. Although the formulated SNR-constrained beampattern gain maximization problem is non-convex and difficult to solve, we present an efficient algorithm to obtain a high-quality solution using alternating optimization and semi-definite relaxation (SDR). Numerical results show that the proposed algorithm achieves improved sensing performance while ensuring the communication requirement.
Rare-event simulation techniques, such as importance sampling (IS), constitute powerful tools to speed up challenging estimation of rare catastrophic events. These techniques often leverage the knowledge and analysis on underlying system structures to endow desirable efficiency guarantees. However, black-box problems, especially those arising from recent safety-critical applications of AI-driven physical systems, can fundamentally undermine their efficiency guarantees and lead to dangerous under-estimation without diagnostically detected. We propose a framework called Deep Probabilistic Accelerated Evaluation (Deep-PrAE) to design statistically guaranteed IS, by converting black-box samplers that are versatile but could lack guarantees, into one with what we call a relaxed efficiency certificate that allows accurate estimation of bounds on the rare-event probability. We present the theory of Deep-PrAE that combines the dominating point concept with rare-event set learning via deep neural network classifiers, and demonstrate its effectiveness in numerical examples including the safety-testing of intelligent driving algorithms.
Goal-directed generation, aiming for solving downstream tasks by generating diverse data, has a potentially wide range of applications in the real world. Previous works tend to formulate goal-directed generation as a purely data-driven problem, which directly searches or approximates the distribution of samples satisfying the goal. However, the generation ability of preexisting work is heavily restricted by inefficient sampling, especially for sparse goals that rarely show up in off-the-shelf datasets. For instance, generating safety-critical traffic scenes with the goal of increasing the risk of collision is critical to evaluate autonomous vehicles, but the rareness of such scenes is the biggest resistance. In this paper, we integrate causality as a prior into the safety-critical scene generation process and propose a flow-based generative framework - Causal Autoregressive Flow (CausalAF). CausalAF encourages the generative model to uncover and follow the causal relationship among generated objects via novel causal masking operations instead of searching the sample only from observational data. By learning the cause-and-effect mechanism of how the generated scene achieves the goal rather than just learning correlations from data, CausalAF significantly improves the learning efficiency. Extensive experiments on three heterogeneous traffic scenes illustrate that CausalAF requires much fewer optimization resources to effectively generate goal-directed scenes for safety evaluation tasks.
The evaluation of rare but high-stakes events remains one of the main difficulties in obtaining reliable policies from intelligent agents, especially in large or continuous state/action spaces where limited scalability enforces the use of a prohibitively large number of testing iterations. On the other hand, a biased or inaccurate policy evaluation in a safety-critical system could potentially cause unexpected catastrophic failures during deployment. In this paper, we propose the Accelerated Policy Evaluation (APE) method, which simultaneously uncovers rare events and estimates the rare event probability in Markov decision processes. The APE method treats the environment nature as an adversarial agent and learns towards, through adaptive importance sampling, the zero-variance sampling distribution for the policy evaluation. Moreover, APE is scalable to large discrete or continuous spaces by incorporating function approximators. We investigate the convergence properties of proposed algorithms under suitable regularity conditions. Our empirical studies show that APE estimates rare event probability with a smaller variance while only using orders of magnitude fewer samples compared to baseline methods in both multi-agent and single-agent environments.