This paper studies a two-player game with a quantitative surveillance requirement on an adversarial target moving in a discrete state space and a secondary objective to maximize short-term visibility of the environment. We impose the surveillance requirement as a temporal logic constraint. We then use a greedy approach to determine vantage points that optimize a notion of information gain, namely, the number of newly-seen states. By using a convolutional neural network trained on a class of environments, we can efficiently approximate the information gain at each potential vantage point. Each subsequent vantage point is chosen such that moving to that location will not jeopardize the surveillance requirement, regardless of any future action chosen by the target. Our method combines guarantees of correctness from formal methods with the scalability of machine learning to provide an efficient approach for surveillance-constrained visibility optimization.
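As an illustration of the greedy step, here is a minimal Python sketch, assuming a grid world and a simple radius-based visibility oracle; the `visible_from` function and the candidate set are hypothetical stand-ins, since the paper instead approximates the gain with a trained convolutional neural network and additionally filters candidates by the surveillance constraint.

```python
import numpy as np

def visible_from(grid, pos, radius=3):
    """Hypothetical visibility oracle: free cells within `radius` of pos.
    The paper approximates the resulting gain with a trained CNN instead."""
    r, c = pos
    rows, cols = grid.shape
    seen = set()
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols and grid[rr, cc] == 0:
                seen.add((rr, cc))
    return seen

def greedy_vantage_points(grid, candidates, k):
    """Greedily pick k vantage points maximizing the number of newly-seen cells."""
    covered, chosen = set(), []
    for _ in range(k):
        # Information gain of a candidate = cells it sees that are not yet covered.
        gains = [(len(visible_from(grid, p) - covered), p) for p in candidates]
        best_gain, best = max(gains)
        if best_gain == 0:
            break
        chosen.append(best)
        covered |= visible_from(grid, best)
    return chosen

grid = np.zeros((10, 10), dtype=int)  # 0 = free cell
candidates = [(1, 1), (1, 8), (8, 1), (8, 8), (5, 5)]
print(greedy_vantage_points(grid, candidates, k=3))
```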
A shield is attached to a system to guarantee safety by correcting the system's behavior at runtime. Existing methods that employ design-time synthesis of shields do not scale to multi-agent systems. Moreover, such shields are typically implemented in a centralized manner, requiring global information on the state of all agents in the system. We address these limitations through a new approach in which the shields are synthesized at runtime and do not require global information. There is a shield onboard every agent, which can only modify the behavior of the corresponding agent. In this approach, which is fundamentally decentralized, the shield on every agent has two components: a pathfinder that corrects the behavior of the agent and an ordering mechanism that dynamically modifies the priority of the agent. The current priority determines whether the shield uses the pathfinder to modify the agent's behavior. We derive an upper bound on the maximum deviation of any agent from its original behavior. We prove that the worst-case synthesis time is quadratic in the number of agents at runtime, as opposed to exponential at design-time for existing methods. We test the performance of the decentralized, runtime shield synthesis approach on a collision-avoidance problem. For 50 agents in a 50x50 grid, the synthesis at runtime requires a few seconds per agent whenever a potential collision is detected. In contrast, the centralized design-time synthesis of shields for a similar setting is intractable beyond 4 agents in a 5x5 grid.
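The per-agent shield logic can be pictured roughly as follows, assuming a grid world, a breadth-first pathfinder, and a numeric priority where larger means higher; the function names and the yielding rule are illustrative, not the paper's exact construction.

```python
from collections import deque

def bfs_path(grid, start, goal, blocked):
    """Pathfinder: shortest grid path from start to goal avoiding `blocked` cells."""
    rows, cols = len(grid), len(grid[0])
    queue, parent = deque([start]), {start: None}
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for rr, cc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nxt = (rr, cc)
            if (0 <= rr < rows and 0 <= cc < cols and grid[rr][cc] == 0
                    and nxt not in blocked and nxt not in parent):
                parent[nxt] = cell
                queue.append(nxt)
    return None

def shield_step(agent_id, priorities, planned_next, positions, grid, goal):
    """Onboard shield: on a potential collision, the lower-priority agent
    replans around the higher-priority agents' current cells."""
    others = {a: p for a, p in positions.items() if a != agent_id}
    conflict = planned_next in others.values()
    must_yield = conflict and any(
        priorities[a] > priorities[agent_id]
        for a, p in others.items() if p == planned_next)
    if not must_yield:
        return planned_next
    path = bfs_path(grid, positions[agent_id], goal, set(others.values()))
    return path[1] if path and len(path) > 1 else positions[agent_id]

grid = [[0] * 5 for _ in range(5)]
positions = {"a": (0, 0), "b": (0, 1)}
priorities = {"a": 1, "b": 2}   # agent b has higher priority
print(shield_step("a", priorities, (0, 1), positions, grid, goal=(0, 4)))
```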
Active perception strategies enable an agent to selectively gather information in a way that improves its performance. In applications in which the agent does not have prior knowledge about the available information sources, it is crucial to synthesize active perception strategies at runtime. We consider a setting in which, at runtime, an agent is capable of gathering information under a limited budget. We pose the problem in the context of partially observable Markov decision processes. We propose a generalized greedy strategy that selects a subset of information sources with near-optimality guarantees on uncertainty reduction. Our theoretical analysis establishes that the proposed active perception strategy achieves near-optimal performance in terms of expected cumulative reward. We demonstrate the resulting strategies in simulations on a robotic navigation problem.
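A schematic version of such a generalized greedy rule, assuming each source has a known cost and a marginal-gain function with diminishing returns; the toy `gain` below is a stand-in for the paper's POMDP-specific uncertainty-reduction criterion.

```python
def greedy_select(sources, gain, budget):
    """Generalized greedy: repeatedly pick the source with the best
    marginal gain per unit cost that still fits within the budget."""
    chosen, spent = [], 0.0
    remaining = dict(sources)  # name -> cost
    while remaining:
        best, best_ratio = None, 0.0
        for name, cost in remaining.items():
            if spent + cost > budget:
                continue
            ratio = gain(chosen, name) / cost
            if ratio > best_ratio:
                best, best_ratio = name, ratio
        if best is None:
            break
        chosen.append(best)
        spent += remaining.pop(best)
    return chosen

# Toy marginal gain with diminishing returns (submodular-like behavior).
values = {"camera": 4.0, "lidar": 3.0, "radio": 1.0}
costs = {"camera": 2.0, "lidar": 2.5, "radio": 0.5}
gain = lambda chosen, name: values[name] / (1 + len(chosen))
print(greedy_select(costs, gain, budget=3.0))
```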
Privacy is an important concern in various multi-agent systems in which the data collected from the agents are sensitive. We propose a differentially private controller synthesis approach for multi-agent systems subject to high-level specifications expressed in metric temporal logic (MTL). We consider a setting where each agent sends data to a cloud (computing station) through a set of local hubs, and the cloud is responsible for computing the control inputs of the agents. Specifically, each agent adds privacy noise (e.g., Gaussian noise) point-wise in time to its own outputs before sharing them with a local hub. Each local hub runs a Kalman filter to estimate the state of the corresponding agent and periodically sends such state estimates to the cloud. The cloud computes the optimal inputs for each agent subject to an MTL specification. The synthesized controller guarantees differential privacy for each agent while also providing a probabilistic guarantee of satisfying the MTL specification. We provide an implementation of the proposed method on a simulation case study with two Baxter-On-Wheels robots as the agents.
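The hub-side estimation pipeline can be sketched for scalar linear dynamics as follows; the privacy noise scale `sigma_priv` would in practice be calibrated to the desired differential privacy level, a derivation this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(0)
A, C = 0.95, 1.0          # scalar dynamics x' = A x, output y = C x
Q, R_sensor = 0.01, 0.04  # process and measurement noise variances
sigma_priv = 0.5          # privacy noise scale (set by the privacy budget)

x, x_hat, P = 1.0, 0.0, 1.0
for t in range(50):
    # Agent side: evolve the state and add privacy noise to the shared output.
    x = A * x + rng.normal(0, np.sqrt(Q))
    y = C * x + rng.normal(0, np.sqrt(R_sensor)) + rng.normal(0, sigma_priv)

    # Hub side: Kalman filter treating privacy noise as extra measurement noise.
    R = R_sensor + sigma_priv ** 2
    x_pred, P_pred = A * x_hat, A * P * A + Q
    K = P_pred * C / (C * P_pred * C + R)
    x_hat = x_pred + K * (y - C * x_pred)
    P = (1 - K * C) * P_pred

print(f"true state {x:.3f}, hub estimate {x_hat:.3f}")
```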
A variety of queries about stochastic systems boil down to the study of Markov chains and their properties. If the Markov chain is large, as is typically true for discretized continuous spaces, such analysis may be computationally intractable. Nevertheless, in many scenarios, Markov chains have underlying structural properties that allow them to admit a low-dimensional representation. For instance, the transition matrix associated with the model may be low-rank and hence representable in a lower-dimensional space. We consider the problem of learning low-dimensional representations for large-scale Markov chains. To that end, we formulate the task of representation learning as that of mapping the state space of the model to a low-dimensional state space, referred to as the kernel space. The kernel space contains a set of meta-states, each of which is intended to represent only a small subset of the original states. To promote this structural property, we constrain the number of nonzero entries of the mappings between the state space and the kernel space. By imposing the desired characteristics of the structured representation, we cast the problem as a task of nonnegative matrix factorization. To compute the solution, we propose an efficient block coordinate gradient descent algorithm and theoretically analyze its convergence properties. Our extensive simulation results demonstrate the efficacy of the proposed algorithm in terms of the quality of the low-dimensional representation as well as its computational cost.
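A toy version of the factorization, assuming projected gradient steps for the block coordinate updates and a hard cap of k nonzeros per row of the state-to-kernel mapping; the paper's update rules and convergence analysis are more involved.

```python
import numpy as np

def sparse_nmf(P, m, k, iters=200, lr=0.1):
    """Factor an n x n transition matrix P as W @ H with W: n x m, H: m x n,
    keeping at most k nonzeros per row of W (the state-to-kernel mapping)."""
    rng = np.random.default_rng(0)
    n = P.shape[0]
    W = rng.random((n, m))
    H = rng.random((m, n))
    for _ in range(iters):
        # Block coordinate gradient steps with nonnegativity projection.
        W = np.maximum(W - lr * (W @ H - P) @ H.T, 0.0)
        H = np.maximum(H - lr * W.T @ (W @ H - P), 0.0)
        # Sparsity constraint: keep only the k largest entries per row of W.
        idx = np.argsort(W, axis=1)[:, :-k]
        np.put_along_axis(W, idx, 0.0, axis=1)
    return W, H

P = np.full((6, 6), 1 / 6)      # toy 6-state chain with a rank-1 transition matrix
W, H = sparse_nmf(P, m=2, k=1)
print(np.linalg.norm(P - W @ H))
```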
This paper develops a controller synthesis approach for a multi-agent system (MAS) with intermittent communication. We adopt a leader-follower scheme, where a mobile leader with absolute position sensors switches among a set of followers without absolute position sensors to provide each follower with intermittent state information. We model the MAS as a switched system. The followers are to asymptotically reach a predetermined consensus state. To guarantee the stability of the switched system and the consensus of the followers, we derive maximum and minimum dwell-time conditions to constrain the intervals between consecutive time instants at which the leader should provide state information to the same follower. Furthermore, the leader needs to satisfy practical constraints such as charging its battery and staying in specific regions of interest. Both the dwell-time conditions and these practical constraints can be expressed by metric temporal logic (MTL) specifications. We iteratively compute the optimal control inputs such that the leader satisfies the MTL specifications, while guaranteeing stability and consensus of the followers. We implement the proposed method on a case study with three mobile robots as the followers and one quadrotor as the leader.
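The dwell-time conditions can be read as interval constraints on the leader's visit schedule; a minimal sketch of checking them, with the schedule format and constants chosen for illustration.

```python
def check_dwell_times(schedule, tau_min, tau_max):
    """schedule: list of (time, follower) visits by the leader.
    Checks that consecutive visits to the same follower are separated
    by at least tau_min and at most tau_max time units."""
    last_visit = {}
    for t, follower in sorted(schedule):
        if follower in last_visit:
            gap = t - last_visit[follower]
            if not (tau_min <= gap <= tau_max):
                return False, (follower, gap)
        last_visit[follower] = t
    return True, None

schedule = [(0, "f1"), (2, "f2"), (5, "f1"), (6, "f3"), (9, "f2")]
print(check_dwell_times(schedule, tau_min=3, tau_max=8))
```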
Incorporating high-level knowledge is an effective way to expedite reinforcement learning (RL), especially for complex tasks with sparse rewards. We investigate an RL problem where the high-level knowledge is in the form of reward machines, i.e., a type of Mealy machine that encodes the reward functions. We focus on a setting in which this knowledge is not available to the learning agent a priori. We develop an iterative algorithm that performs joint inference of reward machines and policies for RL (more specifically, q-learning). In each iteration, the algorithm maintains a hypothesis reward machine and a sample of RL episodes. It derives q-functions from the current hypothesis reward machine and performs RL to update the q-functions. While performing RL, the algorithm updates the sample by adding RL episodes along which the obtained rewards are inconsistent with the rewards based on the current hypothesis reward machine. In the next iteration, the algorithm infers a new hypothesis reward machine from the updated sample. Based on an equivalence relation we define between states of reward machines, we transfer the q-functions between the hypothesis reward machines in consecutive iterations. We prove that the proposed algorithm converges almost surely to an optimal policy in the limit if a minimal reward machine can be inferred and the maximal length of each RL episode is sufficiently long. The experiments show that learning high-level knowledge in the form of reward machines can lead to fast convergence to optimal policies in RL, whereas standard RL methods such as q-learning and hierarchical RL methods fail to converge to optimal policies in many tasks even after a substantial number of training steps.
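A compact sketch of the inner loop: q-functions are indexed by the state of the hypothesis reward machine, and episodes whose observed rewards disagree with the hypothesis are collected for re-inference. The toy environment, the hypothesis machine, and all names here are illustrative; the inference step itself is only indicated by a comment.

```python
import random
from collections import defaultdict

class RewardMachine:
    """Hypothesis reward machine: a Mealy machine over event labels."""
    def __init__(self, delta, rho, init=0):
        self.delta, self.rho, self.init = delta, rho, init
    def step(self, v, label):
        return self.delta.get((v, label), v), self.rho.get((v, label), 0.0)

class ToyEnv:
    """Tiny line world: reaching cell 3 emits event 'g' and gives reward 1."""
    actions = (-1, 1)
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s = max(0, min(3, self.s + a))
        label = "g" if self.s == 3 else ""
        return self.s, label, (1.0 if label == "g" else 0.0), label == "g"

def qrm_episode(env, rm, Q, alpha=0.1, gamma=0.9, eps=0.1):
    """One q-learning episode with q-functions indexed by the RM state;
    returns the trace if observed rewards contradict the hypothesis RM."""
    s, v, trace, inconsistent = env.reset(), rm.init, [], False
    for _ in range(100):
        a = (random.choice(env.actions) if random.random() < eps
             else max(env.actions, key=lambda b: Q[(v, s, b)]))
        s2, label, r_env, done = env.step(a)
        v2, r_rm = rm.step(v, label)
        inconsistent |= (r_rm != r_env)
        trace.append((label, r_env))
        target = r_env + (0.0 if done else
                          gamma * max(Q[(v2, s2, b)] for b in env.actions))
        Q[(v, s, a)] += alpha * (target - Q[(v, s, a)])
        s, v = s2, v2
        if done:
            break
    return trace if inconsistent else None

Q = defaultdict(float)
env = ToyEnv()
rm = RewardMachine(delta={(0, "g"): 1}, rho={(0, "g"): 1.0})
samples = [t for t in (qrm_episode(env, rm, Q) for _ in range(50)) if t]
# In the full algorithm, nonempty `samples` would trigger inference of a new
# hypothesis reward machine and a transfer of the q-functions between machines.
print(len(samples), "inconsistent episodes")
```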
Transferring high-level knowledge from a source task to a target task is an effective way to expedite reinforcement learning (RL). For example, propositional logic and first-order logic have been used as representations of such knowledge. We study the transfer of knowledge between tasks in which the timing of events matters. We call such tasks temporal tasks. We concretize similarity between temporal tasks through a notion of logical transferability, and develop a transfer learning approach between different yet similar temporal tasks. We first propose an inference technique to extract metric interval temporal logic (MITL) formulas in sequential disjunctive normal form from labeled trajectories collected in RL of the two tasks. If logical transferability is identified through this inference, we construct a timed automaton for each sequential conjunctive subformula of the inferred MITL formulas from both tasks. We perform RL on the extended state, which includes the locations and clock valuations of the timed automata, for the source task. We then establish mappings between the corresponding components (clocks, locations, etc.) of the timed automata from the two tasks, and transfer the extended Q-functions based on the established mappings. Finally, we perform RL on the extended state for the target task, starting with the transferred extended Q-functions. Our results in two case studies show that, depending on how similar the source and target tasks are, the sampling efficiency for the target task can be improved by up to one order of magnitude by performing RL in the extended state space, and by up to another order of magnitude by using the transferred extended Q-functions.
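The transfer step can be sketched as a relabeling of the source task's extended Q-function through the established component mappings; the extended-state layout and all names below are illustrative assumptions.

```python
def transfer_q(q_source, loc_map, clock_map):
    """Initialize the target task's extended Q-function by relabeling the
    source Q-function through the location and clock mappings."""
    q_target = {}
    for (env_s, loc, clock, action), value in q_source.items():
        if loc in loc_map and clock in clock_map:
            q_target[(env_s, loc_map[loc], clock_map[clock], action)] = value
    return q_target

# Toy extended Q-function for the source task: (env state, location, clock, action).
q_source = {("s0", "l0", 0, "a"): 0.4, ("s0", "l1", 1, "a"): 0.9}
loc_map = {"l0": "m0", "l1": "m1"}   # source location -> target location
clock_map = {0: 0, 1: 1}             # source clock valuation -> target
print(transfer_q(q_source, loc_map, clock_map))
```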
We present a novel unsupervised deep learning approach that utilizes the encoder-decoder architecture for detecting anomalies in sequential sensor data collected during industrial manufacturing. Our approach is designed not only to detect whether there exists an anomaly at a given time step, but also to predict what will happen next in the (sequential) process. We demonstrate our approach on a dataset collected from a real-world testbed. The dataset contains images collected both under normal conditions and with synthetic anomalies injected. We show that the encoder-decoder model is able to identify the injected anomalies in a modern manufacturing process in an unsupervised fashion. In addition, it gives hints about the temperature non-uniformity of the testbed during manufacturing, which we were not aware of before the experiment.
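The general pattern this approach builds on, an autoencoder trained on normal data whose reconstruction error flags anomalies, can be sketched in a few lines of PyTorch; the architecture, the synthetic data, and the 3-sigma threshold rule are assumptions for illustration, not the paper's exact model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(             # encoder-decoder over flattened sensor frames
    nn.Linear(64, 16), nn.ReLU(),  # encoder compresses to a 16-dim code
    nn.Linear(16, 64),             # decoder reconstructs the input
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
normal = torch.randn(256, 64) * 0.1        # stand-in for normal sensor data

for _ in range(200):                       # train on normal data only
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(normal), normal)
    loss.backward()
    opt.step()

with torch.no_grad():
    errs = ((model(normal) - normal) ** 2).mean(dim=1)
    threshold = errs.mean() + 3 * errs.std()   # simple 3-sigma rule
    anomaly = torch.randn(1, 64) * 2.0         # injected anomaly
    err = ((model(anomaly) - anomaly) ** 2).mean()
    print(f"error {err:.4f} vs threshold {threshold:.4f} ->",
          "anomaly" if err > threshold else "normal")
```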
We synthesize shared control protocols subject to probabilistic temporal logic specifications. More specifically, we develop a framework in which a human and an autonomy protocol can issue commands to carry out a certain task. We blend these commands into a joint input to a robot. We model the interaction between the human and the robot as a Markov decision process (MDP) that represents the shared control scenario. Using inverse reinforcement learning, we obtain an abstraction of the human's behavior and decisions. We use randomized strategies to account for randomness in the human's decisions, caused by factors such as the complexity of the task specifications or imperfect interfaces. We design the autonomy protocol to ensure that the resulting robot behavior satisfies given safety and performance specifications in probabilistic temporal logic. Additionally, the resulting strategies generate behavior that is as similar as possible to the behavior induced by the human's commands. We solve the underlying problem efficiently using quasiconvex programming. Case studies involving autonomous wheelchair navigation and unmanned aerial vehicle mission planning showcase the applicability of our approach.
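Quasiconvex programs of this kind can be solved by bisection over feasibility subproblems. A toy illustration of the blending idea: a scalar weight mixes the human's and the autonomy's action distributions, and bisection finds the most human-like blend that still meets a success threshold (the `success_prob` model is a stand-in for the probability computed on the MDP).

```python
import numpy as np

def blend(p_human, p_auto, w):
    """Randomized strategy: mix the two action distributions with weight w."""
    return w * p_human + (1 - w) * p_auto

def success_prob(p):
    """Stand-in for the probability of satisfying the specification under
    distribution p; the paper computes this on the shared-control MDP."""
    return p @ np.array([0.9, 0.4, 0.1])   # per-action success rates

p_human = np.array([0.1, 0.3, 0.6])   # human prefers the risky action
p_auto = np.array([0.8, 0.2, 0.0])    # autonomy prefers the safe action
threshold = 0.6

lo, hi = 0.0, 1.0
for _ in range(30):                   # bisection: largest feasible human weight
    mid = (lo + hi) / 2
    if success_prob(blend(p_human, p_auto, mid)) >= threshold:
        lo = mid                      # feasible: stay closer to the human
    else:
        hi = mid
print(f"max human weight meeting the spec: {lo:.3f}")
```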