This paper presents an algorithmic framework for learning robust policies in asymmetric imperfect-information games, where the joint reward may depend on the uncertain opponent type (private information known only to the opponent and its ally). To maximize the reward, the protagonist agent must infer the opponent type through agent modeling. We use multiagent reinforcement learning (MARL) to learn opponent models through self-play, which captures the full strategic interaction and reasoning between agents. However, agent policies learned from self-play can suffer from mutual overfitting. Ensemble training can improve the robustness of the agent policy against different opponents, but it also significantly increases the computational overhead. To achieve a good trade-off between the robustness of the learned policy and the computational cost, we propose to train a separate opponent policy against the protagonist agent for evaluation purposes. The reward achieved by this opponent is a noisy measure of the robustness of the protagonist's policy, owing to the intrinsic stochasticity of a reinforcement learner. To handle this stochasticity, we apply a stochastic optimization scheme that dynamically updates the opponent ensemble to optimize an objective balancing robustness against computational cost. We empirically show that, under the same limited computational budget, the proposed method yields more robust policies than standard ensemble training.
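The dynamic ensemble update described above can be sketched as a simple stochastic search over ensemble compositions. Everything here is an illustrative assumption rather than the paper's actual scheme: the `eval_fn` hook stands in for training-and-evaluating the separate opponent, the greedy accept rule stands in for the stochastic optimization, and `lam` is a made-up computation penalty.

```python
import random

def noisy_robustness(ensemble, eval_fn, n_rollouts=5):
    """Average reward of the evaluation opponent against a protagonist trained
    vs. `ensemble` -- a noisy robustness estimate (lower opponent reward means
    a more robust protagonist)."""
    return sum(eval_fn(ensemble) for _ in range(n_rollouts)) / n_rollouts

def update_ensemble(ensemble, candidate_pool, eval_fn, lam=0.1, steps=20, seed=0):
    """Stochastically grow/shrink the opponent ensemble to trade off
    robustness against computational cost (ensemble size)."""
    rng = random.Random(seed)

    def objective(ens):
        # robustness (negated opponent reward) minus a size/computation penalty
        return -noisy_robustness(ens, eval_fn) - lam * len(ens)

    best, best_obj = list(ensemble), objective(ensemble)
    for _ in range(steps):
        cand = list(best)
        if rng.random() < 0.5 and cand:
            cand.pop(rng.randrange(len(cand)))       # propose dropping an opponent
        else:
            cand.append(rng.choice(candidate_pool))  # propose adding an opponent
        obj = objective(cand)
        if obj > best_obj:                           # accept only improvements
            best, best_obj = cand, obj
    return best
```

With a deterministic `eval_fn`, the search settles where the marginal robustness gain of one more opponent no longer pays for its cost.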
A common approach to defining a reward function for multi-objective reinforcement learning (MORL) problems is a weighted sum of the objectives. The weights are then treated as design parameters that depend on the expertise (and preferences) of the person performing the learning, with the typical result that a new solution must be computed for every change in these settings. This paper investigates the relationship between the reward function and the optimal value function for MORL; specifically, it addresses the question of how to approximate the optimal value function well beyond the set of weights for which the optimization problem was actually solved, thereby avoiding recomputation for each particular choice. We prove that the optimal value function transforms smoothly under a transformation of the reward-function weights (and thus admits a smooth interpolation in policy space). A Gaussian process is used to obtain a smooth interpolation of the optimal value function over the reward-function weights for three well-known examples: GridWorld, Objectworld, and Pendulum. The results show that the interpolation provides robust value estimates for sampled states and actions in both discrete and continuous domains. Significant advantages arise from this interpolation technique in the domain of autonomous vehicles: easy, instant adaptation to user preferences while driving, and true randomization of obstacle-vehicle behavior preferences during training.
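A minimal GP interpolation of the optimal value function over a scalar reward weight can be sketched in pure Python. The RBF length-scale, noise level, and training weights below are illustrative assumptions, not the paper's settings; the quadratic "value" data merely stands in for values obtained by solving the MORL problem at a few weights.

```python
import math

def rbf(a, b, ls=0.3):
    """Squared-exponential kernel on scalar reward weights."""
    return math.exp(-0.5 * ((a - b) / ls) ** 2)

def solve(A, b):
    """Gauss-Jordan elimination with partial pivoting (small dense systems)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def gp_interpolate(train_w, train_v, query_w, noise=1e-6):
    """GP posterior mean fitted to (weight, optimal value) pairs, evaluated at
    new weights -- values at unsolved weights come 'for free'."""
    K = [[rbf(a, b) + (noise if i == j else 0.0)
          for j, b in enumerate(train_w)] for i, a in enumerate(train_w)]
    alpha = solve(K, train_v)
    return [sum(rbf(q, w) * a for w, a in zip(train_w, alpha)) for q in query_w]
```

The interpolant reproduces the solved weights almost exactly and gives smooth estimates in between, mirroring the smoothness result the abstract states.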
We present a multi-robot system for GPS-denied search and rescue under the forest canopy. Forests are particularly challenging environments for collaborative exploration and mapping, in large part due to severe perceptual aliasing, which hinders reliable loop closure detection for mutual localization and map fusion. Our proposed system features unmanned aerial vehicles (UAVs) that perform onboard sensing, estimation, and planning. When communication is available, each UAV transmits compressed tree-based submaps to a central ground station for collaborative simultaneous localization and mapping (CSLAM). To overcome high measurement noise and perceptual aliasing, we use the local configuration of a group of trees as a distinctive feature for robust loop closure detection. Furthermore, we propose a novel procedure based on cycle-consistent multiway matching to recover from incorrect pairwise data associations. The returned global data association is guaranteed to be cycle consistent and is shown to improve both precision and recall over the input pairwise associations. The proposed multi-UAV system is validated both in simulation and during real-world collaborative exploration missions at NASA Langley Research Center.
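The idea of rejecting associations that disagree around a cycle can be illustrated with a toy triangle-consistency filter: an association i→k survives only if it agrees with the composition i→j→k. This is a simplified stand-in, not the paper's multiway matching procedure, and the robot/observation labels are made up.

```python
def cycle_consistent(matches, robots):
    """Filter pairwise data associations, keeping only entries that agree
    around every available triangle i -> j -> k vs. i -> k.

    `matches[(i, j)]` maps an observation ID of robot i to one of robot j."""
    ok = {pair: dict(m) for pair, m in matches.items()}
    for i in robots:
        for j in robots:
            for k in robots:
                mij, mjk, mik = ok.get((i, j)), ok.get((j, k)), ok.get((i, k))
                if not (mij and mjk and mik):
                    continue
                for a in list(mik):
                    b = mij.get(a)
                    if b is None or mjk.get(b) != mik[a]:
                        del mik[a]   # inconsistent around the cycle: drop it
    return ok
```

In the toy example below, the spurious association 4→5 has no support around the A→B→C triangle and is removed, while the consistent 1→3 survives.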
As autonomous systems rely increasingly on onboard sensors for localization and perception, the parallel tasks of motion planning and uncertainty minimization become increasingly coupled. This coupling is well captured by augmenting the planning objective with a posterior-covariance penalty; however, online optimization can be computationally intractable, particularly for observation models with latent environmental dependencies (e.g., unknown landmarks). This paper addresses a number of fundamental challenges in the efficient minimization of the posterior covariance. First, we provide a measurement bundling approximation that enables high-rate sensors to be approximated with fewer, low-rate updates. This allows for landmark marginalization (crucial in the case of unknown landmarks), for which we provide a novel recipe for computing the gradients necessary for optimization. Finally, we identify a large class of measurement models for which the contributions from each landmark can be combined, so that evaluating the total information gained at each timestep can be carried out (nearly) independently of the number of landmarks. We evaluate our trajectory-generation framework for both a Dubins car and a quadrotor, demonstrating significant estimation improvement and moderate computation time.
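The observation that per-landmark information contributions combine additively can be illustrated with a toy range-measurement model: each landmark contributes an independent rank-one term to the Fisher information of the robot's position, and the terms simply sum. The 2x2 position-only block and the noise model are illustrative assumptions, not the paper's measurement class.

```python
import math

def info_gain(robot_xy, landmarks, sigma=0.5):
    """Total Fisher information (2x2 position block) from independent range
    measurements to each landmark; contributions are simply summed."""
    Ixx = Ixy = Iyy = 0.0
    for lx, ly in landmarks:
        dx, dy = lx - robot_xy[0], ly - robot_xy[1]
        r = math.hypot(dx, dy)
        # Jacobian row of a range measurement w.r.t. robot position
        hx, hy = -dx / r, -dy / r
        w = 1.0 / sigma ** 2                  # measurement precision
        Ixx += w * hx * hx
        Ixy += w * hx * hy
        Iyy += w * hy * hy
    return [[Ixx, Ixy], [Ixy, Iyy]]
```

Because the loop body touches one landmark at a time, the cost of evaluating the total information at a timestep grows only linearly (and, with precomputed sums, nearly not at all) in the number of landmarks.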
Last-mile delivery systems commonly propose the use of autonomous robotic vehicles to increase scalability and efficiency. The economic inefficiency of collecting accurate prior maps for navigation motivates the use of planning algorithms that operate in unmapped environments. However, these algorithms typically waste time exploring regions that are unlikely to contain the delivery destination. Context, the high-level information available in structured environments, could guide exploration toward the unknown goal location, but this abstract notion is difficult to quantify for use in a planning algorithm. Some approaches specifically consider contextual relationships between objects, but would perform poorly in object-sparse environments such as the outdoors. Recent deep learning-based approaches treat context too generally, making training and transferability difficult. This work therefore formulates the use of context for planning as an image-to-image translation problem, which extracts terrain context from semantic gridmaps into a metric that an exploration-based planner can use. The proposed framework has the benefit of training on a static dataset instead of requiring a time-consuming simulator. Across 42 test houses with layouts from satellite images, the trained algorithm enables a robot to reach its goal 189\% faster than a context-unaware planner, and within 63\% of the optimal path computed with a prior map. The proposed algorithm is also demonstrated on a vehicle with a forward-facing camera in a high-fidelity Unreal simulation of neighborhood houses.
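How an exploration planner might consume such a learned metric can be sketched with a toy frontier selector: compute frontier cells (free cells bordering unknown space) and pick the one the network scores as cheapest to reach the goal from. The grid encoding and the `predicted_cost` array standing in for the translation network's output are illustrative assumptions.

```python
FREE, UNKNOWN, OCC = 0, 1, 2

def frontier_cells(grid):
    """Free cells adjacent (4-connectivity) to unknown space."""
    H, W = len(grid), len(grid[0])
    out = []
    for r in range(H):
        for c in range(W):
            if grid[r][c] != FREE:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < H and 0 <= cc < W and grid[rr][cc] == UNKNOWN:
                    out.append((r, c))
                    break
    return out

def pick_frontier(grid, predicted_cost):
    """Choose the frontier with the lowest learned cost-to-goal, as would be
    predicted by the (assumed) image-to-image translation network."""
    return min(frontier_cells(grid), key=lambda rc: predicted_cost[rc[0]][rc[1]])
```

The learned metric thus replaces a context-unaware heuristic (e.g., nearest frontier) without changing the planner's structure.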
This paper presents resource-aware algorithms for distributed inter-robot loop closure detection for applications such as collaborative simultaneous localization and mapping (CSLAM) and distributed image retrieval. In real-world scenarios, this process is resource-intensive as it involves exchanging many observations and geometrically verifying a large number of potential matches. This poses severe challenges for small-size and low-cost robots with various operational and resource constraints that limit, e.g., energy consumption, communication bandwidth, and computation capacity. This paper proposes a framework in which robots first exchange compact queries to identify a set of potential loop closures. We then seek to select a subset of potential inter-robot loop closures for geometric verification that maximizes a monotone submodular performance metric without exceeding budgets on computation (number of geometric verifications) and communication (amount of exchanged data for geometric verification). We demonstrate that this problem is in general NP-hard, and present efficient approximation algorithms with provable performance guarantees. The proposed framework is extensively evaluated on real and synthetic datasets. A natural convex relaxation scheme is also presented to certify the near-optimal performance of the proposed framework in practice.
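The budgeted selection step can be illustrated with a cost-benefit greedy heuristic on a toy coverage objective (coverage functions are a standard example of monotone submodular metrics). This sketch uses a single knapsack budget and made-up candidates, so it is only a stand-in for the paper's algorithms and guarantees.

```python
def greedy_select(candidates, coverage, cost, budget):
    """Cost-benefit greedy: repeatedly add the candidate loop closure with the
    best marginal-gain-to-cost ratio that still fits in the budget."""
    chosen, covered, spent = [], set(), 0.0
    while True:
        best, best_ratio = None, 0.0
        for c in candidates:
            if c in chosen or spent + cost[c] > budget:
                continue
            gain = len(coverage[c] - covered)   # marginal coverage gain
            ratio = gain / cost[c]
            if ratio > best_ratio:
                best, best_ratio = c, ratio
        if best is None:
            return chosen, covered
        chosen.append(best)
        covered |= coverage[best]
        spent += cost[best]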
Multiagent reinforcement learning (MARL) algorithms have been demonstrated on complex tasks that require coordination among a team of agents. Existing works have focused on sharing information between agents via centralized critics to stabilize learning, or through communication to increase performance, but generally do not examine how information can be shared between agents to address the curse of dimensionality in MARL. We posit that a multiagent problem can be decomposed into a multi-task problem in which each agent explores a subset of the state space instead of the entire state space. This paper introduces a multiagent actor-critic algorithm and a method for combining knowledge from homogeneous agents through distillation and value matching; the combined approach outperforms policy distillation alone and allows further learning in both discrete and continuous action spaces.
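The distillation half of the idea can be illustrated with a toy tabular merge of two homogeneous agents: on states both have visited, the student averages their action distributions; on states only one has visited, it inherits that teacher's distribution. This is a simplified stand-in (the paper additionally matches value estimates, which is omitted here), and the states and probabilities are made up.

```python
def distill(teachers):
    """Merge tabular policies of homogeneous agents into one student policy.

    Each teacher maps state -> action-probability list. Shared states are
    averaged pairwise; unique states are copied (two-teacher sketch)."""
    student = {}
    for t in teachers:
        for s, probs in t.items():
            if s in student:
                student[s] = [(a + b) / 2 for a, b in zip(student[s], probs)]
            else:
                student[s] = list(probs)
    return student
```

Because each teacher only needed to explore its own subset of states, the student ends up covering the union, which is the dimensionality-reduction argument the abstract makes.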
The so-called Burer-Monteiro method is a well-studied technique for solving large-scale semidefinite programs (SDPs) via low-rank factorization. The main idea is to solve rank-restricted, albeit non-convex, surrogates instead of the SDP. Recent works have shown that, in an important class of SDPs with elegant geometric structure, one can find globally optimal solutions to the SDP by finding rank-deficient second-order critical points of an unconstrained Riemannian optimization problem. Hence, in such problems, the Burer-Monteiro approach can provide a scalable and reliable alternative to interior-point methods, which scale poorly. Among the various Riemannian optimization methods proposed, block-coordinate minimization (BCM) is of particular interest due to its simplicity. In recent work, Erdogdu et al. proposed BCM for problems over the Cartesian product of unit spheres and provided global convergence rate estimates for the algorithm. This report extends the BCM algorithm and the global convergence rate analysis of Erdogdu et al. from the Cartesian product of unit spheres to the Cartesian product of Stiefel manifolds. The latter, more general setting has important applications, such as synchronization over the special orthogonal (SO) and special Euclidean (SE) groups.
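The sphere case of BCM admits a closed-form block update, which a toy implementation makes concrete: to minimize a weighted sum of inner products over unit vectors, each block is set to the negated, normalized weighted sum of the others. The weights and initialization below are illustrative; the Stiefel generalization this report concerns replaces the normalization with a polar-type step.

```python
import math

def bcm_sphere(W, X, sweeps=50):
    """Block-coordinate minimization of sum_{i != j} W[i][j] * <x_i, x_j>
    over unit vectors x_i (rows of X), via the closed-form block update."""
    n, d = len(X), len(X[0])
    for _ in range(sweeps):
        for i in range(n):
            # weighted sum of the other blocks
            g = [sum(W[i][j] * X[j][k] for j in range(n) if j != i)
                 for k in range(d)]
            nrm = math.sqrt(sum(v * v for v in g))
            if nrm > 1e-12:
                X[i] = [-v / nrm for v in g]   # minimizer over the unit sphere
    return X
```

For two blocks with positive coupling, the update drives the vectors antipodal (inner product -1), the global minimum of this tiny instance.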
Planning high-speed trajectories for UAVs in unknown environments requires extremely fast algorithms that can solve the trajectory generation problem in real time, so as to react quickly to changing knowledge of the world, while guaranteeing safety at all times. The desire to maintain computational tractability typically leads to optimization problems that do not include the obstacles (collision checks are done on the solutions) or to formulations that use a convex decomposition of the free space and then impose an ad hoc allocation of each interval of the trajectory to a specific polyhedron. Moreover, safety guarantees are usually obtained by having the local planner plan a trajectory with a final "stop" condition in the free-known space. However, these two decisions typically lead to slow and conservative trajectories. We propose FaSTrap (Fast and Safe Trajectory Planner) to overcome these issues. FaSTrap obtains faster trajectories by enabling the local planner to optimize in both free-known and unknown space. Safety guarantees are ensured by always having a feasible, safe back-up trajectory in the free-known space at the start of each replanning step. Furthermore, we present a Mixed-Integer Quadratic Program (MIQP) formulation in which the solver can choose the interval allocation, and where a heuristic for the time allocation is computed efficiently using the result of the previous replanning iteration. The proposed algorithm is tested both in simulation and on real hardware, showing agile flights in unknown cluttered environments.
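The back-up safety logic can be sketched abstractly: at each replanning step, a newly optimized plan is committed only if its known-space portion is verified collision-free and ends at rest (the back-up "stop" state); otherwise the previously committed safe plan is kept, so a feasible safe trajectory exists at all times. All field names below are hypothetical, and real verification would check continuous-time feasibility rather than set membership.

```python
def replan(plans, known_free):
    """Always hold a verified-safe trajectory across replanning steps.

    `plans` is a sequence of candidate plans over time; each plan carries a
    `safe_part` (waypoints that must lie in free-known space) and a `stops`
    flag (ends at rest). The first plan is assumed verified safe."""
    committed = plans[0]
    for p in plans[1:]:
        if all(q in known_free for q in p["safe_part"]) and p["stops"]:
            committed = p          # new plan carries its own back-up
        # else: keep the previous safe plan; never fly an unverified one
    return committed
```

Rejected candidates never replace the committed plan, which is the invariant that lets the optimizer reach into unknown space without sacrificing safety.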