Existing communication methods for multi-agent reinforcement learning (MARL) in cooperative multi-robot problems are almost exclusively task-specific, training new communication strategies for each unique task. We address this inefficiency by introducing a communication strategy applicable to any task within a given environment. We pre-train the communication strategy without task-specific reward guidance in a self-supervised manner using a set autoencoder. Our objective is to learn a fixed-size latent Markov state from a variable number of agent observations. Under mild assumptions, we prove that policies using our latent representations are guaranteed to converge, and upper bound the value error introduced by our Markov state approximation. Our method enables seamless adaptation to novel tasks without fine-tuning the communication strategy, gracefully supports scaling to more agents than present during training, and detects out-of-distribution events in an environment. Empirical results on diverse MARL scenarios validate the effectiveness of our approach, surpassing task-specific communication strategies in unseen tasks. Our implementation of this work is available at https://github.com/proroklab/task-agnostic-comms.
In RL, memory models such as RNNs and transformers address Partially Observable Markov Decision Processes (POMDPs) by mapping trajectories to latent Markov states. Neither model scales particularly well to long sequences, especially compared to an emerging class of memory models sometimes called linear recurrent models. We discover that the recurrent update of these models is a monoid, leading us to formally define a novel memory monoid framework. We revisit the traditional approach to batching in recurrent RL, highlighting both theoretical and empirical deficiencies. Leveraging the properties of memory monoids, we propose a new batching method that improves sample efficiency, increases the return, and simplifies the implementation of recurrent loss functions in RL.
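As a rough illustration of the monoid view described above (not the paper's implementation), consider the simple linear recurrence h_t = a_t * h_{t-1} + b_t. The names `combine`, `IDENTITY`, `tree_reduce`, and `apply_update` are hypothetical, and the scalar recurrence is assumed only for illustration. Each step is an affine update (a, b); composing two updates is associative and has an identity element, so the updates form a monoid and can be reduced in any grouping, including a balanced tree.

```python
from typing import List, Tuple

# Assumption: each element (a, b) stands for the affine update h -> a * h + b.
Elem = Tuple[float, float]

# Identity element: the update h -> 1 * h + 0 leaves the state unchanged.
IDENTITY: Elem = (1.0, 0.0)


def combine(left: Elem, right: Elem) -> Elem:
    """Compose two affine updates: apply `left` first, then `right`."""
    a1, b1 = left
    a2, b2 = right
    # a2 * (a1 * h + b1) + b2 == (a1 * a2) * h + (a2 * b1 + b2)
    return (a1 * a2, a2 * b1 + b2)


def apply_update(elem: Elem, h: float) -> float:
    """Apply a (possibly composed) update to a state h."""
    a, b = elem
    return a * h + b


def sequential_state(elems: List[Elem], h0: float) -> float:
    """Reference implementation: step through the recurrence one item at a time."""
    h = h0
    for elem in elems:
        h = apply_update(elem, h)
    return h


def tree_reduce(elems: List[Elem]) -> Elem:
    """Reduce in balanced-tree order; valid because `combine` is associative."""
    if not elems:
        return IDENTITY
    if len(elems) == 1:
        return elems[0]
    mid = len(elems) // 2
    return combine(tree_reduce(elems[:mid]), tree_reduce(elems[mid:]))


if __name__ == "__main__":
    steps = [(0.5, 1.0), (0.9, 2.0), (0.1, 3.0), (0.7, -1.0)]
    print(sequential_state(steps, 0.3))
    print(apply_update(tree_reduce(steps), 0.3))
```

Both printed values should agree up to floating-point rounding. Because the grouping does not matter, the two halves of the tree can be reduced independently, which is what makes parallel evaluation of such recurrences possible.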
Nearly all real world tasks are inherently partially observable, necessitating the use of memory in Reinforcement Learning (RL). Most model-free approaches summarize the trajectory into a latent Markov state using memory models borrowed from Supervised Learning (SL), even though RL tends to exhibit different training and efficiency characteristics. Addressing this discrepancy, we introduce Fast and Forgetful Memory, an algorithm-agnostic memory model designed specifically for RL. Our approach constrains the model search space via strong structural priors inspired by computational psychology. It is a drop-in replacement for recurrent neural networks (RNNs) in recurrent RL algorithms, achieving greater reward than RNNs across various recurrent benchmarks and algorithms without changing any hyperparameters. Moreover, Fast and Forgetful Memory exhibits training speeds two orders of magnitude faster than RNNs, attributed to its logarithmic time and linear space complexity. Our implementation is available at https://github.com/proroklab/ffm.
Graph Neural Network (GNN) architectures are defined by their implementations of update and aggregation modules. While many works focus on new ways to parametrise the update modules, the aggregation modules receive comparatively little attention. Because it is difficult to parametrise aggregation functions, most methods currently select a "standard aggregator" such as $\mathrm{mean}$, $\mathrm{sum}$, or $\mathrm{max}$. While this selection is often made without any justification, it has been shown that the choice of aggregator has a significant impact on performance, and that the best choice of aggregator is problem-dependent. Since aggregation is a lossy operation, it is crucial to select the most appropriate aggregator in order to minimise information loss. In this paper, we present GenAgg, a generalised aggregation operator, which parametrises a function space that includes all standard aggregators. In our experiments, we show that GenAgg is able to represent the standard aggregators with much higher accuracy than baseline methods. We also show that using GenAgg as a drop-in replacement for an existing aggregator in a GNN often leads to a significant boost in performance across various tasks.
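To make the idea of one parametrised family covering several standard aggregators concrete, here is a minimal sketch using a scaled power mean. This is an assumption chosen for illustration and is not claimed to be GenAgg's parametrisation; the function name `power_mean` and the parameters `p` and `scale` are hypothetical, and inputs are assumed to be strictly positive.

```python
from typing import Sequence


def power_mean(xs: Sequence[float], p: float, scale: float = 0.0) -> float:
    """Scaled power mean: n**scale * (mean(x**p))**(1/p).

    Assumptions (illustrative only): xs is non-empty, all values are
    strictly positive, and p is non-zero.
      p = 1, scale = 0  -> arithmetic mean
      p = 1, scale = 1  -> sum
      large p, scale = 0 -> approaches max
    """
    n = len(xs)
    mean_of_powers = sum(x ** p for x in xs) / n
    return (n ** scale) * mean_of_powers ** (1.0 / p)


if __name__ == "__main__":
    values = [1.0, 2.0, 3.0, 4.0]
    print("mean-like:", power_mean(values, p=1.0))
    print("sum-like: ", power_mean(values, p=1.0, scale=1.0))
    print("max-like: ", power_mean(values, p=50.0))
```

With `p=1` the function reproduces the mean, and with `scale=1` the sum, while increasing `p` moves the result towards the maximum. A learnable aggregator could in principle treat such parameters as trainable, which is the general flavour of the approach described above.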
Real world applications of Reinforcement Learning (RL) are often partially observable, thus requiring memory. Despite this, partial observability is still largely ignored by contemporary RL benchmarks and libraries. We introduce Partially Observable Process Gym (POPGym), a two-part library containing (1) a diverse collection of 15 partially observable environments, each with multiple difficulties and (2) implementations of 13 memory model baselines -- the most in a single RL library. Existing partially observable benchmarks tend to fixate on 3D visual navigation, which is computationally expensive and only one type of POMDP. In contrast, POPGym environments are diverse, produce smaller observations, use less memory, and often converge within two hours of training on a consumer-grade GPU. We implement our high-level memory API and memory baselines on top of the popular RLlib framework, providing plug-and-play compatibility with various training algorithms, exploration strategies, and distributed training paradigms. Using POPGym, we execute the largest comparison across RL memory models to date. POPGym is available at https://github.com/proroklab/popgym.
The problem of permutation-invariant learning over set representations is particularly relevant in the field of multi-agent systems -- a few potential applications include unsupervised training of aggregation functions in graph neural networks (GNNs), neural cellular automata on graphs, and prediction of scenes with multiple objects. Yet existing approaches to set encoding and decoding tasks present a host of issues, including non-permutation-invariance, fixed-length outputs, reliance on iterative methods, non-deterministic outputs, computationally expensive loss functions, and poor reconstruction accuracy. In this paper we introduce a Permutation-Invariant Set Autoencoder (PISA), which tackles these problems and produces encodings with significantly lower reconstruction error than existing baselines. PISA also provides other desirable properties, including a similarity-preserving latent space, and the ability to insert or remove elements from the encoding. After evaluating PISA against baseline methods, we demonstrate its usefulness in a multi-agent application. Using PISA as a subcomponent, we introduce a novel GNN architecture which serves as a generalised communication scheme, allowing agents to use communication to gain full observability of a system.
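The properties named above (permutation invariance, a fixed-size code for a variable number of elements, and insertion or removal of elements) can be illustrated with a deliberately simple sum-pooling encoder. This is not PISA and involves no learning; the feature map `phi` and the helpers `encode`, `insert`, and `remove` are hypothetical names assumed for this sketch.

```python
import random
from typing import List, Sequence


def phi(x: float) -> List[float]:
    """Hypothetical fixed per-element feature map: [1, x, x^2]."""
    return [1.0, x, x * x]


def encode(elements: Sequence[float]) -> List[float]:
    """Sum per-element features; the result has fixed size for any set size."""
    z = [0.0, 0.0, 0.0]
    for x in elements:
        z = [zi + fi for zi, fi in zip(z, phi(x))]
    return z


def insert(z: List[float], x: float) -> List[float]:
    """Add one element to an existing encoding without re-encoding the set."""
    return [zi + fi for zi, fi in zip(z, phi(x))]


def remove(z: List[float], x: float) -> List[float]:
    """Remove one element's contribution from an existing encoding."""
    return [zi - fi for zi, fi in zip(z, phi(x))]


if __name__ == "__main__":
    items = [1.0, 2.0, 3.0, 4.0]
    shuffled = items[:]
    random.shuffle(shuffled)
    print(encode(items))
    print(encode(shuffled))            # same code for any ordering
    print(insert(encode(items[:3]), 4.0))
```

Because addition is commutative, the encoding does not depend on element order, and the first component records the set size. Unlike an autoencoder, this sketch provides no decoder; it only illustrates the invariance and the insert/remove properties.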
Graph Neural Networks (GNNs) are a paradigm-shifting neural architecture that facilitates the learning of complex multi-agent behaviors. Recent work has demonstrated remarkable performance in tasks such as flocking, multi-agent path planning, and cooperative coverage. However, the policies derived through GNN-based learning schemes have not yet been deployed in the real world on physical multi-robot systems. In this work, we present the design of a system that allows for fully decentralized execution of GNN-based policies. We create a framework based on ROS2 and elaborate its details in this paper. We demonstrate our framework on a case study that requires tight coordination between robots, and present first-of-a-kind results that show successful real-world deployment of GNN-based policies on a decentralized multi-robot system relying on ad hoc communication. A video demonstration of this case study can be found online at https://www.youtube.com/watch?v=COh-WLn4iO4
The rapid advancement and miniaturization of spacecraft electronics, sensors, actuators, and power systems have resulted in a growing proliferation of small spacecraft. Coupled with this is the growing number of rocket launches, with left-over debris marking their trail. The space debris problem has also been compounded by tests of several satellite-killer missiles that have left large remnant debris fields. In this paper, we assume a future in which the Kessler Effect has taken hold and analyze the implications for the design of small satellites and CubeSats. We use a multi-pronged approach of surveying the latest technologies, including the ability to sense space debris in orbit, perform obstacle avoidance, and carry sufficient shielding to withstand small impacts, along with other techniques to mitigate the problem. Detecting and tracking space debris threats on-orbit is expected to be an important approach, and we analyze the latest vision algorithms to perform the detection, followed by quick-reaction control systems to perform the avoidance. Alternatively, there may be scenarios where the debris is too small to track and avoid. In this case, the spacecraft will need passive mitigation measures to survive the impact. Based on these conditions, we develop a strawman design of a small spacecraft to mitigate these challenges. Based upon this study, we identify whether there is sufficient present-day COTS technology to mitigate or shield satellites from the problem. We conclude by outlining technology pathways that need to be advanced now to best prepare ourselves for the worst-case eventuality of the Kessler Effect taking hold in the upper altitudes of Low Earth Orbit.
Pits on the Moon and Mars are intriguing geological formations that have yet to be explored. These geological formations can provide protection from harsh diurnal temperature variations, ionizing radiation, and meteorite impacts. Some have proposed that these underground formations are well-suited as human outposts. Some theorize that the Martian pits may harbor remnants of past life. Unfortunately, these geological formations have been off-limits to conventional wheeled rovers and lander systems due to their collapsed ceiling or 'skylight' entrances. In this paper, a new low-cost method to explore these pits is presented using the Spring Propelled Extreme Environment Robot (SPEER). The SPEER consists of a launch system that flings disposable spherical microbots through skylights into the pits. Each microbot is a low-cost, disposable Al-6061 aluminium sphere carrying an array of adapted COTS sensors and a solid rocket motor for soft landing. By moving most control authority to the launcher, the microbots become very simple, lightweight, and low-cost. We present a preliminary design of the microbots that can be built today using commercial components for under 500 USD. The microbots have a total mass of 1 kg, with more than 750 g available for a science instrument. In this paper, we present the design, dynamics and control, and operation of these microbots. This is followed by initial feasibility studies of the SPEER system by simulating exploration of a known Lunar pit in Mare Tranquillitatis.
The discovery of ice deposits in the permanently shadowed craters of the lunar North and South Poles presents an important opportunity for In-Situ Resource Utilization. These ice deposits may be the source for sustaining a lunar base or for enabling an interplanetary refueling station. These ice deposits also preserve a unique record of the geology and environment of their hosts, both in terms of impact history and the supply of volatile compounds, and so are of immense scientific interest. To date, these ice deposits have been studied indirectly and by remote active radar, but they need to be analyzed in-situ by robotic systems that can study the depths of the deposits, their purity, and their composition. However, these shadowed craters never see sunlight and are among the coldest places in the solar system. NASA JPL proposed the use of solar reflectors mounted on crater rims to project sunlight into the crater depths for use by ground robots. The solar reflectors would heat the crater base and the vehicles positioned there sufficiently for them to survive the cold temperatures. Our work analyzes part of the logistics of this approach, with teams of robots climbing into and out of the crater to access the ice deposits. The mission will require robots to descend into extreme environments and carry large structures, including instruments and communication devices.