Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Michael Beukman

An Optimisation Framework for Unsupervised Environment Design

May 27, 2025

Nathan Monette, Alistair Letcher, Michael Beukman, Matthew T. Jackson, Alexander Rutherford, Alexander D. Goldie, Jakob N. Foerster

Abstract:For reinforcement learning agents to be deployed in high-risk settings, they must achieve a high level of robustness to unfamiliar scenarios. One method for improving robustness is unsupervised environment design (UED), a suite of methods aiming to maximise an agent's generalisability across configurations of an environment. In this work, we study UED from an optimisation perspective, providing stronger theoretical guarantees for practical settings than prior work. Whereas previous methods relied on guarantees if they reach convergence, our framework employs a nonconvex-strongly-concave objective for which we provide a provably convergent algorithm in the zero-sum setting. We empirically verify the efficacy of our method, outperforming prior methods in a number of environments with varying difficulties.

* Reinforcement Learning Conference 2025

Via

Access Paper or Ask Questions

Kinetix: Investigating the Training of General Agents through Open-Ended Physics-Based Control Tasks

Oct 30, 2024

Michael Matthews, Michael Beukman, Chris Lu, Jakob Foerster

Abstract:While large models trained with self-supervised learning on offline datasets have shown remarkable capabilities in text and image domains, achieving the same generalisation for agents that act in sequential decision problems remains an open challenge. In this work, we take a step towards this goal by procedurally generating tens of millions of 2D physics-based tasks and using these to train a general reinforcement learning (RL) agent for physical control. To this end, we introduce Kinetix: an open-ended space of physics-based RL environments that can represent tasks ranging from robotic locomotion and grasping to video games and classic RL environments, all within a unified framework. Kinetix makes use of our novel hardware-accelerated physics engine Jax2D that allows us to cheaply simulate billions of environment steps during training. Our trained agent exhibits strong physical reasoning capabilities, being able to zero-shot solve unseen human-designed environments. Furthermore, fine-tuning this general agent on tasks of interest shows significantly stronger performance than training an RL agent *tabula rasa*. This includes solving some environments that standard RL training completely fails at. We believe this demonstrates the feasibility of large scale, mixed-quality pre-training for online RL and we hope that Kinetix will serve as a useful framework to investigate this further.

* The first two authors contributed equally. Project page located at: https://kinetix-env.github.io/

Via

Access Paper or Ask Questions

JaxLife: An Open-Ended Agentic Simulator

Sep 01, 2024

Chris Lu, Michael Beukman, Michael Matthews, Jakob Foerster

Abstract:Human intelligence emerged through the process of natural selection and evolution on Earth. We investigate what it would take to re-create this process in silico. While past work has often focused on low-level processes (such as simulating physics or chemistry), we instead take a more targeted approach, aiming to evolve agents that can accumulate open-ended culture and technologies across generations. Towards this, we present JaxLife: an artificial life simulator in which embodied agents, parameterized by deep neural networks, must learn to survive in an expressive world containing programmable systems. First, we describe the environment and show that it can facilitate meaningful Turing-complete computation. We then analyze the evolved emergent agents' behavior, such as rudimentary communication protocols, agriculture, and tool use. Finally, we investigate how complexity scales with the amount of compute used. We believe JaxLife takes a step towards studying evolved behavior in more open-ended simulations. Our code is available at https://github.com/luchris429/JaxLife

Via

Access Paper or Ask Questions

No Regrets: Investigating and Improving Regret Approximations for Curriculum Discovery

Aug 27, 2024

Alexander Rutherford, Michael Beukman, Timon Willi, Bruno Lacerda, Nick Hawes, Jakob Foerster

Figure 1 for No Regrets: Investigating and Improving Regret Approximations for Curriculum Discovery

Figure 2 for No Regrets: Investigating and Improving Regret Approximations for Curriculum Discovery

Figure 3 for No Regrets: Investigating and Improving Regret Approximations for Curriculum Discovery

Figure 4 for No Regrets: Investigating and Improving Regret Approximations for Curriculum Discovery

Abstract:What data or environments to use for training to improve downstream performance is a longstanding and very topical question in reinforcement learning. In particular, Unsupervised Environment Design (UED) methods have gained recent attention as their adaptive curricula enable agents to be robust to in- and out-of-distribution tasks. We ask to what extent these methods are themselves robust when applied to a novel setting, closely inspired by a real-world robotics problem. Surprisingly, we find that the state-of-the-art UED methods either do not improve upon the na\"{i}ve baseline of Domain Randomisation (DR), or require substantial hyperparameter tuning to do so. Our analysis shows that this is due to their underlying scoring functions failing to predict intuitive measures of ``learnability'', i.e., in finding the settings that the agent sometimes solves, but not always. Based on this, we instead directly train on levels with high learnability and find that this simple and intuitive approach outperforms UED methods and DR in several binary-outcome environments, including on our domain and the standard UED domain of Minigrid. We further introduce a new adversarial evaluation procedure for directly measuring robustness, closely mirroring the conditional value at risk (CVaR). We open-source all our code and present visualisations of final policies here: https://github.com/amacrutherford/sampling-for-learnability.

Via

Access Paper or Ask Questions

JaxUED: A simple and useable UED library in Jax

Mar 19, 2024

Samuel Coward, Michael Beukman, Jakob Foerster

Figure 1 for JaxUED: A simple and useable UED library in Jax

Figure 2 for JaxUED: A simple and useable UED library in Jax

Figure 3 for JaxUED: A simple and useable UED library in Jax

Figure 4 for JaxUED: A simple and useable UED library in Jax

Abstract:We present JaxUED, an open-source library providing minimal dependency implementations of modern Unsupervised Environment Design (UED) algorithms in Jax. JaxUED leverages hardware acceleration to obtain on the order of 100x speedups compared to prior, CPU-based implementations. Inspired by CleanRL, we provide fast, clear, understandable, and easily modifiable implementations, with the aim of accelerating research into UED. This paper describes our library and contains baseline results. Code can be found at https://github.com/DramaCow/jaxued.

* 11 pages, 5 figures

Via

Access Paper or Ask Questions

Craftax: A Lightning-Fast Benchmark for Open-Ended Reinforcement Learning

Feb 26, 2024

Michael Matthews, Michael Beukman, Benjamin Ellis, Mikayel Samvelyan, Matthew Jackson, Samuel Coward, Jakob Foerster

Figure 1 for Craftax: A Lightning-Fast Benchmark for Open-Ended Reinforcement Learning

Figure 2 for Craftax: A Lightning-Fast Benchmark for Open-Ended Reinforcement Learning

Figure 3 for Craftax: A Lightning-Fast Benchmark for Open-Ended Reinforcement Learning

Figure 4 for Craftax: A Lightning-Fast Benchmark for Open-Ended Reinforcement Learning

Abstract:Benchmarks play a crucial role in the development and analysis of reinforcement learning (RL) algorithms. We identify that existing benchmarks used for research into open-ended learning fall into one of two categories. Either they are too slow for meaningful research to be performed without enormous computational resources, like Crafter, NetHack and Minecraft, or they are not complex enough to pose a significant challenge, like Minigrid and Procgen. To remedy this, we first present Craftax-Classic: a ground-up rewrite of Crafter in JAX that runs up to 250x faster than the Python-native original. A run of PPO using 1 billion environment interactions finishes in under an hour using only a single GPU and averages 90% of the optimal reward. To provide a more compelling challenge we present the main Craftax benchmark, a significant extension of the Crafter mechanics with elements inspired from NetHack. Solving Craftax requires deep exploration, long term planning and memory, as well as continual adaptation to novel situations as more of the world is discovered. We show that existing methods including global and episodic exploration, as well as unsupervised environment design fail to make material progress on the benchmark. We believe that Craftax can for the first time allow researchers to experiment in a complex, open-ended environment with limited computational resources.

Via

Access Paper or Ask Questions

Refining Minimax Regret for Unsupervised Environment Design

Feb 19, 2024

Michael Beukman, Samuel Coward, Michael Matthews, Mattie Fellows, Minqi Jiang, Michael Dennis, Jakob Foerster

Abstract:In unsupervised environment design, reinforcement learning agents are trained on environment configurations (levels) generated by an adversary that maximises some objective. Regret is a commonly used objective that theoretically results in a minimax regret (MMR) policy with desirable robustness guarantees; in particular, the agent's maximum regret is bounded. However, once the agent reaches this regret bound on all levels, the adversary will only sample levels where regret cannot be further reduced. Although there are possible performance improvements to be made outside of these regret-maximising levels, learning stagnates. In this work, we introduce Bayesian level-perfect MMR (BLP), a refinement of the minimax regret objective that overcomes this limitation. We formally show that solving for this objective results in a subset of MMR policies, and that BLP policies act consistently with a Perfect Bayesian policy over all levels. We further introduce an algorithm, ReMiDi, that results in a BLP policy at convergence. We empirically demonstrate that training on levels from a minimax regret adversary causes learning to prematurely stagnate, but that ReMiDi continues learning.

* The first two authors contributed equally

Via

Access Paper or Ask Questions

Dynamics Generalisation in Reinforcement Learning via Adaptive Context-Aware Policies

Oct 25, 2023

Michael Beukman, Devon Jarvis, Richard Klein, Steven James, Benjamin Rosman

Figure 1 for Dynamics Generalisation in Reinforcement Learning via Adaptive Context-Aware Policies

Figure 2 for Dynamics Generalisation in Reinforcement Learning via Adaptive Context-Aware Policies

Figure 3 for Dynamics Generalisation in Reinforcement Learning via Adaptive Context-Aware Policies

Figure 4 for Dynamics Generalisation in Reinforcement Learning via Adaptive Context-Aware Policies

Abstract:While reinforcement learning has achieved remarkable successes in several domains, its real-world application is limited due to many methods failing to generalise to unfamiliar conditions. In this work, we consider the problem of generalising to new transition dynamics, corresponding to cases in which the environment's response to the agent's actions differs. For example, the gravitational force exerted on a robot depends on its mass and changes the robot's mobility. Consequently, in such cases, it is necessary to condition an agent's actions on extrinsic state information and pertinent contextual information reflecting how the environment responds. While the need for context-sensitive policies has been established, the manner in which context is incorporated architecturally has received less attention. Thus, in this work, we present an investigation into how context information should be incorporated into behaviour learning to improve generalisation. To this end, we introduce a neural network architecture, the Decision Adapter, which generates the weights of an adapter module and conditions the behaviour of an agent on the context information. We show that the Decision Adapter is a useful generalisation of a previously proposed architecture and empirically demonstrate that it results in superior generalisation performance compared to previous approaches in several environments. Beyond this, the Decision Adapter is more robust to irrelevant distractor variables than several alternative methods.

* Accepted to NeurIPS 2023

Via

Access Paper or Ask Questions

Analysing Cross-Lingual Transfer in Low-Resourced African Named Entity Recognition

Sep 11, 2023

Michael Beukman, Manuel Fokam

Abstract:Transfer learning has led to large gains in performance for nearly all NLP tasks while making downstream models easier and faster to train. This has also been extended to low-resourced languages, with some success. We investigate the properties of cross-lingual transfer learning between ten low-resourced languages, from the perspective of a named entity recognition task. We specifically investigate how much adaptive fine-tuning and the choice of transfer language affect zero-shot transfer performance. We find that models that perform well on a single language often do so at the expense of generalising to others, while models with the best generalisation to other languages suffer in individual language performance. Furthermore, the amount of data overlap between the source and target datasets is a better predictor of transfer performance than either the geographical or genetic distance between the languages.

* Accepted to IJCNLP-AACL 2023

Via

Access Paper or Ask Questions

Hierarchically Composing Level Generators for the Creation of Complex Structures

Feb 03, 2023

Michael Beukman, Manuel Fokam, Marcel Kruger, Guy Axelrod, Muhammad Nasir, Branden Ingram, Benjamin Rosman, Steven James

Figure 1 for Hierarchically Composing Level Generators for the Creation of Complex Structures

Figure 2 for Hierarchically Composing Level Generators for the Creation of Complex Structures

Figure 3 for Hierarchically Composing Level Generators for the Creation of Complex Structures

Figure 4 for Hierarchically Composing Level Generators for the Creation of Complex Structures

Abstract:Procedural content generation (PCG) is a growing field, with numerous applications in the video game industry, and great potential to help create better games at a fraction of the cost of manual creation. However, much of the work in PCG is focused on generating relatively straightforward levels in simple games, as it is challenging to design an optimisable objective function for complex settings. This limits the applicability of PCG to more complex and modern titles, hindering its adoption in industry. Our work aims to address this limitation by introducing a compositional level generation method, which recursively composes simple, low-level generators together to construct large and complex creations. This approach allows for easily-optimisable objectives and the ability to design a complex structure in an interpretable way by referencing lower-level components. We empirically demonstrate that our method outperforms a non-compositional baseline by more accurately satisfying a designer's functional requirements in several tasks. Finally, we provide a qualitative showcase (in Minecraft) illustrating the large and complex, but still coherent, structures that were generated using simple base generators.

* Code is available at https://github.com/Michael-Beukman/MCHAMR. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Via

Access Paper or Ask Questions