Özgür Şimşek

Learning The Minimum Action Distance

Jun 10, 2025

Lorenzo Steccanella, Joshua B. Evans, Özgür Şimşek, Anders Jonsson

Abstract:This paper presents a state representation framework for Markov decision processes (MDPs) that can be learned solely from state trajectories, requiring neither reward signals nor the actions executed by the agent. We propose learning the minimum action distance (MAD), defined as the minimum number of actions required to transition between states, as a fundamental metric that captures the underlying structure of an environment. MAD naturally enables critical downstream tasks such as goal-conditioned reinforcement learning and reward shaping by providing a dense, geometrically meaningful measure of progress. Our self-supervised learning approach constructs an embedding space where the distances between embedded state pairs correspond to their MAD, accommodating both symmetric and asymmetric approximations. We evaluate the framework on a comprehensive suite of environments with known MAD values, encompassing both deterministic and stochastic dynamics, as well as discrete and continuous state spaces, and environments with noisy observations. Empirical results demonstrate that the proposed approach not only efficiently learns accurate MAD representations across these diverse settings but also significantly outperforms existing state representation methods in terms of representation quality.
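To make the definition of MAD concrete, here is a minimal sketch that computes exact minimum action distances by breadth-first search on a small, hand-made deterministic transition graph. The graph, the state names, and the function `minimum_action_distance` are illustrative assumptions and do not come from the paper. The paper learns an embedding from state trajectories; this sketch only computes the ground-truth quantity that such an embedding would approximate when the transition structure is already known.

```python
from collections import deque


def minimum_action_distance(transitions, start):
    """Return {state: minimum number of actions needed to reach it from start}.

    `transitions` maps each state to the states reachable with one action.
    Unreachable states are simply absent from the result.
    """
    dist = {start: 0}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for nxt in transitions.get(state, ()):
            if nxt not in dist:  # first visit in BFS is the shortest path
                dist[nxt] = dist[state] + 1
                queue.append(nxt)
    return dist


# Hypothetical 4-state environment: a chain A-B-C-D plus a one-way edge D -> A.
transitions = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["A"],
}

mad_from_a = minimum_action_distance(transitions, "A")
mad_from_d = minimum_action_distance(transitions, "D")
print(mad_from_a["D"], mad_from_d["A"])
```

In this toy graph the distance from A to D differs from the distance from D to A, which is the kind of asymmetry that the abstract's "asymmetric approximations" are meant to accommodate.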

A Theoretical Framework for Explaining Reinforcement Learning with Shapley Values

May 12, 2025

Daniel Beechey, Thomas M. S. Smith, Özgür Şimşek

Abstract:Reinforcement learning agents can achieve superhuman performance, but their decisions are often difficult to interpret. This lack of transparency limits deployment, especially in safety-critical settings where human trust and accountability are essential. In this work, we develop a theoretical framework for explaining reinforcement learning through the influence of state features, which represent what the agent observes in its environment. We identify three core elements of the agent-environment interaction that benefit from explanation: behaviour (what the agent does), performance (what the agent achieves), and value estimation (what the agent expects to achieve). We treat state features as players cooperating to produce each element and apply Shapley values, a principled method from cooperative game theory, to identify the influence of each feature. This approach yields a family of mathematically grounded explanations with clear semantics and theoretical guarantees. We use illustrative examples to show how these explanations align with human intuition and reveal novel insights. Our framework unifies and extends prior work, making explicit the assumptions behind existing approaches, and offers a principled foundation for more interpretable and trustworthy reinforcement learning.
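The idea of treating state features as players in a cooperative game can be illustrated with a short sketch that computes exact Shapley values by enumerating coalitions. The feature names and the characteristic function `toy_value` are hypothetical and chosen only for illustration. In the framework described above, the value of a coalition would instead be derived from the agent's behaviour, performance, or value estimates when only those features are known, which this sketch does not attempt to model.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values for a cooperative game.

    `value` maps a frozenset of players to the payoff of that coalition.
    The cost grows exponentially with the number of players.
    """
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that p then joins.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


def toy_value(coalition):
    """Hypothetical game: x and y contribute and interact; z is irrelevant."""
    v = 0.0
    if "x" in coalition:
        v += 1.0
    if "y" in coalition:
        v += 2.0
    if "x" in coalition and "y" in coalition:
        v += 1.0  # interaction bonus, split equally by the Shapley value
    return v


phi = shapley_values(["x", "y", "z"], toy_value)
print(phi)
```

The contributions sum to the value of the full coalition (the efficiency property), and the irrelevant feature `z` receives zero.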

Curricula for Learning Robust Policies with Factored State Representations in Changing Environments

Sep 19, 2024

Panayiotis Panayiotou, Özgür Şimşek

Abstract:Robust policies enable reinforcement learning agents to effectively adapt to and operate in unpredictable, dynamic, and ever-changing real-world environments. Factored representations, which break down complex state and action spaces into distinct components, can improve generalization and sample efficiency in policy learning. In this paper, we explore how the curriculum of an agent using a factored state representation affects the robustness of the learned policy. We experimentally demonstrate three simple curricula, such as varying only the variable of highest regret between episodes, that can significantly enhance policy robustness, offering practical insights for reinforcement learning in complex environments.

* 17th European Workshop on Reinforcement Learning (EWRL 2024)

Curricula for Learning Robust Policies over Factored State Representations in Changing Environments

Sep 13, 2024

Panayiotis Panayiotou, Özgür Şimşek

* 17th European Workshop on Reinforcement Learning (EWRL 2024)

Interpretability in Action: Exploratory Analysis of VPT, a Minecraft Agent

Jul 16, 2024

Karolis Jucys, George Adamopoulos, Mehrab Hamidi, Stephanie Milani, Mohammad Reza Samsami, Artem Zholus, Sonia Joseph, Blake Richards, Irina Rish, Özgür Şimşek

Abstract:Understanding the mechanisms behind decisions taken by large foundation models in sequential decision making tasks is critical to ensuring that such systems operate transparently and safely. In this work, we perform exploratory analysis on the Video PreTraining (VPT) Minecraft playing agent, one of the largest open-source vision-based agents. We aim to illuminate its reasoning mechanisms by applying various interpretability techniques. First, we analyze the attention mechanism while the agent solves its training task - crafting a diamond pickaxe. The agent pays attention to the last four frames and several key-frames further back in its six-second memory. This is a possible mechanism for maintaining coherence in a task that takes 3-10 minutes, despite the short memory span. Secondly, we perform various interventions, which help us uncover a worrying case of goal misgeneralization: VPT mistakenly identifies a villager wearing brown clothes as a tree trunk when the villager is positioned stationary under green tree leaves, and punches it to death.

* Mechanistic Interpretability Workshop at ICML 2024

Colour versus Shape Goal Misgeneralization in Reinforcement Learning: A Case Study

Dec 05, 2023

Karolis Ramanauskas, Özgür Şimşek

Abstract:We explore colour versus shape goal misgeneralization originally demonstrated by Di Langosco et al. (2022) in the Procgen Maze environment, where, given an ambiguous choice, the agents seem to prefer generalization based on colour rather than shape. After training over 1,000 agents in a simplified version of the environment and evaluating them on over 10 million episodes, we conclude that the behaviour can be attributed to the agents learning to detect the goal object through a specific colour channel. This choice is arbitrary. Additionally, we show how, due to underspecification, the preferences can change when retraining the agents using exactly the same procedure except for using a different random seed for the training run. Finally, we demonstrate the existence of outliers in out-of-distribution behaviour based on training random seed alone.

* ATTRIB: Workshop on Attributing Model Behavior at Scale at NeurIPS 2023

Creating Multi-Level Skill Hierarchies in Reinforcement Learning

Jun 16, 2023

Joshua B. Evans, Özgür Şimşek

Abstract:What is a useful skill hierarchy for an autonomous agent? We propose an answer based on the graphical structure of an agent's interaction with its environment. Our approach uses hierarchical graph partitioning to expose the structure of the graph at varying timescales, producing a skill hierarchy with multiple levels of abstraction. At each level of the hierarchy, skills move the agent between regions of the state space that are well connected within themselves but weakly connected to each other. We illustrate the utility of the proposed skill hierarchy in a wide variety of domains in the context of reinforcement learning.

* 19 pages, 12 figures

Explaining Reinforcement Learning with Shapley Values

Jun 09, 2023

Daniel Beechey, Thomas M. S. Smith, Özgür Şimşek

Abstract:For reinforcement learning systems to be widely adopted, their users must understand and trust them. We present a theoretical analysis of explaining reinforcement learning using Shapley values, following a principled approach from game theory for identifying the contribution of individual players to the outcome of a cooperative game. We call this general framework Shapley Values for Explaining Reinforcement Learning (SVERL). Our analysis exposes the limitations of earlier uses of Shapley values in reinforcement learning. We then develop an approach that uses Shapley values to explain agent performance. In a variety of domains, SVERL produces meaningful explanations that match and supplement human intuition.

* 12 pages, 9 figures. Accepted at ICML 2023

Resource-Constrained Station-Keeping for Helium Balloons using Reinforcement Learning

Mar 02, 2023

Jack Saunders, Loïc Prenevost, Özgür Şimşek, Alan Hunter, Wenbin Li

Abstract:High altitude balloons have proved useful for ecological aerial surveys, atmospheric monitoring, and communication relays. However, due to weight and power constraints, there is a need to investigate alternate modes of propulsion to navigate in the stratosphere. Very recently, reinforcement learning has been proposed as a control scheme to maintain the balloon in the region of a fixed location, facilitated through diverse opposing wind-fields at different altitudes. Although air-pump based station keeping has been explored, there is no research on the control problem for venting and ballasting actuated balloons, which is commonly used as a low-cost alternative. We show how reinforcement learning can be used for this type of balloon. Specifically, we use the soft actor-critic algorithm, which on average is able to station-keep within 50 km for 25% of the flight, consistent with state-of-the-art. Furthermore, we show that the proposed controller effectively minimises the consumption of resources, thereby supporting long duration flights. We frame the controller as a continuous control reinforcement learning problem, which allows for a more diverse range of trajectories, as opposed to current state-of-the-art work, which uses discrete action spaces. Furthermore, through continuous control, we can make use of larger ascent rates which are not possible using air-pumps. The desired ascent-rate is decoupled into desired altitude and time-factor to provide a more transparent policy, compared to low-level control commands used in previous works. Finally, by applying the equations of motion, we establish appropriate thresholds for venting and ballasting to prevent the agent from exploiting the environment. More specifically, we ensure actions are physically feasible by enforcing constraints on venting and ballasting.

Iterative Policy-Space Expansion in Reinforcement Learning

Dec 05, 2019

Jan Malte Lichtenberg, Özgür Şimşek

Abstract:Humans and animals solve a difficult problem much more easily when they are presented with a sequence of problems that starts simple and slowly increases in difficulty. We explore this idea in the context of reinforcement learning. Rather than providing the agent with an externally provided curriculum of progressively more difficult tasks, the agent solves a single task utilizing a decreasingly constrained policy space. The algorithm we propose first learns to categorize features into positive and negative before gradually learning a more refined policy. Experimental results in Tetris demonstrate superior learning rate of our approach when compared to existing algorithms.

* Workshop on Biological and Artificial Reinforcement Learning at the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada
