
Frans A. Oliehoek

Inverse Concave-Utility Reinforcement Learning is Inverse Game Theory

May 29, 2024

Mustafa Mert Çelikok, Frans A. Oliehoek, Jan-Willem van de Meent

Abstract: We consider inverse reinforcement learning problems with concave utilities. Concave Utility Reinforcement Learning (CURL) generalises the standard RL objective by employing a concave function of the state occupancy measure rather than a linear one. CURL has garnered recent attention for its ability to represent instances of many important applications, including standard RL, imitation learning, pure exploration, constrained MDPs, offline RL, human-regularized RL, and others. Inverse reinforcement learning is a powerful paradigm that focuses on recovering an unknown reward function that can rationalize the observed behaviour of an agent. There have been recent theoretical advances in inverse RL where the problem is formulated as identifying the set of feasible reward functions. However, inverse RL for CURL problems has not been considered previously. In this paper we show that most of the standard IRL results do not apply to CURL in general, since CURL invalidates the classical Bellman equations. This calls for a new theoretical framework for the inverse CURL problem. Using a recent equivalence result between CURL and Mean-field Games, we propose a new definition for the feasible rewards for I-CURL by proving that this problem is equivalent to an inverse game theory problem in a subclass of mean-field games. We present initial query and sample complexity results for the I-CURL problem under assumptions such as Lipschitz-continuity. Finally, we outline future directions and applications in human-AI collaboration enabled by our results.
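
The contrast between a linear and a concave utility of the state occupancy measure can be illustrated with a minimal sketch. The three-state occupancies, the reward vector `r`, and the choice of Shannon entropy as the concave function (the pure-exploration case) are illustrative assumptions and are not taken from the paper.

```python
import math

def linear_utility(d, r):
    """Standard RL objective: inner product of occupancy d and reward r."""
    return sum(d_s * r_s for d_s, r_s in zip(d, r))

def entropy_utility(d):
    """A concave utility of the occupancy: Shannon entropy."""
    return -sum(d_s * math.log(d_s) for d_s in d if d_s > 0.0)

def mix(d1, d2, w):
    """Convex combination of two occupancy measures."""
    return [w * a + (1.0 - w) * b for a, b in zip(d1, d2)]

# Hypothetical occupancies of two deterministic policies over three states.
d_a = [1.0, 0.0, 0.0]
d_b = [0.0, 0.0, 1.0]
r = [1.0, 0.0, 2.0]  # hypothetical reward vector
d_mix = mix(d_a, d_b, 0.5)

# Linear: the utility of the mixture equals the mixture of the utilities.
lin_gap = linear_utility(d_mix, r) - 0.5 * (
    linear_utility(d_a, r) + linear_utility(d_b, r)
)
# Concave: the utility of the mixture can strictly exceed the mixture of utilities.
ent_gap = entropy_utility(d_mix) - 0.5 * (
    entropy_utility(d_a) + entropy_utility(d_b)
)

print(f"linear gap:  {lin_gap:.3f}")
print(f"entropy gap: {ent_gap:.3f}")
```

With a linear utility the gap is zero for any mixture, whereas the entropy gap is positive here. This gives some intuition for why a concave objective cannot in general be rewritten as an expected sum of per-state rewards.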

Policy Space Response Oracles: A Survey

Mar 04, 2024

Ariyan Bighashdel, Yongzhao Wang, Stephen McAleer, Rahul Savani, Frans A. Oliehoek

Abstract: In game theory, a game refers to a model of interaction among rational decision-makers or players, making choices with the goal of achieving their individual objectives. Understanding their behavior in games is often referred to as game reasoning. This survey provides a comprehensive overview of a fast-developing game-reasoning framework for large games, known as Policy Space Response Oracles (PSRO). We first motivate PSRO, provide historical context, and position PSRO within game-reasoning approaches. We then focus on the strategy exploration issue for PSRO, the challenge of assembling an effective strategy portfolio for modeling the underlying game with minimum computational cost. We also survey current research directions for enhancing the efficiency of PSRO, and explore the applications of PSRO across various domains. We conclude by discussing open questions and future research.

* Ariyan Bighashdel and Yongzhao Wang contributed equally

When Do Off-Policy and On-Policy Policy Gradient Methods Align?

Feb 19, 2024

Davide Mambelli, Stephan Bongers, Onno Zoeter, Matthijs T. J. Spaan, Frans A. Oliehoek

Abstract: Policy gradient methods are widely adopted reinforcement learning algorithms for tasks with continuous action spaces. These methods have succeeded in many application domains; however, because of their notorious sample inefficiency, their use remains limited to problems where fast and accurate simulations are available. A common way to improve sample efficiency is to modify their objective function to be computable from off-policy samples without importance sampling. A well-established off-policy objective is the excursion objective. This work studies the difference between the excursion objective and the traditional on-policy objective, which we refer to as the on-off gap. We provide the first theoretical analysis of conditions under which the on-off gap is reduced, and establish empirical evidence of the shortfalls that arise when these conditions are not met.

What Lies beyond the Pareto Front? A Survey on Decision-Support Methods for Multi-Objective Optimization

Nov 19, 2023

Zuzanna Osika, Jazmin Zatarain Salazar, Diederik M. Roijers, Frans A. Oliehoek, Pradeep K. Murukannaiah

Abstract: We present a review that unifies decision-support methods for exploring the solutions produced by multi-objective optimization (MOO) algorithms. As MOO is applied to solve diverse problems, approaches for analyzing the trade-offs offered by MOO algorithms are scattered across fields. We provide an overview of the advances on this topic, including methods for visualization, mining the solution set, and uncertainty exploration as well as emerging research directions, including interactivity, explainability, and ethics. We synthesize these methods drawing from different fields of research to build a unified approach, independent of the application. Our goals are to reduce the entry barrier for researchers and practitioners to using MOO algorithms and to provide novel research directions.

* IJCAI 2023 Conference Paper, Survey Track
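
For readers unfamiliar with the Pareto front named in the title, a minimal sketch of the non-dominated filter that MOO algorithms typically end with is shown here. The function names and the two-objective sample data are hypothetical, and both objectives are assumed to be maximized.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(points):
    """Return the non-dominated points, preserving input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical two-objective outcomes (both maximized).
solutions = [(1.0, 5.0), (2.0, 4.0), (2.0, 2.0), (3.0, 1.0), (0.5, 0.5)]
front = pareto_front(solutions)
print(front)
```

The filter only returns the set of trade-offs; choosing among them is where the decision-support methods surveyed in the paper come in.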

Bad Habits: Policy Confounding and Out-of-Trajectory Generalization in RL

Jun 04, 2023

Miguel Suau, Matthijs T. J. Spaan, Frans A. Oliehoek

Abstract: Reinforcement learning agents may sometimes develop habits that are effective only when specific policies are followed. After an initial exploration phase in which agents try out different actions, they eventually converge toward a particular policy. When this occurs, the distribution of state-action trajectories becomes narrower, and agents start experiencing the same transitions again and again. At this point, spurious correlations may arise. Agents may then pick up on these correlations and learn state representations that do not generalize beyond the agent's trajectory distribution. In this paper, we provide a mathematical characterization of this phenomenon, which we refer to as policy confounding, and show, through a series of examples, when and how it occurs in practice.

What model does MuZero learn?

Jun 01, 2023

Jinke He, Thomas M. Moerland, Frans A. Oliehoek

Abstract: Model-based reinforcement learning has drawn considerable interest in recent years, given its promise to improve sample efficiency. Moreover, when using deep-learned models, it is potentially possible to learn compact models from complex sensor data. However, the effectiveness of these learned models, particularly their capacity to plan, i.e., to improve the current policy, remains unclear. In this work, we study MuZero, a well-known deep model-based reinforcement learning algorithm, and explore how far it achieves its learning objective of a value-equivalent model and how useful the learned models are for policy improvement. Amongst various other insights, we conclude that the model learned by MuZero cannot effectively generalize to evaluate unseen policies, which limits the extent to which we can additionally improve the current policy by planning with the model.

Towards a Unifying Model of Rationality in Multiagent Systems

May 29, 2023

Robert Loftin, Mustafa Mert Çelikok, Frans A. Oliehoek

Abstract: Multiagent systems deployed in the real world need to cooperate with other agents (including humans) nearly as effectively as these agents cooperate with one another. To design such AI, and provide guarantees of its effectiveness, we need to clearly specify what types of agents our AI must be able to cooperate with. In this work we propose a generic model of socially intelligent agents, which are individually rational learners that are also able to cooperate with one another (in the sense that their joint behavior is Pareto efficient). We define rationality in terms of the regret incurred by each agent over its lifetime, and show how we can construct socially intelligent agents for different forms of regret. We then discuss the implications of this model for the development of "robust" MAS that can cooperate with a wide variety of socially intelligent agents.

* 5 Pages, To appear in the OptLearnMAS Workshop at AAMAS 2023

Safety Guarantees in Multi-agent Learning via Trapping Regions

Feb 27, 2023

Aleksander Czechowski, Frans A. Oliehoek

Abstract: One of the main challenges of multi-agent learning lies in establishing convergence of the algorithms, as, in general, a collection of individual, self-serving agents is not guaranteed to converge with their joint policy when learning concurrently. This is in stark contrast to most single-agent environments, and sets a prohibitive barrier for deployment in practical applications, as it induces uncertainty in the long-term behavior of the system. In this work, we propose to apply the concept of trapping regions, known from the qualitative theory of dynamical systems, to create safety sets in the joint strategy space for decentralized learning. Upon verification of the direction of learning dynamics, the resulting trajectories are guaranteed not to escape such sets during the learning process. As a result, it is ensured that, despite the uncertainty over convergence of the applied algorithms, learning will never form hazardous joint strategy combinations. We introduce a binary partitioning algorithm for verification of trapping regions in systems with known learning dynamics, and a heuristic sampling algorithm for scenarios where learning dynamics are not known. In addition, via a fixed point argument, we show the existence of a learning equilibrium within a trapping region. We demonstrate the applications to a regularized version of the Dirac Generative Adversarial Network, a four-intersection traffic control scenario run in the state-of-the-art open-source microscopic traffic simulator SUMO, and a mathematical model of economic competition.
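
The idea of a trapping region can be illustrated with a minimal sketch: a set is trapping when the dynamics point strictly inward everywhere on its boundary. The code below is not either of the paper's algorithms. It assumes a known, hypothetical two-dimensional vector field `field`, a square box as the candidate region, and a simple random check of boundary points; the names `boundary_samples` and `looks_trapping` are likewise made up for illustration.

```python
import random

def field(x, y):
    """Hypothetical learning dynamics: a linear contraction with rotation."""
    return (-x + 0.5 * y, -0.5 * x - y)

def boundary_samples(lo, hi, n, rng):
    """Yield (point, outward_normal) pairs on the boundary of the box [lo, hi]^2."""
    for _ in range(n):
        t = rng.uniform(lo, hi)
        side = rng.randrange(4)
        if side == 0:
            yield (lo, t), (-1.0, 0.0)
        elif side == 1:
            yield (hi, t), (1.0, 0.0)
        elif side == 2:
            yield (t, lo), (0.0, -1.0)
        else:
            yield (t, hi), (0.0, 1.0)

def looks_trapping(vector_field, lo, hi, n=10_000, seed=0):
    """Return False as soon as a sampled boundary point does not flow strictly inward."""
    rng = random.Random(seed)
    for (x, y), (nx, ny) in boundary_samples(lo, hi, n, rng):
        fx, fy = vector_field(x, y)
        if fx * nx + fy * ny >= 0.0:
            return False
    return True

print(looks_trapping(field, -1.0, 1.0))
print(looks_trapping(lambda x, y: (x, -y), -1.0, 1.0))
```

Passing the sampled check is evidence rather than a proof, since a violation could lie between samples; a verified guarantee needs an exhaustive argument over the whole boundary.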

Uncoupled Learning of Differential Stackelberg Equilibria with Commitments

Feb 07, 2023

Robert Loftin, Mustafa Mert Çelikok, Herke van Hoof, Samuel Kaski, Frans A. Oliehoek

Abstract: A natural solution concept for many multiagent settings is the Stackelberg equilibrium, under which a "leader" agent selects a strategy that maximizes its own payoff assuming the "follower" chooses their best response to this strategy. Recent work has presented asymmetric learning updates that can be shown to converge to the differential Stackelberg equilibria of two-player differentiable games. These updates are "coupled" in the sense that the leader requires some information about the follower's payoff function. Such coupled learning rules cannot be applied to ad hoc interactive learning settings, and can be computationally impractical even in centralized training settings where the follower's payoffs are known. In this work, we present an "uncoupled" learning process under which each player's learning update only depends on their observations of the other's behavior. We prove that this process converges to a local Stackelberg equilibrium under similar conditions as previous coupled methods. We conclude with a discussion of the potential applications of our approach to human-AI cooperation and multi-agent reinforcement learning.
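
The Stackelberg solution concept described in the abstract can be illustrated on a small matrix game. This is a sketch under stated assumptions: the paper concerns differentiable games and learning updates, whereas the code below simply enumerates pure strategies in a hypothetical 2x2 game, with follower ties broken in favor of the lowest action index.

```python
def stackelberg_pure(leader_payoff, follower_payoff):
    """Return (leader_action, follower_action, leader_value) for pure commitments."""
    best = None
    for i, row in enumerate(follower_payoff):
        # The follower best-responds to leader action i (first maximizer wins ties).
        j = max(range(len(row)), key=lambda k: row[k])
        value = leader_payoff[i][j]
        if best is None or value > best[2]:
            best = (i, j, value)
    return best

# Hypothetical payoffs: rows are leader actions, columns are follower actions.
leader = [[2, 4], [1, 3]]
follower = [[1, 0], [0, 1]]

print(stackelberg_pure(leader, follower))
```

In this game row 0 strictly dominates row 1 for the leader, so simultaneous play leads the leader to row 0 and a payoff of 2. Committing to row 1 changes the follower's best response and yields 3, which is the effect the leader-follower asymmetry is meant to capture.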

An Analysis of Abstracted Model-Based Reinforcement Learning

Aug 30, 2022

Rolf A. N. Starre, Marco Loog, Frans A. Oliehoek

Abstract: Many methods for model-based reinforcement learning (MBRL) provide guarantees for both the accuracy of the Markov decision process (MDP) model they can deliver and the learning efficiency. At the same time, state abstraction techniques allow for a reduction of the size of an MDP while maintaining a bounded loss with respect to the original problem. It may come as a surprise, therefore, that no such guarantees are available when combining both techniques, i.e., where MBRL merely observes abstract states. Our theoretical analysis shows that abstraction can introduce a dependence between samples collected online (e.g., in the real world), which means that most results for MBRL cannot be directly extended to this setting. The new results in this work show that concentration inequalities for martingales can be used to overcome this problem, allowing the results of algorithms such as R-MAX to be extended to the setting with abstraction. This yields the first performance guarantees for Abstracted RL: model-based reinforcement learning with an abstracted model.

* 26 pages, 2 figures, submitted to NeurIPS 2022
