Paulo Rauber

Foundation Models as World Models: A Foundational Study in Text-Based GridWorlds

Sep 19, 2025

Remo Sasso, Michelangelo Conserva, Dominik Jeurissen, Paulo Rauber

Abstract: While reinforcement learning from scratch has shown impressive results in solving sequential decision-making tasks with efficient simulators, real-world applications with expensive interactions require more sample-efficient agents. Foundation models (FMs) are natural candidates to improve sample efficiency as they possess broad knowledge and reasoning capabilities, but it is not yet clear how to integrate them effectively into the reinforcement learning framework. In this paper, we anticipate and, most importantly, evaluate two promising strategies. First, we consider the use of foundation world models (FWMs) that exploit the prior knowledge of FMs to enable training and evaluating agents with simulated interactions. Second, we consider the use of foundation agents (FAs) that exploit the reasoning capabilities of FMs for decision-making. We evaluate both approaches empirically in a family of grid-world environments that are suitable for the current generation of large language models (LLMs). Our results suggest that improvements in LLMs already translate into better FWMs and FAs; that FAs based on current LLMs can already provide excellent policies for sufficiently simple environments; and that the coupling of FWMs and reinforcement learning agents is highly promising for more complex settings with partial observability and stochastic elements.

* 20 pages, 9 figures. Accepted for presentation at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop on Embodied World Models for Decision Making

Posterior Sampling for Deep Reinforcement Learning

Apr 30, 2023

Remo Sasso, Michelangelo Conserva, Paulo Rauber

Abstract: Despite remarkable successes, deep reinforcement learning algorithms remain sample inefficient: they require an enormous amount of trial and error to find good policies. Model-based algorithms promise sample efficiency by building an environment model that can be used for planning. Posterior Sampling for Reinforcement Learning is one such model-based algorithm, and it has attracted significant interest due to its performance in the tabular setting. This paper introduces Posterior Sampling for Deep Reinforcement Learning (PSDRL), the first truly scalable approximation of Posterior Sampling for Reinforcement Learning that retains its model-based essence. PSDRL combines efficient uncertainty quantification over latent state space models with a specially tailored continual planning algorithm based on value-function approximation. Extensive experiments on the Atari benchmark show that PSDRL significantly outperforms previous state-of-the-art attempts at scaling up posterior sampling while being competitive with a state-of-the-art (model-based) reinforcement learning method, both in sample efficiency and computational efficiency.
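
For readers unfamiliar with the tabular algorithm that the abstract builds on, the following is a minimal sketch of Posterior Sampling for Reinforcement Learning, not of PSDRL itself. It assumes known rewards, a uniform Dirichlet prior over transitions, and planning by value iteration; the function names, the toy two-state environment, and all parameter values are hypothetical and chosen only for illustration.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one probability vector from Dirichlet(alphas) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def q_value(P, R, V, s, a, gamma):
    """One-step lookahead value of action a in state s under model P and rewards R."""
    return R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))

def value_iteration(P, R, gamma=0.95, iters=200):
    """Return a greedy policy for transitions P[s][a][s'] and rewards R[s][a]."""
    n_states, n_actions = len(P), len(P[0])
    V = [0.0] * n_states
    for _ in range(iters):
        V = [max(q_value(P, R, V, s, a, gamma) for a in range(n_actions))
             for s in range(n_states)]
    return [max(range(n_actions), key=lambda a, s=s: q_value(P, R, V, s, a, gamma))
            for s in range(n_states)]

def psrl(true_P, R, episodes=50, horizon=20, seed=0):
    """Tabular posterior sampling (assumed setup: known rewards, Dirichlet(1,...,1) prior)."""
    rng = random.Random(seed)
    n_states, n_actions = len(true_P), len(true_P[0])
    # Dirichlet pseudo-counts: one vector over next states per (state, action) pair.
    counts = [[[1.0] * n_states for _ in range(n_actions)] for _ in range(n_states)]
    policy = [0] * n_states
    for _ in range(episodes):
        # 1. Sample one plausible environment model from the current posterior.
        sampled_P = [[sample_dirichlet(counts[s][a], rng) for a in range(n_actions)]
                     for s in range(n_states)]
        # 2. Act optimally for the sampled model during the whole episode.
        policy = value_iteration(sampled_P, R)
        s = 0
        for _ in range(horizon):
            a = policy[s]
            s_next = rng.choices(range(n_states), weights=true_P[s][a])[0]
            # 3. Update the posterior with the observed transition.
            counts[s][a][s_next] += 1.0
            s = s_next
    return policy, counts

# Hypothetical toy environment: state 1 is rewarding when the agent stays there.
TRUE_P = [[[1.0, 0.0], [0.1, 0.9]],
          [[0.0, 1.0], [1.0, 0.0]]]
REWARDS = [[0.0, 0.0], [1.0, 0.0]]

if __name__ == "__main__":
    final_policy, final_counts = psrl(TRUE_P, REWARDS)
    print("policy from the last sampled model:", final_policy)
```

Exploration here comes only from the randomness of the sampled model: state-action pairs with little data have diffuse posteriors, so they are occasionally sampled as attractive and get tried. Scaling this idea beyond tables is the problem the paper addresses.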

Hardness in Markov Decision Processes: Theory and Practice

Oct 24, 2022

Michelangelo Conserva, Paulo Rauber

Abstract: Meticulously analysing the empirical strengths and weaknesses of reinforcement learning methods in hard (challenging) environments is essential to inspire innovations and assess progress in the field. In tabular reinforcement learning, there is no well-established standard selection of environments to conduct such analysis, which is partially due to the lack of a widespread understanding of the rich theory of hardness of environments. The goal of this paper is to unlock the practical usefulness of this theory through four main contributions. First, we present a systematic survey of the theory of hardness, which also identifies promising research directions. Second, we introduce Colosseum, a pioneering package that enables empirical hardness analysis and implements a principled benchmark composed of environments that are diverse with respect to different measures of hardness. Third, we present an empirical analysis that provides new insights into computable measures. Finally, we evaluate five tabular agents on our newly proposed benchmark. While advancing the theoretical understanding of hardness in non-tabular reinforcement learning remains essential, our contributions in the tabular setting are intended as solid steps towards a principled non-tabular benchmark. Accordingly, we benchmark four agents in non-tabular versions of Colosseum environments, obtaining results that demonstrate the generality of tabular hardness measures.

Recurrent Neural-Linear Posterior Sampling for Non-Stationary Contextual Bandits

Jul 09, 2020

Aditya Ramesh, Paulo Rauber, Jürgen Schmidhuber

Abstract: An agent in a non-stationary contextual bandit problem should balance exploration against the exploitation of (periodic or structured) patterns present in its previous experiences. Handcrafting an appropriate historical context is an attractive way to transform a non-stationary problem into a stationary problem that can be solved efficiently. However, even a carefully designed historical context may introduce spurious relationships or lack a convenient representation of crucial information. In order to address these issues, we propose an approach that learns to represent the relevant context for a decision based solely on the raw history of interactions between the agent and the environment. This approach combines features extracted by recurrent neural networks with a contextual linear bandit algorithm based on posterior sampling. Our experiments on a diverse selection of contextual and non-contextual non-stationary problems show that our recurrent approach consistently outperforms its feedforward counterpart, which requires handcrafted historical contexts, while being more widely applicable than conventional non-stationary bandit algorithms.

Hindsight policy gradients

Feb 20, 2019

Paulo Rauber, Avinash Ummadisingu, Filipe Mutz, Juergen Schmidhuber

Abstract: A reinforcement learning agent that needs to pursue different goals across episodes requires a goal-conditional policy. In addition to their potential to generalize desirable behavior to unseen goals, such policies may also enable higher-level planning based on subgoals. In sparse-reward environments, the capacity to exploit information about the degree to which an arbitrary goal has been achieved while another goal was intended appears crucial to enable sample-efficient learning. However, reinforcement learning agents have only recently been endowed with such capacity for hindsight. In this paper, we demonstrate how hindsight can be introduced to policy gradient methods, generalizing this idea to a broad class of successful algorithms. Our experiments on a diverse selection of sparse-reward environments show that hindsight leads to a remarkable increase in sample efficiency.

* Accepted to ICLR 2019
