Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Matteo Pirotta

Improved Sample Complexity for Incremental Autonomous Exploration in MDPs

Dec 29, 2020

Jean Tarbouriech, Matteo Pirotta, Michal Valko, Alessandro Lazaric

Figure 1 for Improved Sample Complexity for Incremental Autonomous Exploration in MDPs

Figure 2 for Improved Sample Complexity for Incremental Autonomous Exploration in MDPs

Figure 3 for Improved Sample Complexity for Incremental Autonomous Exploration in MDPs

Figure 4 for Improved Sample Complexity for Incremental Autonomous Exploration in MDPs

Abstract:We investigate the exploration of an unknown environment when no reward function is provided. Building on the incremental exploration setting introduced by Lim and Auer [1], we define the objective of learning the set of $\epsilon$-optimal goal-conditioned policies attaining all states that are incrementally reachable within $L$ steps (in expectation) from a reference state $s_0$. In this paper, we introduce a novel model-based approach that interleaves discovering new states from $s_0$ and improving the accuracy of a model estimate that is used to compute goal-conditioned policies to reach newly discovered states. The resulting algorithm, DisCo, achieves a sample complexity scaling as $\tilde{O}(L^5 S_{L+\epsilon} \Gamma_{L+\epsilon} A \epsilon^{-2})$, where $A$ is the number of actions, $S_{L+\epsilon}$ is the number of states that are incrementally reachable from $s_0$ in $L+\epsilon$ steps, and $\Gamma_{L+\epsilon}$ is the branching factor of the dynamics over such states. This improves over the algorithm proposed in [1] in both $\epsilon$ and $L$ at the cost of an extra $\Gamma_{L+\epsilon}$ factor, which is small in most environments of interest. Furthermore, DisCo is the first algorithm that can return an $\epsilon/c_{\min}$-optimal policy for any cost-sensitive shortest-path problem defined on the $L$-reachable states with minimum cost $c_{\min}$. Finally, we report preliminary empirical results confirming our theoretical findings.

* NeurIPS 2020

Via

Access Paper or Ask Questions

An Asymptotically Optimal Primal-Dual Incremental Algorithm for Contextual Linear Bandits

Oct 23, 2020

Andrea Tirinzoni, Matteo Pirotta, Marcello Restelli, Alessandro Lazaric

Figure 1 for An Asymptotically Optimal Primal-Dual Incremental Algorithm for Contextual Linear Bandits

Figure 2 for An Asymptotically Optimal Primal-Dual Incremental Algorithm for Contextual Linear Bandits

Figure 3 for An Asymptotically Optimal Primal-Dual Incremental Algorithm for Contextual Linear Bandits

Figure 4 for An Asymptotically Optimal Primal-Dual Incremental Algorithm for Contextual Linear Bandits

Abstract:In the contextual linear bandit setting, algorithms built on the optimism principle fail to exploit the structure of the problem and have been shown to be asymptotically suboptimal. In this paper, we follow recent approaches of deriving asymptotically optimal algorithms from problem-dependent regret lower bounds and we introduce a novel algorithm improving over the state-of-the-art along multiple dimensions. We build on a reformulation of the lower bound, where context distribution and exploration policy are decoupled, and we obtain an algorithm robust to unbalanced context distributions. Then, using an incremental primal-dual approach to solve the Lagrangian relaxation of the lower bound, we obtain a scalable and computationally efficient algorithm. Finally, we remove forced exploration and build on confidence intervals of the optimization problem to encourage a minimum level of exploration that is better adapted to the problem structure. We demonstrate the asymptotic optimality of our algorithm, while providing both problem-dependent and worst-case finite-time regret guarantees. Our bounds scale with the logarithm of the number of arms, thus avoiding the linear dependence common in all related prior works. Notably, we establish minimax optimality for any learning horizon in the special case of non-contextual linear bandits. Finally, we verify that our algorithm obtains better empirical performance than state-of-the-art baselines.

* To appear at NeurIPS 2020

Via

Access Paper or Ask Questions

Local Differentially Private Regret Minimization in Reinforcement Learning

Oct 15, 2020

Evrard Garcelon, Vianney Perchet, Ciara Pike-Burke, Matteo Pirotta

Figure 1 for Local Differentially Private Regret Minimization in Reinforcement Learning

Figure 2 for Local Differentially Private Regret Minimization in Reinforcement Learning

Figure 3 for Local Differentially Private Regret Minimization in Reinforcement Learning

Figure 4 for Local Differentially Private Regret Minimization in Reinforcement Learning

Abstract:Reinforcement learning algorithms are widely used in domains where it is desirable to provide a personalized service. In these domains it is common that user data contains sensitive information that needs to be protected from third parties. Motivated by this, we study privacy in the context of finite-horizon Markov Decision Processes (MDPs) by requiring information to be obfuscated on the user side. We formulate this notion of privacy for RL by leveraging the local differential privacy (LDP) framework. We present an optimistic algorithm that simultaneously satisfies LDP requirements, and achieves sublinear regret. We also establish a lower bound for regret minimization in finite-horizon MDPs with LDP guarantees. These results show that while LDP is appealing in practical applications, the setting is inherently more complex. In particular, our results demonstrate that the cost of privacy is multiplicative when compared to non-private settings.

Via

Access Paper or Ask Questions

A Provably Efficient Sample Collection Strategy for Reinforcement Learning

Jul 13, 2020

Jean Tarbouriech, Matteo Pirotta, Michal Valko, Alessandro Lazaric

Figure 1 for A Provably Efficient Sample Collection Strategy for Reinforcement Learning

Figure 2 for A Provably Efficient Sample Collection Strategy for Reinforcement Learning

Figure 3 for A Provably Efficient Sample Collection Strategy for Reinforcement Learning

Figure 4 for A Provably Efficient Sample Collection Strategy for Reinforcement Learning

Abstract:A common assumption in reinforcement learning (RL) is to have access to a generative model (i.e., a simulator of the environment), which allows to generate samples from any desired state-action pair. Nonetheless, in many settings a generative model may not be available and an adaptive exploration strategy is needed to efficiently collect samples from an unknown environment by direct interaction. In this paper, we study the scenario where an algorithm based on the generative model assumption defines the (possibly time-varying) amount of samples $b(s,a)$ required at each state-action pair $(s,a)$ and an exploration strategy has to learn how to generate $b(s,a)$ samples as fast as possible. Building on recent results for regret minimization in the stochastic shortest path (SSP) setting (Cohen et al., 2020; Tarbouriech et al., 2020), we derive an algorithm that requires $\tilde{O}( B D + D^{3/2} S^2 A)$ time steps to collect the $B = \sum_{s,a} b(s,a)$ desired samples, in any unknown and communicating MDP with $S$ states, $A$ actions and diameter $D$. Leveraging the generality of our strategy, we readily apply it to a variety of existing settings (e.g., model estimation, pure exploration in MDPs) for which we obtain improved sample-complexity guarantees, and to a set of new problems such as best-state identification and sparse reward discovery.

Via

Access Paper or Ask Questions

Improved Analysis of UCRL2 with Empirical Bernstein Inequality

Jul 10, 2020

Ronan Fruit, Matteo Pirotta, Alessandro Lazaric

Figure 1 for Improved Analysis of UCRL2 with Empirical Bernstein Inequality

Abstract:We consider the problem of exploration-exploitation in communicating Markov Decision Processes. We provide an analysis of UCRL2 with Empirical Bernstein inequalities (UCRL2B). For any MDP with $S$ states, $A$ actions, $\Gamma \leq S$ next states and diameter $D$, the regret of UCRL2B is bounded as $\widetilde{O}(\sqrt{D\Gamma S A T})$.

* Document in support of the tutorial at ALT 2019

Via

Access Paper or Ask Questions

A Kernel-Based Approach to Non-Stationary Reinforcement Learning in Metric Spaces

Jul 09, 2020

Omar Darwiche Domingues, Pierre Ménard, Matteo Pirotta, Emilie Kaufmann, Michal Valko

Figure 1 for A Kernel-Based Approach to Non-Stationary Reinforcement Learning in Metric Spaces

Figure 2 for A Kernel-Based Approach to Non-Stationary Reinforcement Learning in Metric Spaces

Figure 3 for A Kernel-Based Approach to Non-Stationary Reinforcement Learning in Metric Spaces

Figure 4 for A Kernel-Based Approach to Non-Stationary Reinforcement Learning in Metric Spaces

Abstract:In this work, we propose KeRNS: an algorithm for episodic reinforcement learning in non-stationary Markov Decision Processes (MDPs) whose state-action set is endowed with a metric. Using a non-parametric model of the MDP built with time-dependent kernels, we prove a regret bound that scales with the covering dimension of the state-action space and the total variation of the MDP with time, which quantifies its level of non-stationarity. Our method generalizes previous approaches based on sliding windows and exponential discounting used to handle changing environments. We further propose a practical implementation of KeRNS, we analyze its regret and validate it experimentally.

Via

Access Paper or Ask Questions

Learning Adaptive Exploration Strategies in Dynamic Environments Through Informed Policy Regularization

May 06, 2020

Pierre-Alexandre Kamienny, Matteo Pirotta, Alessandro Lazaric, Thibault Lavril, Nicolas Usunier, Ludovic Denoyer

Figure 1 for Learning Adaptive Exploration Strategies in Dynamic Environments Through Informed Policy Regularization

Figure 2 for Learning Adaptive Exploration Strategies in Dynamic Environments Through Informed Policy Regularization

Figure 3 for Learning Adaptive Exploration Strategies in Dynamic Environments Through Informed Policy Regularization

Figure 4 for Learning Adaptive Exploration Strategies in Dynamic Environments Through Informed Policy Regularization

Abstract:We study the problem of learning exploration-exploitation strategies that effectively adapt to dynamic environments, where the task may change over time. While RNN-based policies could in principle represent such strategies, in practice their training time is prohibitive and the learning process often converges to poor solutions. In this paper, we consider the case where the agent has access to a description of the task (e.g., a task id or task parameters) at training time, but not at test time. We propose a novel algorithm that regularizes the training of an RNN-based policy using informed policies trained to maximize the reward in each task. This dramatically reduces the sample complexity of training RNN-based policies, without losing their representational power. As a result, our method learns exploration strategies that efficiently balance between gathering information about the unknown and changing task and maximizing the reward over time. We test the performance of our algorithm in a variety of environments where tasks may vary within each episode.

* 18 pages

Via

Access Paper or Ask Questions

Regret Bounds for Kernel-Based Reinforcement Learning

Apr 12, 2020

Omar Darwiche Domingues, Pierre Ménard, Matteo Pirotta, Emilie Kaufmann, Michal Valko

Figure 1 for Regret Bounds for Kernel-Based Reinforcement Learning

Figure 2 for Regret Bounds for Kernel-Based Reinforcement Learning

Figure 3 for Regret Bounds for Kernel-Based Reinforcement Learning

Abstract:We consider the exploration-exploitation dilemma in finite-horizon reinforcement learning problems whose state-action space is endowed with a metric. We introduce Kernel-UCBVI, a model-based optimistic algorithm that leverages the smoothness of the MDP and a non-parametric kernel estimator of the rewards and transitions to efficiently balance exploration and exploitation. Unlike existing approaches with regret guarantees, it does not use any kind of partitioning of the state-action space. For problems with $K$ episodes and horizon $H$, we provide a regret bound of $O\left( H^3 K^{\max\left(\frac{1}{2}, \frac{2d}{2d+1}\right)}\right)$, where $d$ is the covering dimension of the joint state-action space. We empirically validate Kernel-UCBVI on discrete and continuous MDPs.

Via

Access Paper or Ask Questions

Active Model Estimation in Markov Decision Processes

Mar 06, 2020

Jean Tarbouriech, Shubhanshu Shekhar, Matteo Pirotta, Mohammad Ghavamzadeh, Alessandro Lazaric

Figure 1 for Active Model Estimation in Markov Decision Processes

Figure 2 for Active Model Estimation in Markov Decision Processes

Figure 3 for Active Model Estimation in Markov Decision Processes

Figure 4 for Active Model Estimation in Markov Decision Processes

Abstract:We study the problem of efficient exploration in order to learn an accurate model of an environment, modeled as a Markov decision process (MDP). Efficient exploration in this problem requires the agent to identify the regions in which estimating the model is more difficult and then exploit this knowledge to collect more samples there. In this paper, we formalize this problem, introduce the first algorithm to learn an $\epsilon$-accurate estimate of the dynamics, and provide its sample complexity analysis. While this algorithm enjoys strong guarantees in the large-sample regime, it tends to have a poor performance in early stages of exploration. To address this issue, we propose an algorithm that is based on maximum weighted entropy, a heuristic that stems from common sense and our theoretical analysis. The main idea here is cover the entire state-action space with the weight proportional to the noise in the transitions. Using a number of simple domains with heterogeneous noise in their transitions, we show that our heuristic-based algorithm outperforms both our original algorithm and the maximum entropy algorithm in the small sample regime, while achieving similar asymptotic performance as that of the original algorithm.

Via

Access Paper or Ask Questions

Exploration-Exploitation in Constrained MDPs

Mar 04, 2020

Yonathan Efroni, Shie Mannor, Matteo Pirotta

Figure 1 for Exploration-Exploitation in Constrained MDPs

Abstract:In many sequential decision-making problems, the goal is to optimize a utility function while satisfying a set of constraints on different utilities. This learning problem is formalized through Constrained Markov Decision Processes (CMDPs). In this paper, we investigate the exploration-exploitation dilemma in CMDPs. While learning in an unknown CMDP, an agent should trade-off exploration to discover new information about the MDP, and exploitation of the current knowledge to maximize the reward while satisfying the constraints. While the agent will eventually learn a good or optimal policy, we do not want the agent to violate the constraints too often during the learning process. In this work, we analyze two approaches for learning in CMDPs. The first approach leverages the linear formulation of CMDP to perform optimistic planning at each episode. The second approach leverages the dual formulation (or saddle-point formulation) of CMDP to perform incremental, optimistic updates of the primal and dual variables. We show that both achieves sublinear regret w.r.t.\ the main utility while having a sublinear regret on the constraint violations. That being said, we highlight a crucial difference between the two approaches; the linear programming approach results in stronger guarantees than in the dual formulation based approach.

Via

Access Paper or Ask Questions