Matteo Pirotta

Adversarial Attacks on Linear Contextual Bandits

Feb 11, 2020

Evrard Garcelon, Baptiste Roziere, Laurent Meunier, Jean Tarbouriech, Olivier Teytaud, Alessandro Lazaric, Matteo Pirotta

Abstract: Contextual bandit algorithms are applied in a wide range of domains, from advertising to recommender systems, from clinical trials to education. In many of these domains, malicious agents may have incentives to attack the bandit algorithm to induce it to perform a desired behavior. For instance, an unscrupulous ad publisher may try to increase their own revenue at the expense of the advertisers; a seller may want to increase the exposure of their products, or thwart a competitor's advertising campaign. In this paper, we study several attack scenarios and show that a malicious agent can force a linear contextual bandit algorithm to pull any desired arm $T - o(T)$ times over a horizon of $T$ steps, while applying adversarial modifications to either rewards or contexts that only grow logarithmically as $O(\log T)$. We also investigate the case when a malicious agent is interested in affecting the behavior of the bandit algorithm in a single context (e.g., a specific user). We first provide sufficient conditions for the feasibility of the attack and we then propose an efficient algorithm to perform the attack. We validate our theoretical results on experiments performed on both synthetic and real-world datasets.

Improved Algorithms for Conservative Exploration in Bandits

Feb 08, 2020

Evrard Garcelon, Mohammad Ghavamzadeh, Alessandro Lazaric, Matteo Pirotta

Abstract: In many fields such as digital marketing, healthcare, finance, and robotics, it is common to have a well-tested and reliable baseline policy running in production (e.g., a recommender system). Nonetheless, the baseline policy is often suboptimal. In this case, it is desirable to deploy online learning algorithms (e.g., a multi-armed bandit algorithm) that interact with the system to learn a better/optimal policy under the constraint that during the learning process the performance is almost never worse than the performance of the baseline itself. In this paper, we study the conservative learning problem in the contextual linear bandit setting and introduce a novel algorithm, the Conservative Constrained LinUCB (CLUCB2). We derive regret bounds for CLUCB2 that match existing results and empirically show that it outperforms state-of-the-art conservative bandit algorithms in a number of synthetic and real-world problems. Finally, we consider a more realistic constraint where the performance is verified only at predefined checkpoints (instead of at every step) and show how this relaxed constraint favorably impacts the regret and empirical performance of CLUCB2.
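
The conservative constraint described in this abstract can be illustrated with a small check. This is a minimal sketch of a generic conservative condition, not the CLUCB2 rule itself, which the abstract does not spell out. The function name, its arguments, and the tolerance `alpha` are assumptions made for illustration: the learner plays its optimistic arm only if a pessimistic estimate of its cumulative reward stays above a `(1 - alpha)` fraction of what the baseline would have collected.

```python
def is_conservative_play_safe(cum_lower_bound, candidate_lower_bound,
                              baseline_mean, t, alpha):
    """Check a generic conservative condition at step t (hypothetical helper).

    cum_lower_bound: pessimistic estimate of the reward collected in steps 1..t-1
    candidate_lower_bound: pessimistic estimate of the candidate arm's reward
    baseline_mean: expected per-step reward of the baseline (assumed known here)
    alpha: fraction of baseline performance the learner is allowed to lose
    """
    pessimistic_total = cum_lower_bound + candidate_lower_bound
    required = (1.0 - alpha) * t * baseline_mean
    return pessimistic_total >= required


# At step 5 with a baseline mean of 1.0 and alpha = 0.1, the learner needs a
# pessimistic total of at least 4.5 before it may explore.
safe = is_conservative_play_safe(4.0, 0.9, 1.0, 5, 0.1)
unsafe = is_conservative_play_safe(4.0, 0.1, 1.0, 5, 0.1)
```

When the check fails, a conservative learner of this kind would fall back to the baseline action for that step. Verifying the condition only at predefined checkpoints, as the abstract mentions, would amount to calling such a check less often.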

Conservative Exploration in Reinforcement Learning

Feb 08, 2020

Evrard Garcelon, Mohammad Ghavamzadeh, Alessandro Lazaric, Matteo Pirotta

Abstract: While learning in an unknown Markov Decision Process (MDP), an agent should trade off exploration to discover new information about the MDP, and exploitation of the current knowledge to maximize the reward. Although the agent will eventually learn a good or optimal policy, there is no guarantee on the quality of the intermediate policies. This lack of control is undesired in real-world applications where a minimum requirement is that the executed policies are guaranteed to perform at least as well as an existing baseline. In this paper, we introduce the notion of conservative exploration for average reward and finite horizon problems. We present two optimistic algorithms that guarantee (w.h.p.) that the conservative constraint is never violated during learning. We derive regret bounds showing that being conservative does not hinder the learning ability of these algorithms.

Concentration Inequalities for Multinoulli Random Variables

Jan 30, 2020

Jian Qian, Ronan Fruit, Matteo Pirotta, Alessandro Lazaric

Abstract: We investigate concentration inequalities for Dirichlet and Multinomial random variables.

* Tutorial at ALT'19 on Regret Minimization in Infinite-Horizon Finite Markov Decision Processes

No-Regret Exploration in Goal-Oriented Reinforcement Learning

Jan 30, 2020

Jean Tarbouriech, Evrard Garcelon, Michal Valko, Matteo Pirotta, Alessandro Lazaric

Abstract: Many popular reinforcement learning problems (e.g., navigation in a maze, some Atari games, mountain car) are instances of the episodic setting under its stochastic shortest path (SSP) formulation, where an agent has to achieve a goal state while minimizing the cumulative cost. Despite the popularity of this setting, the exploration-exploitation dilemma has been sparsely studied in general SSP problems, with most of the theoretical literature focusing on different problems (i.e., fixed-horizon and infinite-horizon) or making the restrictive loop-free SSP assumption (i.e., no state can be visited twice during an episode). In this paper, we study the general SSP problem with no assumption on its dynamics (some policies may actually never reach the goal). We introduce UC-SSP, the first no-regret algorithm in this setting, and prove a regret bound scaling as $\widetilde{\mathcal{O}}(D S \sqrt{A D K})$ after $K$ episodes for any unknown SSP with $S$ states, $A$ actions, positive costs and SSP-diameter $D$, defined as the smallest expected hitting time from any starting state to the goal. We achieve this result by crafting a novel stopping rule, such that UC-SSP may interrupt the current policy if it is taking too long to achieve the goal and switch to alternative policies that are designed to rapidly terminate the episode.

Exploiting Language Instructions for Interpretable and Compositional Reinforcement Learning

Jan 13, 2020

Michiel van der Meer, Matteo Pirotta, Elia Bruni

Abstract: In this work, we present an alternative approach to making an agent compositional through the use of a diagnostic classifier. Because of the need for explainable agents in automated decision processes, we attempt to interpret the latent space from an RL agent to identify its current objective in a complex language instruction. Results show that the classification process causes changes in the hidden states which makes them more easily interpretable, but also causes a shift in zero-shot performance to novel instructions. Lastly, we limit the supervisory signal on the classification, and observe a similar but less notable effect.

* 10 pages, 5 figures

Frequentist Regret Bounds for Randomized Least-Squares Value Iteration

Nov 01, 2019

Andrea Zanette, David Brandfonbrener, Matteo Pirotta, Alessandro Lazaric

Abstract: We consider the exploration-exploitation dilemma in finite-horizon reinforcement learning (RL). When the state space is large or continuous, traditional tabular approaches are unfeasible and some form of function approximation is mandatory. In this paper, we introduce an optimistically-initialized variant of the popular randomized least-squares value iteration (RLSVI), a model-free algorithm where exploration is induced by perturbing the least-squares approximation of the action-value function. Under the assumption that the Markov decision process has low-rank transition dynamics, we prove that the frequentist regret of RLSVI is upper-bounded by $\widetilde O(d^2 H^2 \sqrt{T})$ where $d$ is the feature dimension, $H$ is the horizon, and $T$ is the total number of steps. To the best of our knowledge, this is the first frequentist regret analysis for randomized exploration with function approximation.
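
The idea of inducing exploration by perturbing a least-squares fit can be sketched for a single scalar feature. This is only an illustration of the perturbation step under simplifying assumptions, not the optimistically-initialized RLSVI variant analyzed in the paper: the function name, the ridge parameter `lam`, and the noise scale `sigma` are hypothetical, and the full algorithm applies such perturbations inside value iteration over a horizon of $H$ steps.

```python
import math
import random


def perturbed_ridge_estimate(xs, ys, lam=1.0, sigma=1.0, rng=None):
    """Return a randomly perturbed 1-D ridge regression coefficient.

    The noise shrinks as more data is collected, so early estimates vary
    a lot (exploration) and later ones concentrate (exploitation).
    """
    rng = rng or random.Random()
    # Regularized second moment of the feature (scalar analogue of the design matrix).
    precision = lam + sum(x * x for x in xs)
    # Standard ridge estimate of the coefficient.
    theta_hat = sum(x * y for x, y in zip(xs, ys)) / precision
    # Gaussian perturbation with standard deviation sigma / sqrt(precision).
    noise = rng.gauss(0.0, sigma / math.sqrt(precision))
    return theta_hat + noise


# With sigma = 0 the perturbation vanishes and the plain ridge estimate is returned.
unperturbed = perturbed_ridge_estimate([1.0, 2.0], [2.0, 4.0], lam=1.0, sigma=0.0)
```

Passing a seeded `random.Random` instance as `rng` makes the draws reproducible, which is convenient when comparing exploration behavior across runs.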

Smoothing Policies and Safe Policy Gradients

May 08, 2019

Matteo Papini, Matteo Pirotta, Marcello Restelli

Abstract: Policy gradient algorithms are among the best candidates for the much anticipated application of reinforcement learning to real-world control tasks, such as the ones arising in robotics. However, the trial-and-error nature of these methods introduces safety issues whenever the learning phase itself must be performed on a physical system. In this paper, we address a specific safety formulation, where danger is encoded in the reward signal and the learning agent is constrained to never worsen its performance. By studying actor-only policy gradient from a stochastic optimization perspective, we establish improvement guarantees for a wide class of parametric policies, generalizing existing results on Gaussian policies. This, together with novel upper bounds on the variance of policy gradient estimators, allows us to identify those meta-parameter schedules that guarantee monotonic improvement with high probability. The two key meta-parameters are the step size of the parameter updates and the batch size of the gradient estimators. By a joint, adaptive selection of these meta-parameters, we obtain a safe policy gradient algorithm.

Exploration Bonus for Regret Minimization in Undiscounted Discrete and Continuous Markov Decision Processes

Dec 11, 2018

Jian Qian, Ronan Fruit, Matteo Pirotta, Alessandro Lazaric

Abstract: We introduce and analyse two algorithms for exploration-exploitation in discrete and continuous Markov Decision Processes (MDPs) based on exploration bonuses. SCAL$^+$ is a variant of SCAL (Fruit et al., 2018) that performs efficient exploration-exploitation in any unknown weakly-communicating MDP for which an upper bound $C$ on the span of the optimal bias function is known. For an MDP with $S$ states, $A$ actions and $\Gamma \leq S$ possible next states, we prove that SCAL$^+$ achieves the same theoretical guarantees as SCAL (i.e., a high probability regret bound of $\widetilde{O}(C\sqrt{\Gamma SAT})$), with a much smaller computational complexity. Similarly, C-SCAL$^+$ exploits an exploration bonus to achieve sublinear regret in any undiscounted MDP with continuous state space. We show that C-SCAL$^+$ achieves the same regret bound as UCCRL (Ortner and Ryabko, 2012) while being the first implementable algorithm with regret guarantees in this setting. While optimistic algorithms such as UCRL, SCAL or UCCRL maintain a high-confidence set of plausible MDPs around the true unknown MDP, SCAL$^+$ and C-SCAL$^+$ leverage an exploration bonus to directly plan on the empirically estimated MDP, thus being more computationally efficient.
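
To make the notion of an exploration bonus concrete, here is a minimal sketch of a generic count-based bonus added to an empirical reward estimate. It is an assumption-laden illustration rather than the bonus used by SCAL$^+$ or C-SCAL$^+$, whose exact form (including its dependence on the span bound) is not given in the abstract. The function names and the `scale` parameter are hypothetical.

```python
import math


def exploration_bonus(count, t, scale=1.0):
    """Generic count-based bonus: large for rarely visited state-action pairs."""
    n = max(1, count)           # avoid division by zero for unvisited pairs
    horizon_term = max(2, t)    # keep the logarithm strictly positive
    return scale * math.sqrt(math.log(horizon_term) / n)


def optimistic_reward(empirical_mean, count, t, scale=1.0):
    """Empirical reward inflated by the bonus, used in place of a confidence set."""
    return empirical_mean + exploration_bonus(count, t, scale)


# The bonus shrinks as the visit count grows, so planning on the empirical
# model with inflated rewards gradually approaches planning on the plain estimate.
rarely_visited = optimistic_reward(0.5, count=1, t=100)
often_visited = optimistic_reward(0.5, count=100, t=100)
```

The design point the abstract makes is that adding such a term to the rewards lets the learner plan once on the estimated MDP instead of optimizing over a whole set of plausible MDPs.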

Efficient Bias-Span-Constrained Exploration-Exploitation in Reinforcement Learning

Jul 06, 2018

Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, Ronald Ortner

Abstract: We introduce SCAL, an algorithm designed to perform efficient exploration-exploitation in any unknown weakly-communicating Markov decision process (MDP) for which an upper bound $c$ on the span of the optimal bias function is known. For an MDP with $S$ states, $A$ actions and $\Gamma \leq S$ possible next states, we prove a regret bound of $\widetilde{O}(c\sqrt{\Gamma SAT})$, which significantly improves over existing algorithms (e.g., UCRL and PSRL), whose regret scales linearly with the MDP diameter $D$. In fact, the optimal bias span is finite and often much smaller than $D$ (e.g., $D=\infty$ in non-communicating MDPs). A similar result was originally derived by Bartlett and Tewari (2009) for REGAL.C, for which no tractable algorithm is available. In this paper, we relax the optimization problem at the core of REGAL.C, we carefully analyze its properties, and we provide the first computationally efficient algorithm to solve it. Finally, we report numerical simulations supporting our theoretical findings and showing how SCAL significantly outperforms UCRL in MDPs with large diameter and small span.
