Title: Corrupted Contextual Bandits with Action Order Constraints

Nov 16, 2020

Alexander Galozy, Slawomir Nowaczyk, Mattias Ohlsson

Abstract: We consider a variant of the novel contextual bandit problem with corrupted context, which we call the contextual bandit problem with corrupted context and action correlation, where actions exhibit a relationship structure that can be exploited to guide the exploration of viable next decisions. Our setting is primarily motivated by adaptive mobile health interventions and related applications, where users might transition through different stages that require more targeted action selection. In such settings, maintaining user engagement is paramount for the success of interventions, so it is vital to provide relevant recommendations in a timely manner. The context provided by users might not be informative at every decision point, in which case standard contextual approaches to action selection incur high regret. We propose a meta-algorithm that uses a referee to dynamically combine the policies of a contextual bandit and a multi-armed bandit, similar to previous work, as well as a simple correlation mechanism that captures action-to-action transition probabilities, allowing for more efficient exploration of time-correlated actions. We empirically evaluate the performance of this algorithm on a simulation where the sequence of best actions is determined by a hidden state that evolves in a Markovian manner. We show that the proposed meta-algorithm improves regret in situations where the performance of the two policies varies such that one is strictly superior to the other for a given time period. To demonstrate the practical applicability of our setting, we evaluate our method on several real-world data sets, showing better empirical performance than a set of simple algorithms.
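The two ingredients named in the abstract, a referee that arbitrates between base policies and a mechanism that tracks action-to-action transition probabilities, can be illustrated with a minimal sketch. This is not the authors' algorithm: the class names `TransitionModel` and `Referee`, the Laplace-smoothed counts, and the epsilon-greedy selection rule are all assumptions chosen for illustration, since the abstract does not specify how either component is implemented.

```python
import random


class TransitionModel:
    """Hypothetical helper: smoothed estimate of P(next action | previous action)."""

    def __init__(self, n_actions, prior=1.0):
        # Start every transition with the same pseudo-count (Laplace smoothing).
        self.counts = [[prior] * n_actions for _ in range(n_actions)]

    def update(self, prev_action, next_action):
        # Record one observed transition between consecutive rewarded actions.
        self.counts[prev_action][next_action] += 1.0

    def probs(self, prev_action):
        # Normalize the row of counts into a probability distribution.
        row = self.counts[prev_action]
        total = sum(row)
        return [c / total for c in row]


class Referee:
    """Hypothetical referee: epsilon-greedy choice over base policies
    (e.g. index 0 = contextual bandit, index 1 = multi-armed bandit)."""

    def __init__(self, n_policies, epsilon=0.1, seed=0):
        self.means = [0.0] * n_policies   # running mean reward per policy
        self.pulls = [0] * n_policies     # how often each policy was followed
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def choose(self):
        # Explore a random policy with probability epsilon, else exploit.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.means))
        return max(range(len(self.means)), key=self.means.__getitem__)

    def update(self, policy, reward):
        # Incremental update of the followed policy's mean reward.
        self.pulls[policy] += 1
        self.means[policy] += (reward - self.means[policy]) / self.pulls[policy]


# Usage: after two observed 0 -> 1 transitions, action 1 becomes the
# most likely successor of action 0 under the smoothed estimate.
model = TransitionModel(n_actions=3)
model.update(0, 1)
model.update(0, 1)
next_probs = model.probs(0)

# Usage: a referee without exploration follows whichever policy has
# earned the higher mean reward so far.
referee = Referee(n_policies=2, epsilon=0.0)
referee.update(1, 1.0)
chosen = referee.choose()
```

In a sketch like this, the transition probabilities could be used to bias exploration toward actions that tend to follow the last successful one, while the referee shifts weight toward the multi-armed bandit whenever the context stops being informative.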
