Noam Brown

Tony

Learned Belief Search: Efficiently Improving Policies in Partially Observable Settings

Jun 16, 2021

Hengyuan Hu, Adam Lerer, Noam Brown, Jakob Foerster

Abstract:Search is an important tool for computing effective policies in single- and multi-agent environments, and has been crucial for achieving superhuman performance in several benchmark fully and partially observable games. However, one major limitation of prior search approaches for partially observable environments is that the computational cost scales poorly with the amount of hidden information. In this paper we present Learned Belief Search (LBS), a computationally efficient search procedure for partially observable environments. Rather than maintaining an exact belief distribution, LBS uses an approximate auto-regressive counterfactual belief that is learned as a supervised task. In multi-agent settings, LBS uses a novel public-private model architecture for underlying policies in order to efficiently evaluate these policies during rollouts. In the benchmark domain of Hanabi, LBS can obtain 55% to 91% of the benefit of exact search while reducing compute requirements by $35.8\times$ to $4.6\times$, allowing it to scale to larger settings that were inaccessible to previous search methods.
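To illustrate what an auto-regressive belief means in the abstract above, here is a minimal sketch that samples a hidden hand one card at a time, each card conditioned on the cards already sampled, i.e. $p(h) = \prod_i p(h_i \mid h_{<i})$. The names `sample_hand` and `make_counting_model` are hypothetical, and the counting model is only a stand-in for a learned network; it is not the paper's architecture.

```python
import random
from collections import Counter

def sample_hand(cond_probs, hand_size, rng):
    """Sample a hand card by card from p(h_i | h_<i)."""
    hand = []
    for _ in range(hand_size):
        # cond_probs maps the prefix sampled so far to {card: probability}.
        probs = cond_probs(tuple(hand))
        cards = list(probs)
        weights = [probs[c] for c in cards]
        hand.append(rng.choices(cards, weights=weights, k=1)[0])
    return hand

def make_counting_model(remaining):
    """Toy conditional model (assumption): uniform over the unseen cards."""
    def cond_probs(prefix):
        left = Counter(remaining)
        left.subtract(prefix)  # remove cards already placed in the hand
        total = sum(left.values())
        return {card: n / total for card, n in left.items() if n > 0}
    return cond_probs

remaining = {"R1": 3, "R2": 2, "B1": 1}  # hypothetical unseen-card counts
rng = random.Random(0)
hand = sample_hand(make_counting_model(remaining), hand_size=4, rng=rng)
```

Sampling from the factorized conditionals costs one model call per card, which avoids enumerating every possible hand as an exact belief would.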

Off-Belief Learning

Mar 06, 2021

Hengyuan Hu, Adam Lerer, Brandon Cui, Luis Pineda, David Wu, Noam Brown, Jakob Foerster

Abstract:The standard problem setting in Dec-POMDPs is self-play, where the goal is to find a set of policies that play optimally together. Policies learned through self-play may adopt arbitrary conventions and rely on multi-step counterfactual reasoning based on assumptions about other agents' actions and thus fail when paired with humans or independently trained agents. In contrast, no current methods can learn optimal policies that are fully grounded, i.e., do not rely on counterfactual information from observing other agents' actions. To address this, we present off-belief learning (OBL): at each time step OBL agents assume that all past actions were taken by a given, fixed policy ($\pi_0$), but that future actions will be taken by an optimal policy under these same assumptions. When $\pi_0$ is uniform random, OBL learns the optimal grounded policy. OBL can be iterated in a hierarchy, where the optimal policy from one level becomes the input to the next. This introduces counterfactual reasoning in a controlled manner. Unlike independent RL, which may converge to any equilibrium policy, OBL converges to a unique policy, making it more suitable for zero-shot coordination. OBL can be scaled to high-dimensional settings with a fictitious transition mechanism and shows strong performance in both a simple toy setting and the benchmark human-AI/zero-shot coordination problem Hanabi.

Safe Search for Stackelberg Equilibria in Extensive-Form Games

Feb 02, 2021

Chun Kai Ling, Noam Brown

Abstract:Stackelberg equilibrium is a solution concept in two-player games where the leader has commitment rights over the follower. In recent years, it has become a cornerstone of many security applications, including airport patrolling and wildlife poaching prevention. Even though many of these settings are sequential in nature, existing techniques pre-compute the entire solution ahead of time. In this paper, we present a theoretically sound and empirically effective way to apply search, which leverages extra online computation to improve a solution, to the computation of Stackelberg equilibria in general-sum games. Instead of the leader attempting to solve the full game upfront, an approximate "blueprint" solution is first computed offline and is then improved online for the particular subgames encountered in actual play. We prove that our search technique is guaranteed to perform no worse than the pre-computed blueprint strategy, and empirically demonstrate that it enables approximately solving significantly larger games compared to purely offline methods. We also show that our search operation may be cast as a smaller Stackelberg problem, making our method complementary to existing algorithms based on strategy generation.
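The value of commitment described in the abstract above can be seen in a tiny bimatrix game. The sketch below is restricted to pure leader commitments for brevity (the general concept allows mixed commitments), breaks follower ties in the leader's favour, and uses a hypothetical helper name `pure_stackelberg` with made-up payoffs; it is not the search method of the paper.

```python
def pure_stackelberg(leader_payoff, follower_payoff):
    """Return (leader_action, follower_action, leader_value) for the best
    pure commitment, assuming the follower best-responds to it."""
    best = None
    for i, (l_row, f_row) in enumerate(zip(leader_payoff, follower_payoff)):
        top = max(f_row)
        # Follower best responses to row i; ties go to the leader's benefit.
        j = max((k for k, v in enumerate(f_row) if v == top),
                key=lambda k: l_row[k])
        if best is None or l_row[j] > best[2]:
            best = (i, j, l_row[j])
    return best

# Hypothetical payoffs: rows are leader actions, columns follower actions.
leader = [[2, 4],
          [1, 3]]
follower = [[1, 0],
            [0, 1]]
commitment = pure_stackelberg(leader, follower)
```

In this game row 0 strictly dominates row 1 for the leader, so without commitment the leader plays row 0, the follower answers with column 0, and the leader earns 2. Committing to row 1 induces column 1 and earns 3.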

Human-Level Performance in No-Press Diplomacy via Equilibrium Search

Oct 06, 2020

Jonathan Gray, Adam Lerer, Anton Bakhtin, Noam Brown

Abstract:Prior AI breakthroughs in complex games have focused on either the purely adversarial or purely cooperative settings. In contrast, Diplomacy is a game of shifting alliances that involves both cooperation and competition. For this reason, Diplomacy has proven to be a formidable research challenge. In this paper we describe an agent for the no-press variant of Diplomacy that combines supervised learning on human data with one-step lookahead search via external regret minimization. External regret minimization techniques have been behind previous AI successes in adversarial games, most notably poker, but have not previously been shown to be successful in large-scale games involving cooperation. We show that our agent greatly exceeds the performance of past no-press Diplomacy bots, is unexploitable by expert humans, and achieves a rank of 23 out of 1,128 human players when playing anonymous games on a popular Diplomacy website.

Combining Deep Reinforcement Learning and Search for Imperfect-Information Games

Jul 27, 2020

Noam Brown, Anton Bakhtin, Adam Lerer, Qucheng Gong

Abstract:The combination of deep reinforcement learning and search at both training and test time is a powerful paradigm that has led to a number of successes in single-agent settings and perfect-information games, best exemplified by the success of AlphaZero. However, algorithms of this form have been unable to cope with imperfect-information games. This paper presents ReBeL, a general framework for self-play reinforcement learning and search for imperfect-information games. In the simpler setting of perfect-information games, ReBeL reduces to an algorithm similar to AlphaZero. Results show ReBeL leads to low exploitability in benchmark imperfect-information games and achieves superhuman performance in heads-up no-limit Texas hold'em poker, while using far less domain knowledge than any prior poker AI. We also prove that ReBeL converges to a Nash equilibrium in two-player zero-sum games in tabular settings.

Unlocking the Potential of Deep Counterfactual Value Networks

Jul 20, 2020

Ryan Zarick, Bryan Pellegrino, Noam Brown, Caleb Banister

Abstract:Deep counterfactual value networks combined with continual resolving provide a way to conduct depth-limited search in imperfect-information games. However, since their introduction in the DeepStack poker AI, deep counterfactual value networks have not seen widespread adoption. In this paper we introduce several improvements to deep counterfactual value networks, as well as counterfactual regret minimization, and analyze the effects of each change. We combined these improvements to create the poker AI Supremus. We show that while a reimplementation of DeepStack loses head-to-head against the strong benchmark agent Slumbot, Supremus successfully beats Slumbot by an extremely large margin and also achieves a lower exploitability than DeepStack against a local best response. Together, these results show that with our key improvements, deep counterfactual value networks can achieve state-of-the-art performance.

* 11 pages, 6 figures

DREAM: Deep Regret minimization with Advantage baselines and Model-free learning

Jun 18, 2020

Eric Steinberger, Adam Lerer, Noam Brown

Abstract:We introduce DREAM, a deep reinforcement learning algorithm that finds optimal strategies in imperfect-information games with multiple agents. Formally, DREAM converges to a Nash Equilibrium in two-player zero-sum games and to an extensive-form coarse correlated equilibrium in all other games. Our primary innovation is an effective algorithm that, in contrast to other regret-based deep learning algorithms, does not require access to a perfect simulator of the game to achieve good performance. We show that DREAM empirically achieves state-of-the-art performance among model-free algorithms in popular benchmark games, and is even competitive with algorithms that do use a perfect simulator.

Improving Policies via Search in Cooperative Partially Observable Games

Dec 05, 2019

Adam Lerer, Hengyuan Hu, Jakob Foerster, Noam Brown

Abstract:Recent superhuman results in games have largely been achieved in a variety of zero-sum settings, such as Go and Poker, in which agents need to compete against others. However, just like humans, real-world AI systems have to coordinate and communicate with other agents in cooperative partially observable environments as well. These settings commonly require participants to both interpret the actions of others and to act in a way that is informative when being interpreted. Those abilities are typically summarized as theory of mind and are seen as crucial for social interactions. In this paper we propose two different search techniques that can be applied to improve an arbitrary agreed-upon policy in a cooperative partially observable game. The first one, single-agent search, effectively converts the problem into a single-agent setting by making all but one of the agents play according to the agreed-upon policy. In contrast, in multi-agent search all agents carry out the same common-knowledge search procedure whenever doing so is computationally feasible, and fall back to playing according to the agreed-upon policy otherwise. We prove that these search procedures are theoretically guaranteed to at least maintain the original performance of the agreed-upon policy (up to a bounded approximation error). In the benchmark challenge problem of Hanabi, our search technique greatly improves the performance of every agent we tested and, when applied to a policy trained using RL, achieves a new state-of-the-art score of 24.61 / 25 in the game, compared to a previous best of 24.08 / 25.
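The single-agent search idea in the abstract above can be sketched as a Monte Carlo comparison of actions under a belief over hidden states. Everything here is a simplified assumption: `single_agent_search`, the `rollout` callable, and the toy guessing game are hypothetical, and in a real game `rollout` would play the episode out with all agents following the agreed-upon policy. Keeping the agreed-upon action unless another action has a higher estimate is one conservative choice for this sketch.

```python
import random

def single_agent_search(belief, actions, rollout, blueprint_action,
                        num_samples=2000, rng=None):
    """Pick the action with the best estimated value under the belief,
    deviating from the blueprint only when the estimate is strictly higher."""
    rng = rng or random.Random(0)
    states = list(belief)
    weights = [belief[s] for s in states]
    # The same sampled hidden states are reused for every action so that
    # the comparison between actions is less noisy.
    samples = rng.choices(states, weights=weights, k=num_samples)
    means = {a: sum(rollout(s, a) for s in samples) / num_samples
             for a in actions}
    best = max(actions, key=lambda a: means[a])
    chosen = best if means[best] > means[blueprint_action] else blueprint_action
    return chosen, means

# Toy game (assumption): reward 1 for guessing the hidden state, else 0.
belief = {0: 0.2, 1: 0.7, 2: 0.1}
rollout = lambda state, action: 1.0 if action == state else 0.0
action, estimates = single_agent_search(belief, [0, 1, 2], rollout,
                                        blueprint_action=0)
```

Because the blueprint action is always among the candidates, the chosen action never has a lower estimated value than the blueprint, which mirrors the spirit of the guarantee stated in the abstract up to sampling error.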

* AAAI 2020

Stable-Predictive Optimistic Counterfactual Regret Minimization

Feb 13, 2019

Gabriele Farina, Christian Kroer, Noam Brown, Tuomas Sandholm

Abstract:The CFR framework has been a powerful tool for solving large-scale extensive-form games in practice. However, the theoretical rate at which past CFR-based algorithms converge to the Nash equilibrium is on the order of $O(T^{-1/2})$, where $T$ is the number of iterations. In contrast, first-order methods can be used to achieve an $O(T^{-1})$ dependence on iterations, yet these methods have been less successful in practice. In this work we present the first CFR variant that breaks the square-root dependence on iterations. By combining and extending recent advances on predictive and stable regret minimizers for the matrix-game setting we show that it is possible to leverage "optimistic" regret minimizers to achieve an $O(T^{-3/4})$ convergence rate within CFR. This is achieved by introducing a new notion of stable-predictivity, and by setting the stability of each counterfactual regret minimizer relative to its location in the decision tree. Experiments show that this method is faster than the original CFR algorithm, although not as fast as newer variants, in spite of their worst-case $O(T^{-1/2})$ dependence on iterations.
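To make the three rates in the abstract above concrete, the following worked calculation inverts a bound of the form $C\,T^{-\alpha} \le \varepsilon$ for the number of iterations $T$. The constant $C$ is an assumption (real bounds have game-dependent constants that differ between methods), so the numbers only illustrate the scaling.

```latex
C\,T^{-\alpha} \le \varepsilon
\quad\Longleftrightarrow\quad
T \ge \left(\frac{C}{\varepsilon}\right)^{1/\alpha}

\alpha = \tfrac{1}{2}: \quad T \ge (C/\varepsilon)^{2}
\qquad
\alpha = \tfrac{3}{4}: \quad T \ge (C/\varepsilon)^{4/3}
\qquad
\alpha = 1: \quad T \ge C/\varepsilon

\text{With } C = 1,\ \varepsilon = 10^{-3}: \quad
T \ge 10^{6}, \qquad T \ge 10^{4}, \qquad T \ge 10^{3}
```

These are worst-case guarantees only, which is consistent with the abstract's remark that variants with a worse bound can still be faster in practice.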

Deep Counterfactual Regret Minimization

Nov 01, 2018

Noam Brown, Adam Lerer, Sam Gross, Tuomas Sandholm

Abstract:Counterfactual Regret Minimization (CFR) is the leading algorithm for solving large imperfect-information games. It iteratively traverses the game tree in order to converge to a Nash equilibrium. In order to deal with extremely large games, CFR typically uses domain-specific heuristics to simplify the target game in a process known as abstraction. This simplified game is solved with tabular CFR, and its solution is mapped back to the full game. This paper introduces Deep Counterfactual Regret Minimization (Deep CFR), a form of CFR that obviates the need for abstraction by instead using deep neural networks to approximate the behavior of CFR in the full game. We show that Deep CFR is principled and achieves strong performance in the benchmark game of heads-up no-limit Texas hold'em poker. This is the first successful use of function approximation in CFR for large games.
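As background for the abstract above, the local update that tabular CFR commonly applies at each decision point is regret matching. The sketch below runs regret matching in self-play on a small two-player zero-sum matrix game, where the average strategies approach a Nash equilibrium. It is a tabular illustration under that assumption, not Deep CFR itself, and the function names and payoff matrix are hypothetical.

```python
def regret_matching(cum_regret):
    """Play actions in proportion to their positive cumulative regret."""
    positive = [max(r, 0.0) for r in cum_regret]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    return [1.0 / len(cum_regret)] * len(cum_regret)

def solve_zero_sum(payoff, iterations=20000):
    """payoff[i][j] is the row player's payoff; the column player gets its
    negative. Returns the average strategies of both players."""
    n, m = len(payoff), len(payoff[0])
    regret_row, regret_col = [0.0] * n, [0.0] * m
    sum_row, sum_col = [0.0] * n, [0.0] * m
    for _ in range(iterations):
        x = regret_matching(regret_row)
        y = regret_matching(regret_col)
        # Expected value of each pure action against the opponent's mix.
        u_row = [sum(payoff[i][j] * y[j] for j in range(m)) for i in range(n)]
        u_col = [-sum(payoff[i][j] * x[i] for i in range(n)) for j in range(m)]
        ev_row = sum(x[i] * u_row[i] for i in range(n))
        ev_col = sum(y[j] * u_col[j] for j in range(m))
        for i in range(n):
            regret_row[i] += u_row[i] - ev_row
            sum_row[i] += x[i]
        for j in range(m):
            regret_col[j] += u_col[j] - ev_col
            sum_col[j] += y[j]
    return ([s / iterations for s in sum_row],
            [s / iterations for s in sum_col])

# Hypothetical game whose unique equilibrium mixes (0.4, 0.6) for both players.
row_avg, col_avg = solve_zero_sum([[2.0, -1.0], [-1.0, 1.0]])
```

The table of cumulative regrets is exactly what becomes infeasible in very large games; the abstract describes replacing that tabular storage with neural network approximation.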
