Katja Hofmann

Imitating Human Behaviour with Diffusion Models

Jan 25, 2023

Tim Pearce, Tabish Rashid, Anssi Kanervisto, Dave Bignell, Mingfei Sun, Raluca Georgescu, Sergio Valcarcel Macua, Shan Zheng Tan, Ida Momennejad, Katja Hofmann(+1 more)

Abstract:Diffusion models have emerged as powerful generative models in the text-to-image domain. This paper studies their application as observation-to-action models for imitating human behaviour in sequential environments. Human behaviour is stochastic and multimodal, with structured correlations between action dimensions. Meanwhile, standard modelling choices in behaviour cloning are limited in their expressiveness and may introduce bias into the cloned policy. We begin by pointing out the limitations of these choices. We then propose that diffusion models are an excellent fit for imitating human behaviour, since they learn an expressive distribution over the joint action space. We introduce several innovations to make diffusion models suitable for sequential environments: designing suitable architectures, investigating the role of guidance, and developing reliable sampling strategies. Experimentally, diffusion models closely match human demonstrations in a simulated robotic control task and a modern 3D gaming environment.

* ICLR 2023

UniMASK: Unified Inference in Sequential Decision Problems

Nov 20, 2022

Micah Carroll, Orr Paradise, Jessy Lin, Raluca Georgescu, Mingfei Sun, David Bignell, Stephanie Milani, Katja Hofmann, Matthew Hausknecht, Anca Dragan(+1 more)

Abstract:Randomly masking and predicting word tokens has been a successful approach in pre-training language models for a variety of downstream tasks. In this work, we observe that the same idea also applies naturally to sequential decision-making, where many well-studied tasks like behavior cloning, offline reinforcement learning, inverse dynamics, and waypoint conditioning correspond to different sequence maskings over a sequence of states, actions, and returns. We introduce the UniMASK framework, which provides a unified way to specify models which can be trained on many different sequential decision-making tasks. We show that a single UniMASK model is often capable of carrying out many tasks with performance similar to or better than single-task models. Additionally, after fine-tuning, our UniMASK models consistently outperform comparable single-task models. Our code is publicly available at https://github.com/micahcarroll/uniMASK.

* NeurIPS 2022 (Oral). A prior version was published at an ICML Workshop, available at arXiv:2204.13326
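
The idea in the abstract above, that different decision-making tasks correspond to different maskings over states, actions, and returns, can be illustrated with a small sketch. This is not the UniMASK code: the function `make_mask`, its parameters, and the exact choice of which tokens are visible for each task are assumptions made for illustration, and the paper's own masking schemes may differ.

```python
def make_mask(task, horizon, t=0, waypoint=None):
    """Build a visibility mask over a trajectory of length `horizon`.

    True means the token is given to the model as input; False means the
    token is masked out (and may be a prediction target).
    All task definitions here are illustrative assumptions.
    """
    states = [False] * horizon
    actions = [False] * horizon
    returns = [False] * horizon

    if task == "behavior_cloning":
        # See states up to t and actions before t; predict the action at t.
        for i in range(t + 1):
            states[i] = True
        for i in range(t):
            actions[i] = True
    elif task == "return_conditioned":
        # Same as behavior cloning, plus the return token at the first step.
        for i in range(t + 1):
            states[i] = True
        for i in range(t):
            actions[i] = True
        returns[0] = True
    elif task == "inverse_dynamics":
        # See two consecutive states; predict the action between them.
        states[t] = True
        states[t + 1] = True
    elif task == "waypoint":
        # See the first state and one future state; predict the actions.
        states[0] = True
        states[waypoint] = True
    else:
        raise ValueError(f"unknown task: {task}")

    return {"states": states, "actions": actions, "returns": returns}


if __name__ == "__main__":
    for name, kwargs in [
        ("behavior_cloning", {"t": 2}),
        ("return_conditioned", {"t": 2}),
        ("inverse_dynamics", {"t": 1}),
        ("waypoint", {"waypoint": 3}),
    ]:
        print(name, make_mask(name, 4, **kwargs))
```

Under these assumptions, a single sequence model trained to fill in masked tokens could be queried for any of these tasks simply by changing the mask it is given.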

Contextual Squeeze-and-Excitation for Efficient Few-Shot Image Classification

Jun 20, 2022

Massimiliano Patacchiola, John Bronskill, Aliaksandra Shysheya, Katja Hofmann, Sebastian Nowozin, Richard E. Turner

Abstract:Recent years have seen a growth in user-centric applications that require effective knowledge transfer across tasks in the low-data regime. An example is personalization, where a pretrained system is adapted by learning on small amounts of labeled data belonging to a specific user. This setting requires high accuracy under low computational complexity, so the Pareto frontier of accuracy vs. adaptation cost plays a crucial role. In this paper we push this Pareto frontier in the few-shot image classification setting with two key contributions: (i) a new adaptive block called Contextual Squeeze-and-Excitation (CaSE) that adjusts a pretrained neural network on a new task to significantly improve performance with a single forward pass of the user data (context), and (ii) a hybrid training protocol based on Coordinate-Descent called UpperCaSE that exploits meta-trained CaSE blocks and fine-tuning routines for efficient adaptation. UpperCaSE achieves a new state-of-the-art accuracy relative to meta-learners on the 26 datasets of VTAB+MD and on a challenging real-world personalization benchmark (ORBIT), narrowing the gap with leading fine-tuning methods with the benefit of orders of magnitude lower adaptation cost.

Interactively Learning Preference Constraints in Linear Bandits

Jun 10, 2022

David Lindner, Sebastian Tschiatschek, Katja Hofmann, Andreas Krause

Abstract:We study sequential decision-making with known rewards and unknown constraints, motivated by situations where the constraints represent expensive-to-evaluate human preferences, such as safe and comfortable driving behavior. We formalize the challenge of interactively learning about these constraints as a novel linear bandit problem which we call constrained linear best-arm identification. To solve this problem, we propose the Adaptive Constraint Learning (ACOL) algorithm. We provide an instance-dependent lower bound for constrained linear best-arm identification and show that ACOL's sample complexity matches the lower bound in the worst case. In the average case, ACOL's sample complexity bound is still significantly tighter than bounds of simpler approaches. In synthetic experiments, ACOL performs on par with an oracle solution and outperforms a range of baselines. As an application, we consider learning constraints to represent human preferences in a driving simulation. ACOL is significantly more sample efficient than alternatives for this application. Further, we find that learning preferences as constraints is more robust to changes in the driving scenario than encoding the preferences directly in the reward function.

* Accepted to International Conference on Machine Learning (ICML), 2022

Towards Flexible Inference in Sequential Decision Problems via Bidirectional Transformers

Apr 28, 2022

Micah Carroll, Jessy Lin, Orr Paradise, Raluca Georgescu, Mingfei Sun, David Bignell, Stephanie Milani, Katja Hofmann, Matthew Hausknecht, Anca Dragan(+1 more)

Abstract:Randomly masking and predicting word tokens has been a successful approach in pre-training language models for a variety of downstream tasks. In this work, we observe that the same idea also applies naturally to sequential decision making, where many well-studied tasks like behavior cloning, offline RL, inverse dynamics, and waypoint conditioning correspond to different sequence maskings over a sequence of states, actions, and returns. We introduce the FlexiBiT framework, which provides a unified way to specify models which can be trained on many different sequential decision making tasks. We show that a single FlexiBiT model is simultaneously capable of carrying out many tasks with performance similar to or better than specialized models. Additionally, we show that performance can be further improved by fine-tuning our general model on specific tasks of interest.

Monotonic Improvement Guarantees under Non-stationarity for Decentralized PPO

Jan 31, 2022

Mingfei Sun, Sam Devlin, Katja Hofmann, Shimon Whiteson

Abstract:We present a new monotonic improvement guarantee for optimizing decentralized policies in cooperative Multi-Agent Reinforcement Learning (MARL), which holds even when the transition dynamics are non-stationary. This new analysis provides a theoretical understanding of the strong performance of two recent actor-critic methods for MARL, i.e., Independent Proximal Policy Optimization (IPPO) and Multi-Agent PPO (MAPPO), which both rely on independent ratios, i.e., computing probability ratios separately for each agent's policy. We show that, despite the non-stationarity that independent ratios cause, a monotonic improvement guarantee still arises as a result of enforcing the trust region constraint over all decentralized policies. We also show this trust region constraint can be effectively enforced in a principled way by bounding independent ratios based on the number of agents in training, providing a theoretical foundation for proximal ratio clipping. Moreover, we show that the surrogate objectives optimized in IPPO and MAPPO are essentially equivalent when their critics converge to a fixed point. Finally, our empirical results support the hypothesis that the strong performance of IPPO and MAPPO is a direct result of enforcing such a trust region constraint via clipping in centralized training, and the good values of the hyperparameters for this enforcement are highly sensitive to the number of agents, as predicted by our theoretical analysis.

You May Not Need Ratio Clipping in PPO

Jan 31, 2022

Mingfei Sun, Vitaly Kurin, Guoqing Liu, Sam Devlin, Tao Qin, Katja Hofmann, Shimon Whiteson

Abstract:Proximal Policy Optimization (PPO) methods learn a policy by iteratively performing multiple mini-batch optimization epochs of a surrogate objective with one set of sampled data. Ratio clipping PPO is a popular variant that clips the probability ratios between the target policy and the policy used to collect samples. Ratio clipping yields a pessimistic estimate of the original surrogate objective, and has been shown to be crucial for strong performance. We show in this paper that such ratio clipping may not be a good option as it can fail to effectively bound the ratios. Instead, one can directly optimize the original surrogate objective for multiple epochs; the key is to find a proper condition to stop the optimization epoch early in each iteration. Our theoretical analysis sheds light on how to determine when to stop the optimization epoch, and we call the resulting algorithm Early Stopping Policy Optimization (ESPO). We compare ESPO with PPO across many continuous control tasks and show that ESPO significantly outperforms PPO. Furthermore, we show that ESPO can be easily scaled up to distributed training with many workers, delivering strong performance as well.
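
To make the abstract's claim concrete, the sketch below compares the original surrogate objective with the standard PPO clipped surrogate on a handful of made-up probability ratios and advantages, showing that the clipped value is the more pessimistic of the two. The numbers are invented for illustration. The helper `mean_ratio_deviation` and the idea of stopping once it exceeds a threshold are assumptions about what an early-stopping condition could look like; the abstract does not state the actual ESPO criterion.

```python
def clip(x, lo, hi):
    """Clamp x to the interval [lo, hi]."""
    return max(lo, min(hi, x))


def surrogate(ratios, advantages):
    """Original (unclipped) surrogate: mean of ratio * advantage."""
    return sum(r * a for r, a in zip(ratios, advantages)) / len(ratios)


def clipped_surrogate(ratios, advantages, eps=0.2):
    """Standard PPO clipped surrogate: mean of min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    terms = [
        min(r * a, clip(r, 1.0 - eps, 1.0 + eps) * a)
        for r, a in zip(ratios, advantages)
    ]
    return sum(terms) / len(terms)


def mean_ratio_deviation(ratios):
    """Hypothetical stopping statistic: average distance of the ratios from 1."""
    return sum(abs(r - 1.0) for r in ratios) / len(ratios)


# Made-up ratios and advantages, for illustration only.
ratios = [0.5, 1.0, 1.5, 2.0]
advantages = [1.0, -1.0, 1.0, -1.0]

if __name__ == "__main__":
    print("unclipped surrogate:", surrogate(ratios, advantages))
    print("clipped surrogate:  ", clipped_surrogate(ratios, advantages))
    print("mean |ratio - 1|:   ", mean_ratio_deviation(ratios))
```

Note that clipping changes only the objective value: the ratios in the example still range from 0.5 to 2.0, which is consistent with the abstract's point that clipping by itself does not bound them.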

Deterministic and Discriminative Imitation (D2-Imitation): Revisiting Adversarial Imitation for Sample Efficiency

Dec 11, 2021

Mingfei Sun, Sam Devlin, Katja Hofmann, Shimon Whiteson

Abstract:Sample efficiency is crucial for imitation learning methods to be applicable in real-world applications. Many studies improve sample efficiency by extending adversarial imitation to be off-policy, regardless of the fact that these off-policy extensions could either change the original objective or involve complicated optimization. We revisit the foundation of adversarial imitation and propose an off-policy, sample-efficient approach that requires no adversarial training or min-max optimization. Our formulation capitalizes on two key insights: (1) the similarity between the Bellman equation and the stationary state-action distribution equation allows us to derive a novel temporal difference (TD) learning approach; and (2) the use of a deterministic policy simplifies the TD learning. Combined, these insights yield a practical algorithm, Deterministic and Discriminative Imitation (D2-Imitation), which operates by first partitioning samples into two replay buffers and then learning a deterministic policy via off-policy reinforcement learning. Our empirical results show that D2-Imitation is effective in achieving good sample efficiency, outperforming several off-policy extension approaches of adversarial imitation on many control tasks.

* AAAI 2022

NeurIPS 2021 Competition IGLU: Interactive Grounded Language Understanding in a Collaborative Environment

Oct 15, 2021

Julia Kiseleva, Ziming Li, Mohammad Aliannejadi, Shrestha Mohanty, Maartje ter Hoeve, Mikhail Burtsev, Alexey Skrynnik, Artem Zholus, Aleksandr Panov, Kavya Srinet(+5 more)

Abstract:Human intelligence has the remarkable ability to adapt to new tasks and environments quickly. Starting from a very young age, humans acquire new skills and learn how to solve new tasks either by imitating the behavior of others or by following provided natural language instructions. To facilitate research in this direction, we propose IGLU: Interactive Grounded Language Understanding in a Collaborative Environment. The primary goal of the competition is to approach the problem of how to build interactive agents that learn to solve a task while provided with grounded natural language instructions in a collaborative environment. Understanding the complexity of the challenge, we split it into sub-tasks to make it feasible for participants. This research challenge is naturally related, but not limited, to two fields of study that are highly relevant to the NeurIPS community: Natural Language Understanding and Generation (NLU/G) and Reinforcement Learning (RL). Therefore, the suggested challenge can bring two communities together to approach one of the important challenges in AI. Another important aspect of the challenge is the dedication to perform a human-in-the-loop evaluation as a final evaluation for the agents developed by contestants.

Strategically Efficient Exploration in Competitive Multi-agent Reinforcement Learning

Jul 30, 2021

Robert Loftin, Aadirupa Saha, Sam Devlin, Katja Hofmann

Abstract:High sample complexity remains a barrier to the application of reinforcement learning (RL), particularly in multi-agent systems. A large body of work has demonstrated that exploration mechanisms based on the principle of optimism under uncertainty can significantly improve the sample efficiency of RL in single agent tasks. This work seeks to understand the role of optimistic exploration in non-cooperative multi-agent settings. We will show that, in zero-sum games, optimistic exploration can cause the learner to waste time sampling parts of the state space that are irrelevant to strategic play, as they can only be reached through cooperation between both players. To address this issue, we introduce a formal notion of strategically efficient exploration in Markov games, and use this to develop two strategically efficient learning algorithms for finite Markov games. We demonstrate that these methods can be significantly more sample efficient than their optimistic counterparts.

* To Appear in Uncertainty in Artificial Intelligence (UAI) 2021. 10 figures, 14 pages
