Shane Legg

The Hydra Effect: Emergent Self-repair in Language Model Computations

Jul 28, 2023
Thomas McGrath, Matthew Rahtz, Janos Kramar, Vladimir Mikulik, Shane Legg

We investigate the internal structure of language model computations using causal analysis and demonstrate two motifs: (1) a form of adaptive computation where ablations of one attention layer of a language model cause another layer to compensate (which we term the Hydra effect) and (2) a counterbalancing function of late MLP layers that act to downregulate the maximum-likelihood token. Our ablation studies demonstrate that language model layers are typically relatively loosely coupled (ablations to one layer only affect a small number of downstream layers). Surprisingly, these effects occur even in language models trained without any form of dropout. We analyse these effects in the context of factual recall and consider their implications for circuit-level attribution in language models.
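To make the ablation methodology more concrete, the following is a minimal sketch of a layer-ablation harness of the kind described above, written against a self-contained toy residual-stream model rather than a real language model; the model, names, and metric are illustrative assumptions, not the paper's code.

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, n_layers, vocab = 16, 6, 50

    # Toy "model": each layer reads the residual stream and writes a contribution back to it.
    W = [rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(n_layers)]
    W_unembed = rng.normal(size=(d_model, vocab)) / np.sqrt(d_model)
    x0 = rng.normal(size=d_model)   # toy embedding of the prompt's final token
    target = 7                      # token whose logit we track

    def run(ablate_layer=None):
        """Run the toy model, optionally zero-ablating one layer's output.

        Returns each layer's direct effect on the target logit, i.e. the
        contribution its residual-stream write makes to that logit.
        """
        resid = x0.copy()
        direct_effects = []
        for i in range(n_layers):
            contribution = np.tanh(W[i] @ resid)             # stand-in for an attention/MLP layer
            if i == ablate_layer:
                contribution = np.zeros_like(contribution)   # zero ablation
            direct_effects.append(contribution @ W_unembed[:, target])
            resid = resid + contribution                     # residual-stream update
        return np.array(direct_effects)

    clean = run()
    ablated = run(ablate_layer=2)
    # Self-repair would show up as downstream layers' direct effects shifting
    # to make up for the ablated layer's lost contribution.
    print("change in per-layer direct effects:", ablated - clean)

In the paper this kind of comparison is made on a trained language model with separate attention and MLP layers; the toy version only illustrates the bookkeeping of clean versus ablated runs.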

Randomized Positional Encodings Boost Length Generalization of Transformers

May 26, 2023
Anian Ruoss, Grégoire Delétang, Tim Genewein, Jordi Grau-Moya, Róbert Csordás, Mehdi Bennani, Shane Legg, Joel Veness

Transformers have impressive generalization capabilities on tasks with a fixed context length. However, they fail to generalize to sequences of arbitrary length, even for seemingly simple tasks such as duplicating a string. Moreover, simply training on longer sequences is inefficient due to the quadratic computation complexity of the global attention mechanism. In this work, we demonstrate that this failure mode is linked to positional encodings being out-of-distribution for longer sequences (even for relative encodings) and introduce a novel family of positional encodings that can overcome this problem. Concretely, our randomized positional encoding scheme simulates the positions of longer sequences and randomly selects an ordered subset to fit the sequence's length. Our large-scale empirical evaluation of 6000 models across 15 algorithmic reasoning tasks shows that our method allows Transformers to generalize to sequences of unseen length (increasing test accuracy by 12.0% on average).
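The core sampling step is simple to illustrate; the sketch below (hypothetical names, not the authors' code) draws an ordered subset of positions from a much larger simulated range, so that the position values seen during training already cover those needed at longer test lengths.

    import numpy as np

    def randomized_positions(seq_len, max_simulated_len, rng):
        """Sample seq_len distinct positions from [0, max_simulated_len) and sort them.

        Sorting preserves ordering information, while the random draw exposes the
        model during training to position values it will encounter on longer sequences.
        """
        assert seq_len <= max_simulated_len
        positions = rng.choice(max_simulated_len, size=seq_len, replace=False)
        return np.sort(positions)

    rng = np.random.default_rng(0)
    print(randomized_positions(seq_len=8, max_simulated_len=128, rng=rng))
    # The sampled positions would then index into whatever (absolute or relative)
    # positional-encoding scheme the Transformer already uses.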

Beyond Bayes-optimality: meta-learning what you know you don't know

Oct 12, 2022
Jordi Grau-Moya, Grégoire Delétang, Markus Kunesch, Tim Genewein, Elliot Catt, Kevin Li, Anian Ruoss, Chris Cundy, Joel Veness, Jane Wang, Marcus Hutter, Christopher Summerfield, Shane Legg, Pedro Ortega

Meta-training agents with memory has been shown to culminate in Bayes-optimal agents, which casts Bayes-optimality as the implicit solution to a numerical optimization problem rather than an explicit modeling assumption. Bayes-optimal agents are risk-neutral, since they solely attune to the expected return, and ambiguity-neutral, since they act in new situations as if the uncertainty were known. This is in contrast to risk-sensitive agents, which additionally exploit the higher-order moments of the return, and ambiguity-sensitive agents, which act differently when recognizing situations in which they lack knowledge. Humans are also known to be averse to ambiguity and sensitive to risk in ways that aren't Bayes-optimal, indicating that such sensitivity can confer advantages, especially in safety-critical situations. How can we extend the meta-learning protocol to generate risk- and ambiguity-sensitive agents? The goal of this work is to fill this gap in the literature by showing that risk- and ambiguity-sensitivity also emerge as the result of an optimization problem using modified meta-training algorithms, which manipulate the experience-generation process of the learner. We empirically test our proposed meta-training algorithms on agents exposed to foundational classes of decision-making experiments and demonstrate that they become sensitive to risk and ambiguity.

* 33 pages, 8 figures, technical report 

Neural Networks and the Chomsky Hierarchy

Jul 05, 2022
Grégoire Delétang, Anian Ruoss, Jordi Grau-Moya, Tim Genewein, Li Kevin Wenliang, Elliot Catt, Marcus Hutter, Shane Legg, Pedro A. Ortega

Reliable generalization lies at the heart of safe ML and AI. However, understanding when and how neural networks generalize remains one of the most important unsolved problems in the field. In this work, we conduct an extensive empirical study (2200 models, 16 tasks) to investigate whether insights from the theory of computation can predict the limits of neural network generalization in practice. We demonstrate that grouping tasks according to the Chomsky hierarchy allows us to forecast whether certain architectures will be able to generalize to out-of-distribution inputs. This includes negative results where even extensive amounts of data and training time never led to any non-trivial generalization, despite models having sufficient capacity to perfectly fit the training data. Our results show that, for our subset of tasks, RNNs and Transformers fail to generalize on non-regular tasks, LSTMs can solve regular and counter-language tasks, and only networks augmented with structured memory (such as a stack or memory tape) can successfully generalize on context-free and context-sensitive tasks.
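As a concrete (and hypothetical) illustration of what the different levels look like as learning tasks, a regular task can be decided with finite memory, whereas string duplication cannot; the two toy generators below are examples of these kinds, not tasks from the paper's benchmark.

    import random

    def parity_example(length):
        """Regular task: label a bit string by the parity of its ones (decidable by a finite automaton)."""
        bits = [random.randint(0, 1) for _ in range(length)]
        return bits, sum(bits) % 2

    def duplication_example(length):
        """Duplication task: map a string w to ww. The language {ww} is not context-free,
        which is why architectures without structured memory tend not to length-generalize here."""
        w = [random.randint(0, 1) for _ in range(length)]
        return w, w + w

    print(parity_example(6))
    print(duplication_example(4))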

Your Policy Regularizer is Secretly an Adversary

Apr 01, 2022
Rob Brekelmans, Tim Genewein, Jordi Grau-Moya, Grégoire Delétang, Markus Kunesch, Shane Legg, Pedro Ortega

Policy regularization methods such as maximum entropy regularization are widely used in reinforcement learning to improve the robustness of a learned policy. In this paper, we show how this robustness arises from hedging against worst-case perturbations of the reward function, which are chosen from a limited set by an imagined adversary. Using convex duality, we characterize this robust set of adversarial reward perturbations under KL and alpha-divergence regularization, which includes Shannon and Tsallis entropy regularization as special cases. Importantly, generalization guarantees can be given within this robust set. We provide detailed discussion of the worst-case reward perturbations, and present intuitive empirical examples to illustrate this robustness and its relationship with generalization. Finally, we discuss how our analysis complements and extends previous results on adversarial reward robustness and path consistency optimality conditions.
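For reference, the convex-duality identity underlying the KL-regularized case is the standard log-sum-exp conjugacy shown below; this is the textbook identity rather than the paper's full characterization of the adversarial robust set.

    % For a reference policy \pi_0, inverse temperature \beta > 0 and reward r:
    \[
      \frac{1}{\beta}\,\log \sum_{a} \pi_0(a)\, e^{\beta r(a)}
      \;=\; \max_{\pi}\; \Big[\, \mathbb{E}_{\pi}[r(a)] \;-\; \tfrac{1}{\beta}\,\mathrm{KL}(\pi \,\|\, \pi_0) \Big],
      \qquad \pi^{*}(a) \propto \pi_0(a)\, e^{\beta r(a)}.
    \]
    % The adversarial reading rewrites the same regularized value as a worst case over
    % reward perturbations drawn from a convex robust set determined by the regularizer.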

* 10 pages main text; added worked example 

Safe Deep RL in 3D Environments using Human Feedback

Jan 21, 2022
Matthew Rahtz, Vikrant Varma, Ramana Kumar, Zachary Kenton, Shane Legg, Jan Leike

Agents should avoid unsafe behaviour during both training and deployment. This typically requires a simulator and a procedural specification of unsafe behaviour. Unfortunately, a simulator is not always available, and procedurally specifying constraints can be difficult or impossible for many real-world tasks. A recently introduced technique, ReQueST, aims to solve this problem by learning a neural simulator of the environment from safe human trajectories, then using the learned simulator to efficiently learn a reward model from human feedback. However, it has not yet been shown whether this approach is feasible in complex 3D environments with feedback obtained from real humans: whether a pixel-based neural simulator of sufficient quality can be learned, and whether human data of the required quantity and quality can be obtained. In this paper we answer this question in the affirmative, using ReQueST to train an agent to perform a 3D first-person object collection task using data entirely from human contractors. We show that the resulting agent exhibits an order of magnitude reduction in unsafe behaviour compared to standard reinforcement learning.

Model-Free Risk-Sensitive Reinforcement Learning

Nov 04, 2021
Grégoire Delétang, Jordi Grau-Moya, Markus Kunesch, Tim Genewein, Rob Brekelmans, Shane Legg, Pedro A. Ortega

We extend temporal-difference (TD) learning in order to obtain risk-sensitive, model-free reinforcement learning algorithms. This extension can be regarded as a modification of the Rescorla-Wagner rule, where the (sigmoidal) stimulus is taken to be either the event of over- or underestimating the TD target. As a result, one obtains a stochastic approximation rule for estimating the free energy from i.i.d. samples generated by a Gaussian distribution with unknown mean and variance. Since the Gaussian free energy is known to be a certainty-equivalent sensitive to the mean and the variance, the learning rule has applications in risk-sensitive decision-making.
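For intuition, here is a self-contained sketch of one standard stochastic-approximation rule whose fixed point is the free energy F_beta[X] = (1/beta) log E[exp(beta X)]; it illustrates the quantity being estimated, and is not claimed to be the exact Rescorla-Wagner-style rule from the report.

    import numpy as np

    rng = np.random.default_rng(0)
    beta = 0.5                      # risk parameter (sign conventions vary across the literature)
    mu, sigma = 1.0, 1.0            # unknown to the learner
    samples = rng.normal(mu, sigma, size=200_000)

    v = 0.0
    for t, x in enumerate(samples, start=1):
        alpha = 1.0 / t             # decaying (Robbins-Monro) learning rate
        # At the fixed point E[exp(beta * (x - v))] = 1, i.e.
        # v = (1/beta) log E[exp(beta * x)], the free energy.
        v += alpha * (np.exp(beta * (x - v)) - 1.0) / beta

    analytic = mu + 0.5 * beta * sigma**2   # Gaussian free energy: a certainty-equivalent of mean and variance
    print("estimate", round(v, 3), "vs analytic", round(analytic, 3))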

* DeepMind Tech Report: 13 pages, 4 figures 

Shaking the foundations: delusions in sequence models for interaction and control

Oct 20, 2021
Pedro A. Ortega, Markus Kunesch, Grégoire Delétang, Tim Genewein, Jordi Grau-Moya, Joel Veness, Jonas Buchli, Jonas Degrave, Bilal Piot, Julien Perolat, Tom Everitt, Corentin Tallec, Emilio Parisotto, Tom Erez, Yutian Chen, Scott Reed, Marcus Hutter, Nando de Freitas, Shane Legg

The recent phenomenal success of language models has reinvigorated machine learning research, and large sequence models such as transformers are being applied to a variety of domains. One important problem class that has remained relatively elusive, however, is purposeful adaptive behavior. Currently there is a common perception that sequence models "lack the understanding of the cause and effect of their actions", leading them to draw incorrect inferences due to auto-suggestive delusions. In this report we explain where this mismatch originates, and show that it can be resolved by treating actions as causal interventions. Finally, we show that in supervised learning, one can teach a system to condition or intervene on data by training with factual and counterfactual error signals, respectively.
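The difference between conditioning on an action and intervening on it can be seen in a toy confounded setting; the sketch below is an illustrative Monte Carlo example with made-up variables, not the report's formal setup.

    import numpy as np

    rng = np.random.default_rng(0)
    N = 500_000

    # Hidden confounder: the demonstrator's private knowledge of the world state.
    u = rng.integers(0, 2, size=N)
    # The demonstrator's action depends on u; the reward depends on the action and u.
    a_expert = np.where(rng.random(N) < 0.9, u, 1 - u)   # demonstrator mostly acts on u
    reward = (a_expert == u).astype(float)               # reward 1 iff the action matches the state

    # Conditioning: P(reward | a = 1), estimated from the demonstrator's logged data.
    cond = reward[a_expert == 1].mean()

    # Intervening on the action, do(a = 1): it is set exogenously and carries no information about u.
    reward_do = (np.ones(N, dtype=int) == u).astype(float)
    interv = reward_do.mean()

    print("E[reward | a=1]     ~", round(cond, 2))   # high: the demonstrator chose a=1 because u=1
    print("E[reward | do(a=1)] ~", round(interv, 2)) # about 0.5: choosing a=1 yourself reveals nothing about u

A sequence model that treats its own sampled actions as if they were the demonstrator's (i.e. conditions rather than intervenes) would wrongly infer that acting makes the favourable state more likely, which is the auto-suggestive delusion described above.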

* DeepMind Tech Report, 16 pages, 4 figures 

Causal Analysis of Agent Behavior for AI Safety

Mar 05, 2021
Grégoire Delétang, Jordi Grau-Moya, Miljan Martic, Tim Genewein, Tom McGrath, Vladimir Mikulik, Markus Kunesch, Shane Legg, Pedro A. Ortega

As machine learning systems become more powerful, they also become increasingly unpredictable and opaque. Yet, finding human-understandable explanations of how they work is essential for their safe deployment. This technical report illustrates a methodology for investigating the causal mechanisms that drive the behaviour of artificial agents. Six use cases are covered, each addressing a typical question an analyst might ask about an agent. In particular, we show that none of these questions can be answered by observation alone; each requires experiments with systematically chosen manipulations in order to generate the correct causal evidence.

* 16 pages, 16 figures, 6 tables 

Agent Incentives: A Causal Perspective

Feb 02, 2021
Tom Everitt, Ryan Carey, Eric Langlois, Pedro A Ortega, Shane Legg

We present a framework for analysing agent incentives using causal influence diagrams. We establish that a well-known criterion for value of information is complete. We propose a new graphical criterion for value of control, establishing its soundness and completeness. We also introduce two new concepts for incentive analysis: response incentives indicate which changes in the environment affect an optimal decision, while instrumental control incentives establish whether an agent can influence its utility via a variable X. For both new concepts, we provide sound and complete graphical criteria. We show by example how these results can help with evaluating the safety and fairness of an AI system.
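To illustrate how such graphical criteria can be checked mechanically, the sketch below encodes a small causal influence diagram as a plain directed graph and tests a directed-path condition of the kind used for instrumental control incentives (a directed path from the decision through X to the utility); the example diagram and helper are hypothetical, and the paper should be consulted for the precise criteria.

    import networkx as nx

    # A small causal influence diagram as a DAG. By convention here, "D" is the
    # decision node, "U" is the utility node, and the rest are chance nodes.
    G = nx.DiGraph([
        ("State", "D"),      # the agent observes the state
        ("D", "Content"),    # the decision influences the content shown
        ("Content", "U"),    # the content influences the utility
        ("State", "U"),      # the state also influences the utility directly
    ])

    def on_directed_path(G, decision, node, utility):
        """True iff `node` lies on a directed path decision -> node -> utility."""
        return nx.has_path(G, decision, node) and nx.has_path(G, node, utility)

    # Can the decision influence its utility *via* each variable?
    print(on_directed_path(G, "D", "Content", "U"))   # True
    print(on_directed_path(G, "D", "State", "U"))     # False: State is upstream of the decision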

* In Proceedings of the AAAI 2021 Conference. Supersedes arXiv:1902.09980, arXiv:2001.07118 