Stuart Armstrong

ACE and Diverse Generalization via Selective Disagreement

Sep 09, 2025

Oliver Daniels, Stuart Armstrong, Alexandre Maranhão, Mahirah Fairuz Rahman, Benjamin M. Marlin, Rebecca Gorman

Abstract: Deep neural networks are notoriously sensitive to spurious correlations - where a model learns a shortcut that fails out-of-distribution. Existing work on spurious correlations has often focused on incomplete correlations, leveraging access to labeled instances that break the correlation. But in cases where the spurious correlations are complete, the correct generalization is fundamentally underspecified. To resolve this underspecification, we propose learning a set of concepts that are consistent with training data but make distinct predictions on a subset of novel unlabeled inputs. Using a self-training approach that encourages confident and selective disagreement, our method ACE matches or outperforms existing methods on a suite of complete-spurious correlation benchmarks, while remaining robust to incomplete spurious correlations. ACE is also more configurable than prior approaches, allowing for straightforward encoding of prior knowledge and principled unsupervised model selection. In an early application to language-model alignment, we find that ACE achieves competitive performance on the measurement tampering detection benchmark without access to untrusted measurements. While still subject to important limitations, ACE represents significant progress towards overcoming underspecification.
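
The idea of heads that fit the labeled data yet confidently disagree on only a subset of unlabeled inputs can be illustrated with a toy objective. The sketch below is not the loss used in the paper: the function names, the two-head binary setup, the `disagree_fraction` parameter, and the top-k form of the penalty are all assumptions made for illustration.

```python
import numpy as np

EPS = 1e-7  # numerical floor so the logarithms stay finite


def fit_loss(p, y):
    """Binary cross-entropy of predicted probabilities p against labels y."""
    p = np.clip(np.asarray(p, dtype=float), EPS, 1 - EPS)
    y = np.asarray(y, dtype=float)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))


def selective_disagreement_penalty(p1, p2, disagree_fraction):
    """Hypothetical penalty on unlabeled inputs for two binary heads.

    It is small only when the heads confidently disagree on roughly
    `disagree_fraction` of the inputs; the remaining inputs are ignored,
    so the heads are free to agree there.
    """
    p1 = np.clip(np.asarray(p1, dtype=float), EPS, 1 - EPS)
    p2 = np.clip(np.asarray(p2, dtype=float), EPS, 1 - EPS)
    # Probability that the two heads would output different labels.
    p_disagree = p1 * (1 - p2) + (1 - p1) * p2
    k = max(1, int(round(disagree_fraction * p_disagree.size)))
    top_k = np.sort(p_disagree)[-k:]  # the k most-disagreeing inputs
    return float(np.mean(-np.log(top_k)))


def total_loss(p1_lab, p2_lab, y, p1_unl, p2_unl,
               disagree_fraction=0.5, weight=1.0):
    """Both heads fit the labels; disagreement is encouraged on unlabeled data."""
    fit = fit_loss(p1_lab, y) + fit_loss(p2_lab, y)
    penalty = selective_disagreement_penalty(p1_unl, p2_unl, disagree_fraction)
    return fit + weight * penalty
```

In this sketch, `disagree_fraction` is where prior knowledge about how often the candidate concepts should come apart on novel data would be encoded.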

CoinRun: Solving Goal Misgeneralisation

Sep 28, 2023

Stuart Armstrong, Alexandre Maranhão, Oliver Daniels-Koch, Patrick Leask, Rebecca Gorman

Abstract: Goal misgeneralisation is a key challenge in AI alignment -- the task of getting powerful Artificial Intelligences to align their goals with human intentions and human morality. In this paper, we show how the ACE (Algorithm for Concept Extrapolation) agent can solve one of the key standard challenges in goal misgeneralisation: the CoinRun challenge. It uses no new reward information in the new environment. This points to how autonomous agents could be trusted to act in human interests, even in novel and critical situations.

Concept Extrapolation: A Conceptual Primer

Jun 19, 2023

Matija Franklin, Rebecca Gorman, Hal Ashton, Stuart Armstrong

Abstract: This article is a primer on concept extrapolation - the ability to take a concept, a feature, or a goal that is defined in one context and extrapolate it safely to a more general context. Concept extrapolation aims to solve model splintering - a ubiquitous occurrence wherein the features or concepts shift as the world changes over time. Through discussing value splintering and value extrapolation the article argues that concept extrapolation is necessary for Artificial Intelligence alignment.

* Accepted at the AAMAS-23 First International Workshop on Citizen-Centric Multiagent Systems held at the 22nd International Conference on Autonomous Agents and Multiagent Systems, 6 pages

Recognising the importance of preference change: A call for a coordinated multidisciplinary research effort in the age of AI

Mar 30, 2022

Matija Franklin, Hal Ashton, Rebecca Gorman, Stuart Armstrong

Abstract: As artificial intelligence becomes more powerful and a ubiquitous presence in daily life, it is imperative to understand and manage the impact of AI systems on our lives and decisions. Modern ML systems often change user behavior (e.g. personalized recommender systems learn user preferences to deliver recommendations that change online behavior). An externality of behavior change is preference change. This article argues for the establishment of a multidisciplinary endeavor focused on understanding how AI systems change preference: Preference Science. We operationalize preference to incorporate concepts from various disciplines, outlining the importance of meta-preferences and preference-change preferences, and proposing a preliminary framework for how preferences change. We draw a distinction between preference change, permissible preference change, and outright preference manipulation. A diversity of disciplines contribute unique insights to this framework.

* The AAAI-22 Workshop on AI For Behavior Change (AI4BC 2022)
* Accepted at the AAAI-22 Workshop on AI For Behavior Change held at the Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI-22), 7 pages, 1 figure

The dangers in algorithms learning humans' values and irrationalities

Mar 01, 2022

Rebecca Gorman, Stuart Armstrong

Abstract: For an artificial intelligence (AI) to be aligned with human values (or human preferences), it must first learn those values. AI systems that are trained on human behavior risk miscategorising human irrationalities as human values -- and then optimising for these irrationalities. Simply learning human values still carries risks: AI learning them will inevitably also gain information on human irrationalities and human behaviour/policy. Both of these can be dangerous: knowing human policy allows an AI to become generically more powerful (whether it is partially aligned or not aligned at all), while learning human irrationalities allows it to exploit humans without needing to provide value in return. This paper analyses the danger in developing artificial intelligence that learns about human irrationalities and human policy, and constructs a model recommendation system with various levels of information about human biases, human policy, and human values. It concludes that, whatever the power and knowledge of the AI, it is more dangerous for it to know human irrationalities than human values. Thus it is better for the AI to learn human values directly, rather than learning human biases and then deducing values from behaviour.

Chess as a Testing Grounds for the Oracle Approach to AI Safety

Oct 06, 2020

James D. Miller, Roman Yampolskiy, Olle Haggstrom, Stuart Armstrong

Abstract: To reduce the danger of powerful super-intelligent AIs, we might make the first such AIs oracles that can only send and receive messages. This paper proposes a possibly practical means of using machine learning to create two classes of narrow AI oracles that would provide chess advice: those aligned with the player's interest, and those that want the player to lose and give deceptively bad advice. The player would be uncertain which type of oracle it was interacting with. As the oracles would be vastly more intelligent than the player in the domain of chess, experience with these oracles might help us prepare for future artificial general intelligence oracles.

Pitfalls of learning a reward function online

Apr 28, 2020

Stuart Armstrong, Jan Leike, Laurent Orseau, Shane Legg

Abstract: In some agent designs like inverse reinforcement learning an agent needs to learn its own reward function. Learning the reward function and optimising for it are typically two different processes, usually performed at different stages. We consider a continual ("one life") learning approach where the agent both learns the reward function and optimises for it at the same time. We show that this comes with a number of pitfalls, such as deliberately manipulating the learning process in one direction, refusing to learn, "learning" facts already known to the agent, and making decisions that are strictly dominated (for all relevant reward functions). We formally introduce two desirable properties: the first is 'unriggability', which prevents the agent from steering the learning process in the direction of a reward function that is easier to optimise. The second is 'uninfluenceability', whereby the reward-function learning process operates by learning facts about the environment. We show that an uninfluenceable process is automatically unriggable, and if the set of possible environments is sufficiently rich, the converse is true too.

Occam's razor is insufficient to infer the preferences of irrational agents

Oct 29, 2018

Stuart Armstrong, Sören Mindermann

Abstract: Inverse reinforcement learning (IRL) attempts to infer human rewards or preferences from observed behavior. Since human planning systematically deviates from rationality, several approaches have been tried to account for specific human shortcomings. However, the general problem of inferring the reward function of an agent of unknown rationality has received little attention. Unlike the well-known ambiguity problems in IRL, this one is practically relevant but cannot be resolved by observing the agent's policy in enough environments. This paper shows (1) that a No Free Lunch result implies it is impossible to uniquely decompose a policy into a planning algorithm and reward function, and (2) that even with a reasonable simplicity prior/Occam's razor on the set of decompositions, we cannot distinguish between the true decomposition and others that lead to high regret. To address this, we need simple 'normative' assumptions, which cannot be deduced exclusively from observations.
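
The non-uniqueness of the decomposition can be seen in a one-step toy case. The sketch below uses hypothetical names (`rational_planner`, `anti_rational_planner`) and a single-state choice between two actions; it is an illustration of the general idea, not the formal setting of the paper. A planner that maximises a reward and a planner that minimises the negated reward produce the same policy, so observed behaviour alone cannot tell the two (planner, reward) pairs apart.

```python
def rational_planner(reward):
    """Pick the action with the highest reward."""
    return max(reward, key=reward.get)


def anti_rational_planner(reward):
    """Pick the action with the lowest reward."""
    return min(reward, key=reward.get)


# Hypothetical one-step problem with two actions.
reward = {"left": 0.0, "right": 1.0}
negated_reward = {action: -value for action, value in reward.items()}

# Two different (planner, reward) decompositions of the same behaviour.
policy_a = rational_planner(reward)
policy_b = anti_rational_planner(negated_reward)

print(policy_a, policy_b)
```

Both decompositions are about equally simple to describe, which is why a simplicity prior on its own does not single out the intended one.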

Good and safe uses of AI Oracles

Jun 05, 2018

Stuart Armstrong, Xavier O'Rourke

Abstract: It is possible that powerful and potentially dangerous artificial intelligence (AI) might be developed in the future. An Oracle is a design which aims to restrain the impact of a potentially dangerous AI by restricting the agent to no actions besides answering questions. Unfortunately, most Oracles will be motivated to gain more control over the world by manipulating users through the content of their answers, and Oracles of potentially high intelligence might be very successful at this (Alfonseca et al., 2016). In this paper we present two designs for Oracles which, even under pessimistic assumptions, will not manipulate their users into releasing them and yet will still be incentivised to provide their users with helpful answers. The first design is the counterfactual Oracle -- which chooses its answer as if it expected nobody to ever read it. The second design is the low-bandwidth Oracle -- which is limited by the quantity of information it can transmit.

* 11 pages, 2 figures
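
One way to make an Oracle answer "as if nobody would ever read it" is to score it only in episodes where the answer is withheld from the user. The sketch below assumes that mechanism; the abstract does not spell it out, and the function name, the `erasure_prob` parameter, and the squared-error score are hypothetical choices for illustration.

```python
import random


def run_episode(oracle_answer, true_value, erasure_prob, rng):
    """Simulate one question under the assumed erasure scheme.

    With probability `erasure_prob` the answer is erased before anyone
    reads it, and only then is the oracle scored on accuracy. When the
    answer is shown to the user, the reward is a constant zero, so the
    content of a displayed answer cannot change the oracle's reward.
    """
    erased = rng.random() < erasure_prob
    if erased:
        shown_to_user = None
        reward = -(oracle_answer - true_value) ** 2
    else:
        shown_to_user = oracle_answer
        reward = 0.0
    return erased, shown_to_user, reward


rng = random.Random(0)
erased_case = run_episode(3.0, 5.0, erasure_prob=1.0, rng=rng)
shown_case = run_episode(3.0, 5.0, erasure_prob=0.0, rng=rng)
```

Under this assumed scheme, the oracle's expected reward depends only on how accurate its answer is in the erased episodes, which is the sense in which it has no incentive to manipulate readers.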

'Indifference' methods for managing agent rewards

Jun 05, 2018

Stuart Armstrong, Xavier O'Rourke

Abstract: 'Indifference' refers to a class of methods used to control reward-based agents. Indifference techniques aim to achieve one or more of three distinct goals: rewards dependent on certain events (without the agent being motivated to manipulate the probability of those events), effective disbelief (where agents behave as if particular events could never happen), and seamless transition from one reward function to another (with the agent acting as if this change is unanticipated). This paper presents several methods for achieving these goals in the POMDP setting, establishing their uses, strengths, and requirements. These methods of control work even when the implications of the agent's reward are otherwise not fully understood.
