Sheila A. McIlraith

Gauss-Newton Unlearning for the LLM Era

Feb 11, 2026

Lev McKinney, Anvith Thudi, Juhan Bae, Tara Rezaei, Nicolas Papernot, Sheila A. McIlraith, Roger Grosse

Abstract:Standard large language model training can create models that produce outputs their trainer deems unacceptable in deployment. The probability of these outputs can be reduced using methods such as LLM unlearning. However, unlearning a set of data (called the forget set) can degrade model performance on other distributions where the trainer wants to retain the model's behavior. To improve this trade-off, we demonstrate that using the forget set to compute only a few uphill Gauss-Newton steps provides a conceptually simple, state-of-the-art unlearning approach for LLMs. While Gauss-Newton steps adapt Newton's method to non-linear models, it is non-trivial to efficiently and accurately compute such steps for LLMs. Hence, our approach crucially relies on parametric Hessian approximations such as Kronecker-Factored Approximate Curvature (K-FAC). We call this combined approach K-FADE (K-FAC for Distribution Erasure). Our evaluation on the WMDP and ToFU benchmarks demonstrates that K-FADE suppresses outputs from the forget set and approximates, in output space, the results of retraining without the forget set. Critically, our method does this while altering the outputs on the retain set less than previous methods. This is because K-FADE transforms a constraint on the model's outputs across the entire retain set into a constraint on the model's weights, allowing the algorithm to minimally change the model's behavior on the retain set at each step. Moreover, the unlearning updates computed by K-FADE can be reapplied later if the model undergoes further training, allowing unlearning to be cheaply maintained.

* 18 pages
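As a rough illustration of the "uphill Gauss-Newton step" described in the abstract, the update can be written in a generic damped form. The notation is an assumption, not taken from the paper: $\theta$ denotes the weights, $L_F$ the forget-set loss, $G$ a Gauss-Newton matrix, $\lambda > 0$ a damping term, and $\eta > 0$ a step size. The paper's exact update, and the data on which $G$ is estimated, may differ.

```latex
% Maximizing the linearized increase in forget loss under a quadratic
% constraint on the weight change \delta:
%   \max_{\delta} \; \nabla_\theta L_F(\theta_t)^\top \delta
%   \quad \text{s.t.} \quad \tfrac{1}{2}\, \delta^\top (G + \lambda I)\, \delta \le \epsilon
% has a solution proportional to (G + \lambda I)^{-1} \nabla_\theta L_F(\theta_t), giving the uphill step
\theta_{t+1} = \theta_t + \eta \,(G + \lambda I)^{-1}\, \nabla_\theta L_F(\theta_t)

% K-FAC approximates each layer's block of G by a Kronecker product,
% which can be inverted factor by factor:
G_\ell \approx A_\ell \otimes S_\ell,
\qquad
(A_\ell \otimes S_\ell)^{-1} = A_\ell^{-1} \otimes S_\ell^{-1}
```

The plus sign is what makes the step "uphill": it increases the loss on the forget set, while the quadratic constraint is one way to express the weight-space constraint the abstract mentions.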

Satisficing and Optimal Generalised Planning via Goal Regression (Extended Version)

Nov 14, 2025

Dillon Z. Chen, Till Hofmann, Toryn Q. Klassen, Sheila A. McIlraith

Abstract:Generalised planning (GP) refers to the task of synthesising programs that solve families of related planning problems. We introduce a novel, yet simple method for GP: given a set of training problems, for each problem, compute an optimal plan for each goal atom in some order, perform goal regression on the resulting plans, and lift the corresponding outputs to obtain a set of first-order $\textit{Condition} \rightarrow \textit{Actions}$ rules. The rules collectively constitute a generalised plan that can be executed as is or alternatively be used to prune the planning search space. We formalise and prove the conditions under which our method is guaranteed to learn valid generalised plans and state space pruning axioms for search. Experiments demonstrate significant improvements over state-of-the-art (generalised) planners with respect to the 3 metrics of synthesis cost, planning coverage, and solution quality on various classical and numeric planning domains.

* Extended version of AAAI 2026 paper
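The goal-regression step in the abstract can be sketched for ground STRIPS-style actions. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Action` type, the `regress` and `regress_plan` helpers, and the two blocks-world actions are hypothetical, and the lifting of ground conditions to first-order rules is not shown.

```python
from typing import FrozenSet, List, NamedTuple, Optional, Tuple


class Action(NamedTuple):
    """A ground STRIPS-style action (hypothetical type for this sketch)."""
    name: str
    pre: FrozenSet[str]
    add: FrozenSet[str]
    delete: FrozenSet[str]


def regress(goal: FrozenSet[str], action: Action) -> Optional[FrozenSet[str]]:
    """Weakest condition under which applying `action` achieves `goal`.

    Returns None when the action deletes part of the goal.
    """
    if goal & action.delete:
        return None
    return (goal - action.add) | action.pre


def regress_plan(goal: FrozenSet[str], plan: List[Action]) -> List[Tuple[FrozenSet[str], str]]:
    """Regress `goal` backwards through `plan`, pairing each regressed
    condition with the action to execute when that condition holds."""
    rules = []
    for action in reversed(plan):
        regressed = regress(goal, action)
        if regressed is None:
            raise ValueError(f"{action.name} deletes part of the goal")
        goal = regressed
        rules.append((goal, action.name))
    rules.reverse()
    return rules


# Hypothetical blocks-world actions, used only to exercise the sketch.
pickup_a = Action(
    name="pickup_a",
    pre=frozenset({"clear_a", "ontable_a", "handempty"}),
    add=frozenset({"holding_a"}),
    delete=frozenset({"clear_a", "ontable_a", "handempty"}),
)
stack_a_b = Action(
    name="stack_a_b",
    pre=frozenset({"holding_a", "clear_b"}),
    add=frozenset({"on_a_b", "clear_a", "handempty"}),
    delete=frozenset({"holding_a", "clear_b"}),
)

rules = regress_plan(frozenset({"on_a_b"}), [pickup_a, stack_a_b])
for condition, action_name in rules:
    print(sorted(condition), "->", action_name)
```

Each printed pair is a ground Condition → Action rule; replacing the constants `a` and `b` with variables would give the kind of first-order rule the abstract describes.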

Pushdown Reward Machines for Reinforcement Learning

Aug 09, 2025

Giovanni Varricchione, Toryn Q. Klassen, Natasha Alechina, Mehdi Dastani, Brian Logan, Sheila A. McIlraith

Abstract:Reward machines (RMs) are automata structures that encode (non-Markovian) reward functions for reinforcement learning (RL). RMs can reward any behaviour representable in regular languages and, when paired with RL algorithms that exploit RM structure, have been shown to significantly improve sample efficiency in many domains. In this work, we present pushdown reward machines (pdRMs), an extension of reward machines based on deterministic pushdown automata. pdRMs can recognize and reward temporally extended behaviours representable in deterministic context-free languages, making them more expressive than reward machines. We introduce two variants of pdRM-based policies, one which has access to the entire stack of the pdRM, and one which can only access the top $k$ symbols (for a given constant $k$) of the stack. We propose a procedure to check when the two kinds of policies (for a given environment, pdRM, and constant $k$) achieve the same optimal expected reward. We then provide theoretical results establishing the expressive power of pdRMs, and space complexity results about the proposed learning problems. Finally, we provide experimental results showing how agents can be trained to perform tasks representable in deterministic context-free languages using pdRMs.
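To make the idea concrete, here is a minimal sketch of a pushdown reward machine for a task that a finite-state reward machine cannot express: perform some number of `pick` events followed by the same number of `drop` events. The class, state names, and event names are hypothetical and chosen for illustration; they are not the paper's formalism or code.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PushdownRewardMachine:
    """Toy pdRM rewarding pick^n drop^n (n >= 1) with reward 1.0."""
    state: str = "collect"
    stack: List[str] = field(default_factory=list)

    def step(self, event: str) -> float:
        # "done" and "failed" are absorbing: no further reward.
        if self.state in ("done", "failed"):
            return 0.0
        if event == "pick":
            if self.state == "collect":
                self.stack.append("item")  # remember one more pending drop
            else:
                self.state = "failed"  # a pick after a drop breaks the pattern
            return 0.0
        if event == "drop":
            if not self.stack:
                self.state = "failed"  # nothing left to drop
                return 0.0
            self.stack.pop()
            if self.stack:
                self.state = "deliver"
                return 0.0
            self.state = "done"  # stack emptied: the task is complete
            return 1.0
        return 0.0  # all other events are ignored

    def top_k(self, k: int) -> Tuple[str, ...]:
        """The top k stack symbols, as seen by a policy with bounded stack access."""
        return tuple(self.stack[max(0, len(self.stack) - k):])


def total_reward(events: List[str]) -> float:
    machine = PushdownRewardMachine()
    return sum(machine.step(event) for event in events)
```

The `top_k` method mirrors the second policy variant in the abstract, which observes only the top $k$ stack symbols rather than the whole stack.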

Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers

Feb 27, 2025

Shalev Lifshitz, Sheila A. McIlraith, Yilun Du

Abstract:By utilizing more computational resources at test-time, large language models (LLMs) can improve without additional training. One common strategy uses verifiers to evaluate candidate outputs. In this work, we propose a novel scaling dimension for test-time compute: scaling the number of verifiers. We introduce Multi-Agent Verification (MAV) as a test-time compute paradigm that combines multiple verifiers to improve performance. We propose using Aspect Verifiers (AVs), off-the-shelf LLMs prompted to verify different aspects of outputs, as one possible choice for the verifiers in a MAV system. AVs are a convenient building block for MAV since they can be easily combined without additional training. Moreover, we introduce BoN-MAV, a simple multi-agent verification algorithm that combines best-of-n sampling with multiple verifiers. BoN-MAV demonstrates stronger scaling patterns than self-consistency and reward model verification, and we demonstrate both weak-to-strong generalization, where combining weak verifiers improves even stronger LLMs, and self-improvement, where the same base model is used to both generate and verify outputs. Our results establish scaling the number of verifiers as a promising new dimension for improving language model performance at test-time.
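A best-of-n selection with multiple verifiers might look roughly like the sketch below. It assumes each verifier returns a binary approval and that approvals are summed with equal weight, which may differ from the paper's aggregation. The `bon_mav` function and the three toy verifiers are hypothetical stand-ins; in the paper the verifiers are prompted LLMs.

```python
from typing import Callable, Sequence

# A verifier takes (question, candidate) and returns True to approve.
Verifier = Callable[[str, str], bool]


def bon_mav(question: str, candidates: Sequence[str], verifiers: Sequence[Verifier]) -> str:
    """Return the candidate approved by the most verifiers.

    Ties go to the earliest candidate, because max() keeps the first maximum.
    """
    def approvals(candidate: str) -> int:
        return sum(1 for verify in verifiers if verify(question, candidate))

    return max(candidates, key=approvals)


# Toy verifiers, each checking one aspect of an answer (hypothetical).
def is_numeric(question: str, candidate: str) -> bool:
    return candidate.strip().isdigit()


def is_correct_sum(question: str, candidate: str) -> bool:
    return candidate.strip() == "4"


def is_concise(question: str, candidate: str) -> bool:
    return len(candidate) <= 3


best = bon_mav(
    "What is 2 + 2?",
    ["5", "4", "about four"],
    [is_numeric, is_correct_sum, is_concise],
)
print(best)
```

Scaling the number of verifiers in this sketch just means passing a longer `verifiers` list; the candidate pool and the selection rule stay the same.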

Pluralistic Alignment Over Time

Nov 16, 2024

Toryn Q. Klassen, Parand A. Alamdari, Sheila A. McIlraith

Abstract:If an AI system makes decisions over time, how should we evaluate how aligned it is with a group of stakeholders (who may have conflicting values and preferences)? In this position paper, we advocate for consideration of temporal aspects including stakeholders' changing levels of satisfaction and their possibly temporally extended preferences. We suggest how a recent approach to evaluating fairness over time could be applied to a new form of pluralistic alignment: temporal pluralism, where the AI system reflects different stakeholders' values at different times.

* Pluralistic Alignment Workshop at NeurIPS 2024

Being Considerate as a Pathway Towards Pluralistic Alignment for Agentic AI

Nov 15, 2024

Parand A. Alamdari, Toryn Q. Klassen, Rodrigo Toro Icarte, Sheila A. McIlraith

Abstract:Pluralistic alignment is concerned with ensuring that an AI system's objectives and behaviors are in harmony with the diversity of human values and perspectives. In this paper we study the notion of pluralistic alignment in the context of agentic AI, and in particular in the context of an agent that is trying to learn a policy in a manner that is mindful of the values and perspective of others in the environment. To this end, we show how being considerate of the future wellbeing and agency of other (human) agents can promote a form of pluralistic alignment.

* Pluralistic Alignment Workshop at NeurIPS 2024

Reward Machines for Deep RL in Noisy and Uncertain Environments

May 31, 2024

Andrew C. Li, Zizhao Chen, Toryn Q. Klassen, Pashootan Vaezipoor, Rodrigo Toro Icarte, Sheila A. McIlraith

Abstract:Reward Machines provide an automata-inspired structure for specifying instructions, safety constraints, and other temporally extended reward-worthy behaviour. By exposing complex reward function structure, they enable counterfactual learning updates that have resulted in impressive sample efficiency gains. While Reward Machines have been employed in both tabular and deep RL settings, they have typically relied on a ground-truth interpretation of the domain-specific vocabulary that forms the building blocks of the reward function. Such ground-truth interpretations can be elusive in many real-world settings, due in part to partial observability or noisy sensing. In this paper, we explore the use of Reward Machines for Deep RL in noisy and uncertain environments. We characterize this problem as a POMDP and propose a suite of RL algorithms that leverage task structure under uncertain interpretation of domain-specific vocabulary. Theoretical analysis exposes pitfalls in naive approaches to this problem, while experimental results show that our algorithms successfully leverage task structure to improve performance under noisy interpretations of the vocabulary. Our results provide a general framework for exploiting Reward Machines in partially observable environments.

PRP Rebooted: Advancing the State of the Art in FOND Planning

Dec 20, 2023

Christian Muise, Sheila A. McIlraith, J. Christopher Beck

Abstract:Fully Observable Non-Deterministic (FOND) planning is a variant of classical symbolic planning in which actions are nondeterministic, with an action's outcome known only upon execution. It is a popular planning paradigm with applications ranging from robot planning to dialogue-agent design and reactive synthesis. Over the last 20 years, a number of approaches to FOND planning have emerged. In this work, we establish a new state of the art, following in the footsteps of some of the most powerful FOND planners to date. Our planner, PR2, decisively outperforms the four leading FOND planners, at times by a large margin, in 17 of 18 domains that represent a comprehensive benchmark suite. Ablation studies demonstrate the impact of various techniques we introduce, with the largest improvement coming from our novel FOND-aware heuristic.

* 13 pages, 4 figures, AAAI conference paper. Update: fixed abstract and typos

Remembering to Be Fair: On Non-Markovian Fairness in Sequential Decision Making

Dec 08, 2023

Parand A. Alamdari, Toryn Q. Klassen, Elliot Creager, Sheila A. McIlraith

Abstract:Fair decision making has largely been studied with respect to a single decision. In this paper we investigate the notion of fairness in the context of sequential decision making where multiple stakeholders can be affected by the outcomes of decisions, and where decision making may be informed by additional constraints and criteria beyond the requirement of fairness. In this setting, we observe that fairness often depends on the history of the sequential decision-making process and not just on the current state. To advance our understanding of this class of fairness problems, we define the notion of non-Markovian fairness in the context of sequential decision making. We identify properties of non-Markovian fairness, including notions of long-term, anytime, periodic, and bounded fairness. We further explore the interplay between non-Markovian fairness and memory, and how this can support construction of fair policies in sequential decision-making settings.

* 9 pages

Learning Symbolic Representations for Reinforcement Learning of Non-Markovian Behavior

Jan 08, 2023

Phillip J. K. Christoffersen, Andrew C. Li, Rodrigo Toro Icarte, Sheila A. McIlraith

Abstract:Many real-world reinforcement learning (RL) problems necessitate learning complex, temporally extended behavior that may only receive reward signal when the behavior is completed. If the reward-worthy behavior is known, it can be specified in terms of a non-Markovian reward function, that is, a function that depends on aspects of the state-action history, rather than just the current state and action. Such reward functions yield sparse rewards, necessitating an inordinate number of experiences to find a policy that captures the reward-worthy pattern of behavior. Recent work has leveraged Knowledge Representation (KR) to provide a symbolic abstraction of aspects of the state that summarize reward-relevant properties of the state-action history and support learning a Markovian decomposition of the problem in terms of an automaton over the KR. Providing such a decomposition has been shown to vastly improve learning rates, especially when coupled with algorithms that exploit automaton structure. Nevertheless, such techniques rely on a priori knowledge of the KR. In this work, we explore how to automatically discover useful state abstractions that support learning automata over the state-action history. The result is an end-to-end algorithm that can learn optimal policies with significantly fewer environment samples than state-of-the-art RL on simple non-Markovian domains.

* 7 pages, 2 figures, presented at KR2ML workshop at NeurIPS 2020
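The Markovian decomposition mentioned in the abstract can be illustrated with a small hand-written automaton. In this sketch the reward for reaching the office depends on whether coffee was collected earlier, so it is non-Markovian in the environment state alone but Markovian once the automaton state is tracked. The symbols `coffee` and `office`, the state names, and the transition table are hypothetical; the paper is about learning such abstractions automatically, which this sketch does not do.

```python
from typing import Dict, List, Tuple

# (automaton state, observed symbol) -> (next automaton state, reward)
RM_TRANSITIONS: Dict[Tuple[str, str], Tuple[str, float]] = {
    ("u0", "coffee"): ("u1", 0.0),      # coffee collected
    ("u1", "office"): ("u_done", 1.0),  # delivered after collecting: reward
}


def rm_step(u: str, symbol: str) -> Tuple[str, float]:
    """Advance the automaton; unlisted pairs leave the state unchanged with zero reward."""
    return RM_TRANSITIONS.get((u, symbol), (u, 0.0))


def run(symbols: List[str]) -> Tuple[str, float]:
    """Feed a sequence of symbols through the automaton and sum the rewards."""
    u, total = "u0", 0.0
    for symbol in symbols:
        u, reward = rm_step(u, symbol)
        total += reward
    return u, total
```

An RL agent that conditions on the pair (environment state, `u`) faces a Markovian problem, which is what makes standard algorithms applicable.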
