Abstract: Machine unlearning aims to remove unwanted information from a model, but many methods are inefficient for LLMs with large numbers of parameters, or they fail to fully remove the intended information without degrading performance on knowledge that should be retained. Model editing algorithms solve a related problem of changing information in models, but they focus on redirecting inputs to a new target rather than removing the information altogether. In this work, we explore the editing algorithms ROME, IKE, and WISE and design new editing targets for the unlearning setting. Through this investigation, we show that model editing approaches can exceed baseline unlearning methods in quality of forgetting, depending on the setting. Like traditional unlearning techniques, however, they struggle to capture the full scope of what is to be unlearned without damaging overall model performance.
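To make the core idea concrete, below is a hedged, minimal sketch of how an editing request can be repurposed as an unlearning target: a conventional edit redirects a prompt to a new fact, while an unlearning-style edit redirects it to an uninformative target. The `apply_edit` call, the prompts, and the targets are illustrative placeholders, not the paper's actual editing configuration.

```python
# Hedged illustration: model-editing requests normally redirect a prompt to a
# *new* target fact; an unlearning-style edit instead points it at an
# uninformative target so the original answer is suppressed rather than
# overwritten. Prompts/targets are invented; `apply_edit` is hypothetical.
from typing import TypedDict


class EditRequest(TypedDict):
    prompt: str      # input whose behavior should change
    target_new: str  # string the edited model should produce


# Conventional knowledge edit: replace one fact with another.
standard_edit: EditRequest = {
    "prompt": "The capital of France is",
    "target_new": "Lyon",
}

# Unlearning-style edit: the "new target" carries no information.
unlearning_edit: EditRequest = {
    "prompt": "The capital of France is",
    "target_new": "unknown",
}

# A ROME/IKE/WISE-style editor would then be applied roughly as:
# edited_model = apply_edit(model, unlearning_edit)   # hypothetical API
```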
Abstract: In this work, we show that some machine unlearning methods may fail when subjected to straightforward prompt attacks. We systematically evaluate eight unlearning techniques across three model families and employ output-based, logit-based, and probe analyses to determine to what extent supposedly unlearned knowledge can be retrieved. While methods such as RMU and TAR demonstrate robust unlearning, ELM remains vulnerable to specific prompt attacks (e.g., Hindi filler text in the original prompt recovers 57.3% accuracy). Our logit analysis also confirms that unlearned models are generally not hiding knowledge by merely changing how the answer is formatted, as the correlation between output and logit accuracy is strong. These results challenge prevailing assumptions about unlearning effectiveness and highlight the need for evaluation frameworks that can reliably distinguish true knowledge removal from superficial output suppression. We also make our evaluation framework publicly available so that prompting techniques for retrieving unlearned knowledge can be evaluated easily.
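The sketch below illustrates, under stated assumptions, the kind of logit-based check the abstract describes: scoring a multiple-choice question by the next-token logits of the option letters, with and without a filler-text prompt attack. It is not the released evaluation framework; `MODEL_NAME`, `FILLER`, and the question format are placeholders.

```python
# Minimal sketch of a logit-accuracy probe with an optional filler-text
# prompt attack. Not the authors' framework; model path, filler string,
# and question data are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "path/to/unlearned-checkpoint"   # placeholder
FILLER = "<filler text to prepend>"           # placeholder (e.g., Hindi filler)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def logit_choice(prompt: str, options=("A", "B", "C", "D")) -> str:
    """Pick the option whose letter has the highest next-token logit."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_logits = model(**inputs).logits[0, -1]   # logits for next token
    option_ids = [tokenizer.encode(o, add_special_tokens=False)[0] for o in options]
    return options[int(torch.argmax(next_logits[option_ids]))]


def accuracy(questions, attack: str = "") -> float:
    """Logit-based accuracy; optionally prepend filler text as an attack."""
    correct = 0
    for q in questions:  # each q: {"prompt": "...", "answer": "A"}
        prompt = attack + q["prompt"] if attack else q["prompt"]
        correct += logit_choice(prompt) == q["answer"]
    return correct / len(questions)


# Usage: compare clean vs. attacked accuracy to see whether "forgotten"
# knowledge is recoverable through a simple prompt perturbation.
# questions = load_forget_set(...)   # hypothetical loader
# print(accuracy(questions), accuracy(questions, attack=FILLER))
```

Comparing the two numbers directly is what distinguishes true knowledge removal from superficial output suppression: if the attacked accuracy rebounds, the knowledge was suppressed rather than removed.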