Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Adel Bibi

Open Problems in Machine Unlearning for AI Safety

Jan 09, 2025

Fazl Barez, Tingchen Fu, Ameya Prabhu, Stephen Casper, Amartya Sanyal, Adel Bibi, Aidan O'Gara, Robert Kirk, Ben Bucknall, Tim Fist(+9 more)

Abstract:As AI systems become more capable, widely deployed, and increasingly autonomous in critical areas such as cybersecurity, biological research, and healthcare, ensuring their safety and alignment with human values is paramount. Machine unlearning -- the ability to selectively forget or suppress specific types of knowledge -- has shown promise for privacy and data removal tasks, which has been the primary focus of existing research. More recently, its potential application to AI safety has gained attention. In this paper, we identify key limitations that prevent unlearning from serving as a comprehensive solution for AI safety, particularly in managing dual-use knowledge in sensitive domains like cybersecurity and chemical, biological, radiological, and nuclear (CBRN) safety. In these contexts, information can be both beneficial and harmful, and models may combine seemingly harmless information for harmful purposes -- unlearning this information could strongly affect beneficial uses. We provide an overview of inherent constraints and open problems, including the broader side effects of unlearning dangerous knowledge, as well as previously unexplored tensions between unlearning and existing safety mechanisms. Finally, we investigate challenges related to evaluation, robustness, and the preservation of safety features during unlearning. By mapping these limitations and open challenges, we aim to guide future research toward realistic applications of unlearning within a broader AI safety framework, acknowledging its limitations and highlighting areas where alternative approaches may be required.

Via

Access Paper or Ask Questions

Detecting LLM Hallucination Through Layer-wise Information Deficiency: Analysis of Unanswerable Questions and Ambiguous Prompts

Dec 13, 2024

Hazel Kim, Adel Bibi, Philip Torr, Yarin Gal

Figure 1 for Detecting LLM Hallucination Through Layer-wise Information Deficiency: Analysis of Unanswerable Questions and Ambiguous Prompts

Figure 2 for Detecting LLM Hallucination Through Layer-wise Information Deficiency: Analysis of Unanswerable Questions and Ambiguous Prompts

Figure 3 for Detecting LLM Hallucination Through Layer-wise Information Deficiency: Analysis of Unanswerable Questions and Ambiguous Prompts

Figure 4 for Detecting LLM Hallucination Through Layer-wise Information Deficiency: Analysis of Unanswerable Questions and Ambiguous Prompts

Abstract:Large language models (LLMs) frequently generate confident yet inaccurate responses, introducing significant risks for deployment in safety-critical domains. We present a novel approach to detecting model hallucination through systematic analysis of information flow across model layers when processing inputs with insufficient or ambiguous context. Our investigation reveals that hallucination manifests as usable information deficiencies in inter-layer transmissions. While existing approaches primarily focus on final-layer output analysis, we demonstrate that tracking cross-layer information dynamics ($\mathcal{L}$I) provides robust indicators of model reliability, accounting for both information gain and loss during computation. $\mathcal{L}$I improves model reliability by immediately integrating with universal LLMs without additional training or architectural modifications.

Via

Access Paper or Ask Questions

Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models

Aug 27, 2024

Wenxuan Zhang, Philip H. S. Torr, Mohamed Elhoseiny, Adel Bibi

Figure 1 for Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models

Figure 2 for Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models

Figure 3 for Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models

Figure 4 for Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models

Abstract:Fine-tuning large language models (LLMs) on human preferences, typically through reinforcement learning from human feedback (RLHF), has proven successful in enhancing their capabilities. However, ensuring the safety of LLMs during the fine-tuning remains a critical concern, and mitigating the potential conflicts in safety and helpfulness is costly in RLHF. To address this issue, we propose a supervised learning framework called Bi-Factorial Preference Optimization (BFPO), which re-parameterizes a joint RLHF objective of both safety and helpfulness into a single supervised learning objective. In the supervised optimization, a labeling function is used to capture global preferences ranking to balance both safety and helpfulness. To evaluate BFPO, we develop a benchmark including comprehensive discriminative and generative tasks for helpfulness and harmlessness. The results indicate that our method significantly outperforms existing approaches in both safety and helpfulness. Moreover, BFPO eliminates the need for human prompting and annotation in LLM fine-tuning while achieving the same level of safety as methods that heavily rely on human labor, with less than 10% of the computational resources. The training recipes and models will be released.

Via

Access Paper or Ask Questions

FedMedICL: Towards Holistic Evaluation of Distribution Shifts in Federated Medical Imaging

Jul 11, 2024

Kumail Alhamoud, Yasir Ghunaim, Motasem Alfarra, Thomas Hartvigsen, Philip Torr, Bernard Ghanem, Adel Bibi, Marzyeh Ghassemi

Figure 1 for FedMedICL: Towards Holistic Evaluation of Distribution Shifts in Federated Medical Imaging

Figure 2 for FedMedICL: Towards Holistic Evaluation of Distribution Shifts in Federated Medical Imaging

Figure 3 for FedMedICL: Towards Holistic Evaluation of Distribution Shifts in Federated Medical Imaging

Figure 4 for FedMedICL: Towards Holistic Evaluation of Distribution Shifts in Federated Medical Imaging

Abstract:For medical imaging AI models to be clinically impactful, they must generalize. However, this goal is hindered by (i) diverse types of distribution shifts, such as temporal, demographic, and label shifts, and (ii) limited diversity in datasets that are siloed within single medical institutions. While these limitations have spurred interest in federated learning, current evaluation benchmarks fail to evaluate different shifts simultaneously. However, in real healthcare settings, multiple types of shifts co-exist, yet their impact on medical imaging performance remains unstudied. In response, we introduce FedMedICL, a unified framework and benchmark to holistically evaluate federated medical imaging challenges, simultaneously capturing label, demographic, and temporal distribution shifts. We comprehensively evaluate several popular methods on six diverse medical imaging datasets (totaling 550 GPU hours). Furthermore, we use FedMedICL to simulate COVID-19 propagation across hospitals and evaluate whether methods can adapt to pandemic changes in disease prevalence. We find that a simple batch balancing technique surpasses advanced methods in average performance across FedMedICL experiments. This finding questions the applicability of results from previous, narrow benchmarks in real-world medical settings.

* Accepted at MICCAI 2024. Code is available at: https://github.com/m1k2zoo/FedMedICL

Via

Access Paper or Ask Questions

Model Merging and Safety Alignment: One Bad Model Spoils the Bunch

Jun 20, 2024

Hasan Abed Al Kader Hammoud, Umberto Michieli, Fabio Pizzati, Philip Torr, Adel Bibi, Bernard Ghanem, Mete Ozay

Abstract:Merging Large Language Models (LLMs) is a cost-effective technique for combining multiple expert LLMs into a single versatile model, retaining the expertise of the original ones. However, current approaches often overlook the importance of safety alignment during merging, leading to highly misaligned models. This work investigates the effects of model merging on alignment. We evaluate several popular model merging techniques, demonstrating that existing methods do not only transfer domain expertise but also propagate misalignment. We propose a simple two-step approach to address this problem: (i) generating synthetic safety and domain-specific data, and (ii) incorporating these generated data into the optimization process of existing data-aware model merging techniques. This allows us to treat alignment as a skill that can be maximized in the resulting merged LLM. Our experiments illustrate the effectiveness of integrating alignment-related data during merging, resulting in models that excel in both domain expertise and alignment.

* Under review

Via

Access Paper or Ask Questions

Mimicking User Data: On Mitigating Fine-Tuning Risks in Closed Large Language Models

Jun 12, 2024

Francisco Eiras, Aleksandar Petrov, Phillip H. S. Torr, M. Pawan Kumar, Adel Bibi

Figure 1 for Mimicking User Data: On Mitigating Fine-Tuning Risks in Closed Large Language Models

Figure 2 for Mimicking User Data: On Mitigating Fine-Tuning Risks in Closed Large Language Models

Figure 3 for Mimicking User Data: On Mitigating Fine-Tuning Risks in Closed Large Language Models

Figure 4 for Mimicking User Data: On Mitigating Fine-Tuning Risks in Closed Large Language Models

Abstract:Fine-tuning large language models on small, high-quality datasets can enhance their performance on specific downstream tasks. Recent research shows that fine-tuning on benign, instruction-following data can inadvertently undo the safety alignment process and increase a model's propensity to comply with harmful queries. Although critical, understanding and mitigating safety risks in well-defined tasks remains distinct from the instruction-following context due to structural differences in the data. Our work explores the risks associated with fine-tuning closed models - where providers control how user data is utilized in the process - across diverse task-specific data. We demonstrate how malicious actors can subtly manipulate the structure of almost any task-specific dataset to foster significantly more dangerous model behaviors, while maintaining an appearance of innocuity and reasonable downstream task performance. To address this issue, we propose a novel mitigation strategy that mixes in safety data which mimics the task format and prompting style of the user data, showing this is more effective than existing baselines at re-establishing safety alignment while maintaining similar task performance.

Via

Access Paper or Ask Questions

Towards Interpretable Deep Local Learning with Successive Gradient Reconciliation

Jun 07, 2024

Yibo Yang, Xiaojie Li, Motasem Alfarra, Hasan Hammoud, Adel Bibi, Philip Torr, Bernard Ghanem

Figure 1 for Towards Interpretable Deep Local Learning with Successive Gradient Reconciliation

Figure 2 for Towards Interpretable Deep Local Learning with Successive Gradient Reconciliation

Figure 3 for Towards Interpretable Deep Local Learning with Successive Gradient Reconciliation

Figure 4 for Towards Interpretable Deep Local Learning with Successive Gradient Reconciliation

Abstract:Relieving the reliance of neural network training on a global back-propagation (BP) has emerged as a notable research topic due to the biological implausibility and huge memory consumption caused by BP. Among the existing solutions, local learning optimizes gradient-isolated modules of a neural network with local errors and has been proved to be effective even on large-scale datasets. However, the reconciliation among local errors has never been investigated. In this paper, we first theoretically study non-greedy layer-wise training and show that the convergence cannot be assured when the local gradient in a module w.r.t. its input is not reconciled with the local gradient in the previous module w.r.t. its output. Inspired by the theoretical result, we further propose a local training strategy that successively regularizes the gradient reconciliation between neighboring modules without breaking gradient isolation or introducing any learnable parameters. Our method can be integrated into both local-BP and BP-free settings. In experiments, we achieve significant performance improvements compared to previous methods. Particularly, our method for CNN and Transformer architectures on ImageNet is able to attain a competitive performance with global BP, saving more than 40% memory consumption.

* ICML 2024

Via

Access Paper or Ask Questions

Universal In-Context Approximation By Prompting Fully Recurrent Models

Jun 03, 2024

Aleksandar Petrov, Tom A. Lamb, Alasdair Paren, Philip H. S. Torr, Adel Bibi

Figure 1 for Universal In-Context Approximation By Prompting Fully Recurrent Models

Figure 2 for Universal In-Context Approximation By Prompting Fully Recurrent Models

Figure 3 for Universal In-Context Approximation By Prompting Fully Recurrent Models

Figure 4 for Universal In-Context Approximation By Prompting Fully Recurrent Models

Abstract:Zero-shot and in-context learning enable solving tasks without model fine-tuning, making them essential for developing generative model solutions. Therefore, it is crucial to understand whether a pretrained model can be prompted to approximate any function, i.e., whether it is a universal in-context approximator. While it was recently shown that transformer models do possess this property, these results rely on their attention mechanism. Hence, these findings do not apply to fully recurrent architectures like RNNs, LSTMs, and the increasingly popular SSMs. We demonstrate that RNNs, LSTMs, GRUs, Linear RNNs, and linear gated architectures such as Mamba and Hawk/Griffin can also serve as universal in-context approximators. To streamline our argument, we introduce a programming language called LSRL that compiles to these fully recurrent architectures. LSRL may be of independent interest for further studies of fully recurrent models, such as constructing interpretability benchmarks. We also study the role of multiplicative gating and observe that architectures incorporating such gating (e.g., LSTMs, GRUs, Hawk/Griffin) can implement certain operations more stably, making them more viable candidates for practical in-context universal approximation.

Via

Access Paper or Ask Questions

Towards Certification of Uncertainty Calibration under Adversarial Attacks

May 22, 2024

Cornelius Emde, Francesco Pinto, Thomas Lukasiewicz, Philip H. S. Torr, Adel Bibi

Figure 1 for Towards Certification of Uncertainty Calibration under Adversarial Attacks

Figure 2 for Towards Certification of Uncertainty Calibration under Adversarial Attacks

Figure 3 for Towards Certification of Uncertainty Calibration under Adversarial Attacks

Figure 4 for Towards Certification of Uncertainty Calibration under Adversarial Attacks

Abstract:Since neural classifiers are known to be sensitive to adversarial perturbations that alter their accuracy, \textit{certification methods} have been developed to provide provable guarantees on the insensitivity of their predictions to such perturbations. Furthermore, in safety-critical applications, the frequentist interpretation of the confidence of a classifier (also known as model calibration) can be of utmost importance. This property can be measured via the Brier score or the expected calibration error. We show that attacks can significantly harm calibration, and thus propose certified calibration as worst-case bounds on calibration under adversarial perturbations. Specifically, we produce analytic bounds for the Brier score and approximate bounds via the solution of a mixed-integer program on the expected calibration error. Finally, we propose novel calibration attacks and demonstrate how they can improve model calibration through \textit{adversarial calibration training}.

* 11 pages main paper, appendix included

Via

Access Paper or Ask Questions

Risks and Opportunities of Open-Source Generative AI

May 14, 2024

Francisco Eiras, Aleksander Petrov, Bertie Vidgen, Christian Schroeder, Fabio Pizzati, Katherine Elkins, Supratik Mukhopadhyay, Adel Bibi, Aaron Purewal, Csaba Botos(+15 more)

Figure 1 for Risks and Opportunities of Open-Source Generative AI

Figure 2 for Risks and Opportunities of Open-Source Generative AI

Figure 3 for Risks and Opportunities of Open-Source Generative AI

Figure 4 for Risks and Opportunities of Open-Source Generative AI

Abstract:Applications of Generative AI (Gen AI) are expected to revolutionize a number of different areas, ranging from science & medicine to education. The potential for these seismic changes has triggered a lively debate about the potential risks of the technology, and resulted in calls for tighter regulation, in particular from some of the major tech companies who are leading in AI development. This regulation is likely to put at risk the budding field of open-source generative AI. Using a three-stage framework for Gen AI development (near, mid and long-term), we analyze the risks and opportunities of open-source generative AI models with similar capabilities to the ones currently available (near to mid-term) and with greater capabilities (long-term). We argue that, overall, the benefits of open-source Gen AI outweigh its risks. As such, we encourage the open sourcing of models, training and evaluation data, and provide a set of recommendations and best practices for managing risks associated with open-source generative AI.

* Extension of arXiv:2404.17047

Via

Access Paper or Ask Questions