Predictive models are increasingly used to make consequential decisions in high-stakes domains such as healthcare, finance, and policy. It is therefore critical to ensure that these models make accurate predictions, are robust to shifts in the data, do not rely on spurious features, and do not unduly discriminate against minority groups. To this end, several approaches spanning explainability, fairness, and robustness have been proposed in recent literature. Such approaches need to be human-centered, as their purpose is to help users understand the models they rely on. However, there is a research gap in understanding the human-centric needs and challenges of monitoring machine learning (ML) models once they are deployed. To fill this gap, we conducted an interview study with 13 practitioners who have experience at the intersection of deploying ML models and engaging with customers, spanning domains such as financial services, healthcare, hiring, online retail, computational advertising, and conversational assistants. We identified various human-centric challenges and requirements for model monitoring in real-world applications. In particular, we found that practitioners both need and struggle to obtain monitoring systems that clarify how monitoring observations affect outcomes. Further, such insights must be actionable, robust, customizable for domain-specific use cases, and cognitively considerate to avoid information overload.
Despite the plethora of post hoc model explanation methods, the basic properties and behavior of these methods, and the conditions under which each one is effective, are not well understood. In this work, we bridge these gaps and address a fundamental question: Which explanation method should one use in a given situation? To this end, we adopt a function approximation perspective and formalize the local function approximation (LFA) framework. We show that popular explanation methods are instances of this framework, performing function approximations of the underlying model in different neighborhoods using different loss functions. We introduce a no free lunch theorem for explanation methods which demonstrates that no single method can perform optimally across all neighborhoods, and which therefore calls for choosing among methods. To choose among methods, we set forth a guiding principle based on the function approximation perspective: a method is effective if it recovers the underlying model whenever the model is a member of the explanation function class. We then analyze the conditions under which popular explanation methods are effective and provide recommendations for choosing among explanation methods and creating new ones. Lastly, we empirically validate our theoretical results using various real-world datasets, model classes, and prediction tasks. By providing a principled mathematical framework which unifies diverse explanation methods, our work characterizes the behavior of these methods and their relation to one another, guides the choice of explanation methods, and paves the way for the creation of new ones.
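To make the LFA view concrete, the sketch below fits a weighted linear surrogate to a black-box model over a Gaussian neighborhood of an input; a LIME-style method corresponds to this particular choice of neighborhood, kernel, and squared-error loss, while other methods correspond to other choices. This is a minimal sketch, not the paper's API; the names (predict_fn, sigma, kernel_width) are illustrative assumptions.
\begin{verbatim}
import numpy as np

def lfa_explanation(predict_fn, x, sigma=0.5, kernel_width=1.0,
                    n_samples=1000, seed=0):
    rng = np.random.default_rng(seed)
    # Neighborhood: Gaussian perturbations around the input x.
    Z = x + sigma * rng.standard_normal((n_samples, x.shape[0]))
    y = predict_fn(Z)  # black-box outputs on the neighborhood
    # Kernel weights: nearer perturbations count more (LIME-style).
    w = np.exp(-np.sum((Z - x) ** 2, axis=1) / kernel_width ** 2)
    # Weighted least squares: argmin_g sum_i w_i (f(z_i) - g(z_i))^2
    # with g linear, i.e., a local linear approximation of f.
    Zb = np.hstack([Z, np.ones((n_samples, 1))])  # add intercept
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(Zb * sw[:, None], y * sw, rcond=None)
    return coef[:-1]  # feature attributions (intercept dropped)
\end{verbatim}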
As post hoc explanation methods are increasingly being leveraged to explain complex models in high-stakes settings, it becomes critical to ensure that the quality of the resulting explanations is consistently high across various population subgroups, including minority groups. For instance, it should not be the case that explanations associated with instances belonging to a particular gender subgroup (e.g., female) are less accurate than those associated with other genders. However, there is little to no research that assesses whether such group-based disparities exist in the quality of the explanations output by state-of-the-art explanation methods. In this work, we address the aforementioned gaps by initiating the study of group-based disparities in explanation quality. To this end, we first outline the key properties that constitute explanation quality and the settings in which disparities in these properties can be particularly problematic. We then leverage these properties to propose a novel evaluation framework which can quantitatively measure disparities in the quality of explanations output by state-of-the-art methods. Using this framework, we carry out a rigorous empirical analysis to understand if and when group-based disparities in explanation quality arise. Our results indicate that such disparities are more likely to occur when the models being explained are complex and highly non-linear. We also observe that certain post hoc explanation methods (e.g., Integrated Gradients, SHAP) are more likely to exhibit these disparities. To the best of our knowledge, this work is the first to highlight and study the problem of group-based disparities in explanation quality. In doing so, our work sheds light on previously unexplored ways in which explanation methods may introduce unfairness in real-world decision making.
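As an illustration of how such a disparity could be quantified, the sketch below scores the local fidelity of an attribution-based explanation for each instance and compares average fidelity across subgroups; the gap between the best- and worst-served groups is one simple disparity measure. This is a simplified illustration under assumed names (predict_fn, group_ids), not the paper's actual evaluation framework.
\begin{verbatim}
import numpy as np

def local_fidelity(predict_fn, x, attribution, sigma=0.1,
                   n_samples=200, seed=0):
    rng = np.random.default_rng(seed)
    Z = x + sigma * rng.standard_normal((n_samples, x.shape[0]))
    f = predict_fn(Z)
    # Local linear surrogate implied by the attribution:
    # f(z) ~ f(x) + attribution . (z - x)
    g = predict_fn(x[None, :])[0] + (Z - x) @ attribution
    return np.mean((f - g) ** 2)  # lower = more faithful locally

def fidelity_gap(predict_fn, X, attributions, group_ids):
    scores = np.array([local_fidelity(predict_fn, x, a)
                       for x, a in zip(X, attributions)])
    per_group = {g: scores[group_ids == g].mean()
                 for g in np.unique(group_ids)}
    # Disparity: gap between best- and worst-served subgroup.
    return per_group, max(per_group.values()) - min(per_group.values())
\end{verbatim}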
As attribution-based explanation methods are increasingly used to establish model trustworthiness in high-stakes situations, it is critical to ensure that these explanations are stable, e.g., robust to infinitesimal perturbations of an input. However, previous works have shown that state-of-the-art explanation methods generate unstable explanations. Here, we introduce metrics to quantify the stability of an explanation and show that several popular explanation methods are unstable. In particular, we propose new Relative Stability metrics that measure the change in the output explanation with respect to changes in the input, the model representation, or the output of the underlying predictor. Finally, our experimental evaluation with three real-world datasets yields interesting insights into seven explanation methods under the different stability metrics.
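To sketch the flavor of these metrics, the snippet below estimates a relative-input-stability-style quantity: the worst-case relative change in the explanation per unit of relative change in the input, over small random perturbations. The perturbation scheme and names (explain_fn, sigma) are illustrative assumptions; analogous metrics substitute the model's internal representation or output in the denominator.
\begin{verbatim}
import numpy as np

def relative_input_stability(explain_fn, x, sigma=0.05,
                             n_samples=50, eps=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    e_x = explain_fn(x)
    ratios = []
    for _ in range(n_samples):
        x_p = x + sigma * rng.standard_normal(x.shape)
        e_p = explain_fn(x_p)
        # Relative change in explanation vs. relative change in input.
        num = np.linalg.norm((e_x - e_p) / (np.abs(e_x) + eps))
        den = max(np.linalg.norm((x - x_p) / (np.abs(x) + eps)), eps)
        ratios.append(num / den)
    return max(ratios)  # higher = less stable
\end{verbatim}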
As machine learning (ML) models are increasingly being deployed in high-stakes applications, there has been growing interest in providing recourse to individuals adversely impacted by model predictions (e.g., an applicant whose loan has been denied). To this end, several post hoc techniques have been proposed in recent literature. These techniques generate recourses under the assumption that the affected individuals will implement the prescribed recourses exactly. However, recent studies suggest that individuals often implement recourses in a noisy and inconsistent manner, e.g., raising their salary by \$505 when the prescribed recourse suggested an increase of \$500. Motivated by this, we introduce and study the problem of recourse invalidation in the face of noisy human responses. More specifically, we theoretically and empirically analyze the behavior of state-of-the-art algorithms, and demonstrate that the recourses generated by these algorithms are very likely to be invalidated if small changes are made to them. We further propose a novel framework, EXPECTing noisy responses (EXPECT), which addresses the aforementioned problem by explicitly minimizing the probability of recourse invalidation in the face of noisy responses. Experimental evaluation with multiple real-world datasets demonstrates the efficacy of the proposed framework and supports our theoretical findings.
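To make the central quantity concrete, the sketch below estimates the invalidation probability of a recourse by Monte Carlo: add noise to the prescribed point and check how often the model's decision flips back to the unfavorable outcome. A robust recourse generator in the spirit of EXPECT would penalize such an estimate in its objective. The Gaussian noise model and names (predict_fn, sigma) are illustrative assumptions, not the paper's exact formulation.
\begin{verbatim}
import numpy as np

def invalidation_rate(predict_fn, x_cf, sigma=0.1,
                      n_samples=1000, seed=0):
    rng = np.random.default_rng(seed)
    # Noisy implementations of the prescribed recourse x_cf.
    noisy = x_cf + sigma * rng.standard_normal((n_samples,
                                                x_cf.shape[0]))
    preds = predict_fn(noisy)   # assume label 1 = favorable outcome
    return np.mean(preds != 1)  # fraction of noisy attempts that fail
\end{verbatim}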
As various post hoc explanation methods are increasingly being leveraged to explain complex models in high-stakes settings, it becomes critical to develop a deeper understanding of if and when the explanations output by these methods disagree with each other, and how such disagreements are resolved in practice. However, there is little to no research that provides answers to these critical questions. In this work, we introduce and study the disagreement problem in explainable machine learning. More specifically, we formalize the notion of disagreement between explanations, analyze how often such disagreements occur in practice, and examine how practitioners resolve them. To this end, we first conduct interviews with data scientists to understand what constitutes disagreement between explanations generated by different methods for the same model prediction, and introduce a novel quantitative framework to formalize this understanding. We then leverage this framework to carry out a rigorous empirical analysis with four real-world datasets, six state-of-the-art post hoc explanation methods, and eight different predictive models, to measure the extent of disagreement between the explanations generated by various popular explanation methods. In addition, we carry out an online user study with data scientists to understand how they resolve the aforementioned disagreements. Our results indicate that state-of-the-art explanation methods often disagree in terms of the explanations they output. Our findings also underscore the importance of developing principled evaluation metrics that enable practitioners to effectively compare explanations.
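Two simple measures in the spirit of such a framework are sketched below: top-k feature agreement (overlap of the k most important features in two attribution vectors) and rank agreement via Spearman correlation of importance magnitudes. These are simplified illustrations rather than the paper's exact metric definitions.
\begin{verbatim}
import numpy as np
from scipy.stats import spearmanr

def top_k_agreement(e1, e2, k=5):
    # Overlap between the k features each method deems most important.
    top1 = set(np.argsort(-np.abs(e1))[:k])
    top2 = set(np.argsort(-np.abs(e2))[:k])
    return len(top1 & top2) / k  # 1.0 = identical top-k sets

def rank_agreement(e1, e2):
    # Correlation of the full importance orderings.
    rho, _ = spearmanr(np.abs(e1), np.abs(e2))
    return rho  # 1.0 = identical ordering of feature importances
\end{verbatim}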
As practitioners increasingly deploy machine learning models in critical domains such as health care, finance, and policy, it becomes vital to ensure that domain experts function effectively alongside these models. Explainability is one way to bridge the gap between human decision-makers and machine learning models. However, most of the existing work on explainability focuses on one-off, static explanations like feature importances or rule lists. These sorts of explanations may not be sufficient for many use cases that require dynamic, continuous discovery from stakeholders. In the literature, few works ask decision-makers about the utility of existing explanations and other desiderata they would like to see in an explanation going forward. In this work, we address this gap and carry out a study where we interview doctors, healthcare professionals, and policymakers about their needs and desires for explanations. Our study indicates that decision-makers would strongly prefer interactive explanations in the form of natural language dialogues. Domain experts wish to treat machine learning models as "another colleague", i.e., one who can be held accountable by asking why they made a particular decision through expressive and accessible natural language interactions. Considering these needs, we outline a set of five principles researchers should follow when designing interactive explanations as a starting place for future work. Further, we show why natural language dialogues satisfy these principles and are a desirable way to build interactive explanations. Next, we provide a design of a dialogue system for explainability and discuss the risks, trade-offs, and research opportunities of building these systems. Overall, we hope our work serves as a starting point for researchers and engineers designing interactive explainability systems.
As machine learning models are increasingly used in critical decision-making settings (e.g., healthcare, finance), there has been a growing emphasis on developing methods to explain model predictions. Such \textit{explanations} are used to understand and establish trust in models and are vital components in machine learning pipelines. Though explanations are a critical piece in these systems, there is little understanding of how they are vulnerable to manipulation by adversaries. In this paper, we discuss how two broad classes of explanations are vulnerable to manipulation. We demonstrate how adversaries can design biased models that manipulate model-agnostic feature attribution methods (e.g., LIME \& SHAP) and counterfactual explanations that hill-climb during the counterfactual search (e.g., Wachter's Algorithm \& DiCE) into \textit{concealing} the model's biases. These vulnerabilities allow an adversary to deploy a biased model whose explanations do not reveal the bias, thereby deceiving stakeholders into trusting the model. We evaluate the manipulations on real-world datasets, including COMPAS and Communities \& Crime, and find that explanations can be manipulated in practice.
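The attack on perturbation-based attribution methods can be sketched as a "scaffolded" wrapper: it applies a biased rule on inputs that look in-distribution but an innocuous rule on the out-of-distribution perturbations that LIME and SHAP generate, so the explanation only ever sees the innocuous behavior. In the sketch below, is_in_distribution stands in for a detector (learned, in the paper's attack, by distinguishing real inputs from perturbed ones); all names are illustrative.
\begin{verbatim}
import numpy as np

def scaffolded_predict(X, is_in_distribution,
                       biased_rule, innocuous_rule):
    # Behave in a biased way on realistic inputs, but innocuously on
    # the off-manifold samples that perturbation-based explainers probe.
    in_dist = is_in_distribution(X)  # boolean mask per row
    return np.where(in_dist, biased_rule(X), innocuous_rule(X))
\end{verbatim}
Explanations computed from perturbations of real inputs then attribute the predictions almost entirely to the innocuous rule, even though deployed decisions follow the biased one.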
In situations where explanations of black-box models may be useful, the fairness of the black-box is also often a relevant concern. However, the link between the fairness of the black-box model and the behavior of explanations for that black-box is unclear. We focus on explanations applied to tabular datasets, and our findings suggest that explanations do not necessarily preserve the fairness properties of the black-box algorithm. In other words, explanation algorithms can ignore or obscure critically relevant properties, creating incorrect or misleading explanations. More broadly, we propose future research directions for evaluating and generating explanations such that they are informative and relevant from a fairness perspective.
Counterfactual explanations and adversarial examples have emerged as critical research areas for addressing the explainability and robustness goals of machine learning (ML). While counterfactual explanations were developed with the goal of providing recourse to individuals adversely impacted by algorithmic decisions, adversarial examples were designed to expose the vulnerabilities of ML models. While prior research has hinted at the commonalities between these frameworks, there has been little to no work on systematically exploring the connections between the literature on counterfactual explanations and that on adversarial examples. In this work, we make one of the first attempts at formalizing the connections between counterfactual explanations and adversarial examples. More specifically, we theoretically analyze salient counterfactual explanation and adversarial example generation methods, and highlight the conditions under which they behave similarly. Our analysis demonstrates that several popular counterfactual explanation and adversarial example generation methods, such as the ones proposed by Wachter et al. and Carlini and Wagner (with mean squared error loss), and C-CHVAE and the natural adversarial examples of Zhao et al., are equivalent. We also bound the distance between the counterfactual explanations and adversarial examples generated by the Wachter et al. and DeepFool methods for linear models. Finally, we empirically validate our theoretical findings using extensive experimentation with synthetic and real-world datasets.
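The structural similarity is easiest to see in the optimization itself. The sketch below runs gradient descent on Wachter et al.'s counterfactual objective, $\min_{x'} \lambda (f(x') - y')^2 + d(x, x')$, using a squared $\ell_2$ distance for simplicity (Wachter et al. originally use a MAD-weighted $\ell_1$ distance); swapping the distance and target terms yields Carlini-Wagner-style adversarial search. The autograd-style model_fn/grad_fn interface and the step size are illustrative assumptions.
\begin{verbatim}
import numpy as np

def wachter_counterfactual(model_fn, grad_fn, x, y_target,
                           lam=1.0, lr=0.01, steps=500):
    # Gradient descent on  lam * (f(x') - y')^2 + ||x' - x||_2^2.
    x_cf = x.copy()
    for _ in range(steps):
        # d/dx' of the objective: chain rule on the prediction term
        # plus the gradient of the squared distance term.
        g = (2 * lam * (model_fn(x_cf) - y_target) * grad_fn(x_cf)
             + 2 * (x_cf - x))
        x_cf = x_cf - lr * g
    return x_cf
\end{verbatim}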