Abstract: Refusal is a key safety behavior in aligned language models, yet the internal mechanisms driving refusals remain opaque. In this work, we conduct a mechanistic study of refusal in instruction-tuned LLMs, using sparse autoencoders to identify latent features that causally mediate refusal behaviors. We apply our method to two open-source chat models and intervene on refusal-related features to assess their influence on generation, validating their behavioral impact across multiple harmful datasets. This enables a fine-grained inspection of how refusal manifests at the activation level and addresses key research questions such as investigating upstream-downstream latent relationships and understanding the mechanisms of adversarial jailbreaking techniques. We also establish the usefulness of refusal features in improving the generalization of linear probes to out-of-distribution adversarial samples in classification tasks. We open-source our code at https://github.com/wj210/refusal_sae.
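Below is a minimal sketch (not the released code) of the kind of intervention described above: clamping a single SAE latent hypothesised to mediate refusal via a PyTorch forward hook. The SAE class, layer index, and feature index are illustrative assumptions.

```python
# Minimal sketch: clamp one SAE latent during generation to probe its causal
# role in refusal. The SAE definition, hook target, and indices are assumptions.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy SAE: encodes residual-stream activations into sparse latents."""
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_latent)
        self.dec = nn.Linear(d_latent, d_model)

    def encode(self, x):
        return torch.relu(self.enc(x))

    def decode(self, z):
        return self.dec(z)

def make_clamp_hook(sae: SparseAutoencoder, feature_idx: int, value: float):
    """Forward hook that clamps one SAE latent and writes the reconstruction
    (plus the SAE reconstruction error) back into the residual stream."""
    def hook(module, inputs, output):
        resid = output[0] if isinstance(output, tuple) else output
        z = sae.encode(resid)
        error = resid - sae.decode(z)          # preserve what the SAE misses
        z[..., feature_idx] = value            # intervene on the refusal latent
        steered = sae.decode(z) + error
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return hook

# Usage (hypothetical layer and feature index, for illustration only):
# handle = model.model.layers[14].register_forward_hook(
#     make_clamp_hook(sae, feature_idx=4096, value=0.0))  # ablate the feature
# out = model.generate(**inputs); handle.remove()
```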
Abstract: Multimodal models like CLIP have gained significant attention due to their remarkable zero-shot performance across various tasks. However, studies have revealed that CLIP can inadvertently learn spurious associations between target variables and confounding factors. To address this, we introduce \textsc{Locate-Then-Correct} (LTC), a contrastive framework that identifies spurious attention heads in Vision Transformers via mechanistic insights and mitigates them through targeted ablation. Furthermore, LTC identifies salient, task-relevant attention heads, enabling the integration of discriminative features through orthogonal projection to improve classification performance. We evaluate LTC on benchmarks with inherent background and gender biases, achieving a gain of over $50\%$ in worst-group accuracy compared to non-training post-hoc baselines. Additionally, we visualize the representations of selected heads and find that the resulting interpretations corroborate our contrastive mechanism for identifying both spurious and salient attention heads. Code is available at https://github.com/wj210/CLIP_LTC.
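The sketch below illustrates, under our own simplified assumptions, the two LTC operations described above applied to a per-head decomposition of CLIP's image embedding: targeted ablation of spurious heads and orthogonal projection of salient heads onto a text-derived discriminative direction. Tensor names and the zero-ablation choice are illustrative, not the released implementation.

```python
# Illustrative sketch: edit per-head contributions to CLIP's image embedding.
import torch

def ltc_edit(head_contribs: torch.Tensor,      # [n_heads, d_embed] per-head contributions
             spurious: list[int],               # indices of heads flagged as spurious
             salient: list[int],                # indices of task-relevant heads
             class_direction: torch.Tensor):    # [d_embed] text-derived class-difference direction
    edited = head_contribs.clone()
    # 1) Targeted ablation: zero out spurious-head contributions
    #    (a mean over a reference set could be used instead of zeros).
    edited[spurious] = 0.0
    # 2) Orthogonal projection: keep only the component of each salient head
    #    that lies along the discriminative class direction.
    d = class_direction / class_direction.norm()
    for h in salient:
        edited[h] = (edited[h] @ d) * d
    # Recompose the image embedding from the edited head contributions.
    return edited.sum(dim=0)
```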
Abstract: Large Language Models (LLMs) are capable of generating persuasive Natural Language Explanations (NLEs) to justify their answers. However, the faithfulness of these explanations should not be taken at face value. Recent studies have proposed various methods to measure the faithfulness of NLEs, typically by inserting perturbations at the explanation or feature level. We argue that these approaches are neither comprehensive nor correctly designed according to the established definition of faithfulness. Moreover, we highlight the risks of grounding faithfulness findings on out-of-distribution samples. In this work, we leverage a causal mediation technique called activation patching to measure the faithfulness of an explanation towards supporting the explained answer. Our proposed metric, Causal Faithfulness, quantifies the consistency of causal attributions between explanations and the corresponding model outputs as an indicator of faithfulness. We experiment on models ranging from 2B to 27B parameters and find that models that underwent alignment tuning tend to produce more faithful and plausible explanations. We find that Causal Faithfulness is a promising improvement over existing faithfulness tests because it takes into account the model's internal computations and avoids out-of-distribution concerns that could otherwise undermine the validity of faithfulness assessments. We release the code at \url{https://github.com/wj210/Causal-Faithfulness}.
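For readers unfamiliar with the underlying primitive, the sketch below shows a generic activation-patching measurement with PyTorch hooks: restore a clean activation into a corrupted run and record the effect on the answer token. It is illustrative only; Causal Faithfulness compares such attribution profiles between the answer and the explanation rather than reporting a single patched logit.

```python
# Generic activation-patching sketch (assumes a Hugging Face causal LM and
# token-aligned clean/corrupted prompts; layer choice is up to the caller).
import torch

@torch.no_grad()
def patch_effect(model, layer, clean_inputs, corrupt_inputs, answer_token_id):
    cache = {}

    def save_hook(module, inp, out):
        cache["clean"] = out[0] if isinstance(out, tuple) else out

    def patch_hook(module, inp, out):
        resid = out[0] if isinstance(out, tuple) else out
        patched = cache["clean"].to(resid.dtype)
        return (patched,) + out[1:] if isinstance(out, tuple) else patched

    # 1) Clean run: cache the activation at the chosen layer.
    h = layer.register_forward_hook(save_hook)
    model(**clean_inputs)
    h.remove()

    # 2) Corrupted run without patching (baseline).
    base_logits = model(**corrupt_inputs).logits[0, -1]

    # 3) Corrupted run with the clean activation patched in.
    h = layer.register_forward_hook(patch_hook)
    patched_logits = model(**corrupt_inputs).logits[0, -1]
    h.remove()

    # Effect of restoring this activation on the answer token.
    return (patched_logits[answer_token_id] - base_logits[answer_token_id]).item()
```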
Abstract: Large language models (LLMs) often require extensive labeled datasets and training compute to achieve impressive performance across downstream tasks. This paper explores a self-training paradigm in which the LLM autonomously curates its own labels and selectively trains on unknown data samples identified through a reference-free consistency method. Empirical evaluations demonstrate significant reductions in hallucination during generation across multiple subjects. Furthermore, the selective training framework mitigates catastrophic forgetting on out-of-distribution benchmarks, addressing a critical limitation of training LLMs. Our findings suggest that such an approach can substantially reduce the dependency on large labeled datasets, paving the way for more scalable and cost-effective language model training.
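A minimal sketch of a reference-free consistency filter of the kind described above: sample several answers per question, treat low-agreement questions as "unknown" samples, and use the majority answer as the self-curated label. The sampling budget and threshold are illustrative assumptions, not the paper's settings.

```python
# Reference-free consistency filter (illustrative values, not the paper's).
from collections import Counter

def select_unknown_samples(generate_fn, questions, n_samples=8, threshold=0.6):
    """generate_fn(question) -> one sampled answer string."""
    selected = []
    for q in questions:
        answers = [generate_fn(q) for _ in range(n_samples)]
        label, count = Counter(answers).most_common(1)[0]
        consistency = count / n_samples
        if consistency < threshold:          # model is unsure -> worth training on
            selected.append({"question": q, "label": label,
                             "consistency": consistency})
    return selected
```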
Abstract: Prompt engineering has garnered significant attention for enhancing the performance of large language models across a multitude of tasks. Techniques such as Chain-of-Thought not only bolster task performance but also delineate a clear trajectory of reasoning steps, offering a tangible form of explanation to the audience. Prior works on interpretability assess the reasoning chains yielded by Chain-of-Thought solely along a single axis, namely faithfulness. We present a comprehensive and multifaceted evaluation of interpretability, examining not only faithfulness but also robustness and utility across multiple commonsense reasoning benchmarks. Moreover, our investigation is not confined to a single prompting technique; it covers a broad range of prevalent prompting techniques employed in large language models, ensuring a wide-ranging evaluation. In addition, we introduce a simple interpretability alignment technique, termed Self-Entailment-Alignment Chain-of-Thought, which yields improvements of more than 70\% across multiple dimensions of interpretability. Code is available at https://github.com/wj210/CoT_interpretability.
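As a rough illustration (not the paper's exact procedure), the sketch below scores how strongly a sampled reasoning chain entails its stated answer using an off-the-shelf NLI model and keeps the best-aligned chain; the model choice and selection rule are assumptions.

```python
# Speculative sketch: rank sampled (rationale, answer) pairs by entailment.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def entailment_score(rationale: str, answer: str) -> float:
    """Probability that the reasoning chain entails the stated answer."""
    out = nli({"text": rationale, "text_pair": f"The answer is {answer}."},
              top_k=None)
    return next(o["score"] for o in out if o["label"] == "ENTAILMENT")

def select_aligned_chain(samples):
    """samples: list of (rationale, answer) pairs sampled from the LLM."""
    return max(samples, key=lambda s: entailment_score(*s))
```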
Abstract: The increasing use of complex and opaque black-box models calls for more interpretable alternatives; one such option is extractive rationalizing models. These models, also known as Explain-Then-Predict models, employ an explainer model to extract rationales and subsequently condition the predictor on the extracted information. Their primary objective is to provide precise and faithful explanations, represented by the extracted rationales. In this paper, we take a semi-supervised approach to optimizing the plausibility of extracted rationales. We adopt a pre-trained natural language inference (NLI) model and further fine-tune it on a small set of supervised rationales ($10\%$). The NLI predictor is leveraged as a source of supervisory signals to the explainer via entailment alignment. We show that, by enforcing alignment agreement between the explanation and the answer in a question-answering task, performance can be improved without access to ground-truth labels. We evaluate our approach on the ERASER dataset and show that it achieves results comparable to supervised extractive models and outperforms unsupervised approaches by $>100\%$.
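A minimal sketch of the entailment-alignment signal described above, under our own assumptions: the NLI predictor scores whether the extracted rationale entails the question-answer pair, and the negative log entailment probability serves as the explainer's alignment loss. Propagating this signal through the discrete rationale selection (e.g., via a relaxation or policy gradient) is omitted for brevity.

```python
# Illustrative alignment loss from an NLI predictor (label index is model-dependent).
import torch
import torch.nn.functional as F

def entailment_alignment_loss(nli_model, tokenizer, rationale: str,
                              question: str, answer: str,
                              entail_idx: int = 2) -> torch.Tensor:
    """entail_idx is the NLI label index for 'entailment' (an assumption here)."""
    enc = tokenizer(rationale, f"{question} {answer}",
                    return_tensors="pt", truncation=True)
    logits = nli_model(**enc).logits            # [1, 3] NLI class logits
    log_probs = F.log_softmax(logits, dim=-1)
    return -log_probs[0, entail_idx]            # encourage rationale to entail (q, a)
```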
Abstract: The success of artificial intelligence (AI), and deep learning models in particular, has led to their widespread adoption across various industries due to their ability to process huge amounts of data and learn complex patterns. However, due to their lack of explainability, there are significant concerns regarding their use in critical sectors, such as finance and healthcare, where decision-making transparency is of paramount importance. In this paper, we provide a comparative survey of methods that aim to improve the explainability of deep learning models within the context of finance. We categorize the collection of explainable AI methods according to their corresponding characteristics, and we review the concerns and challenges of adopting explainable AI methods, together with future directions we deem appropriate and important.