Leon Eshuijs

But what is your honest answer? Aiding LLM-judges with honest alternatives using steering vectors

May 23, 2025

Leon Eshuijs, Archie Chaudhury, Alan McBeth, Ethan Nguyen

Abstract: Recent safety evaluations of Large Language Models (LLMs) show that many models exhibit dishonest behavior, such as sycophancy. However, most honesty benchmarks focus exclusively on factual knowledge or explicitly harmful behavior and rely on external judges, which are often unable to detect less obvious forms of dishonesty. In this work, we introduce a new framework, Judge Using Safety-Steered Alternatives (JUSSA), which utilizes steering vectors trained on a single sample to elicit more honest responses from models, helping LLM-judges in the detection of dishonest behavior. To test our framework, we introduce a new manipulation dataset with prompts specifically designed to elicit deceptive responses. We find that JUSSA enables LLM judges to better differentiate between dishonest and benign responses, and helps them identify subtle instances of manipulative behavior.
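
As a rough illustration of the steering idea mentioned in the abstract, the sketch below shows the generic activation-steering step: a fixed vector is added to a hidden state, scaled by a strength factor. This is a minimal sketch under stated assumptions, not the paper's implementation. The function name `apply_steering`, the `strength` parameter, and the toy vectors are hypothetical, and the abstract does not say how the vector is trained or at which layer it is applied.

```python
def apply_steering(hidden_state, steering_vector, strength=1.0):
    """Return hidden_state shifted along steering_vector.

    Hypothetical helper: the name, signature, and plain-list
    representation are assumptions made for illustration only.
    """
    if len(hidden_state) != len(steering_vector):
        raise ValueError("hidden state and steering vector must have equal length")
    # Element-wise addition of the scaled steering direction.
    return [h + strength * v for h, v in zip(hidden_state, steering_vector)]


# Toy 2-dimensional example; real hidden states have thousands of dimensions.
steered = apply_steering([1.0, 2.0], [0.5, -1.0], strength=2.0)
```

In a setup like the one the abstract describes, the model would generate once without the shift and once with it, and the judge would see the steered response as an alternative next to the original.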

Short-circuiting Shortcuts: Mechanistic Investigation of Shortcuts in Text Classification

May 09, 2025

Leon Eshuijs, Shihan Wang, Antske Fokkens

Abstract: Reliance on spurious correlations (shortcuts) has been shown to underlie many of the successes of language models. Previous work focused on identifying the input elements that impact prediction. We investigate how shortcuts are actually processed within the model's decision-making mechanism. We use actor names in movie reviews as controllable shortcuts with known impact on the outcome. We use mechanistic interpretability methods and identify specific attention heads that focus on shortcuts. These heads gear the model towards a label before processing the complete input, effectively making premature decisions that bypass contextual analysis. Based on these findings, we introduce Head-based Token Attribution (HTA), which traces intermediate decisions back to input tokens. We show that HTA is effective in detecting shortcuts in LLMs and enables targeted mitigation by selectively deactivating shortcut-related attention heads.
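
The mitigation step described in the abstract, deactivating selected attention heads, can be pictured with a small sketch. It assumes each head contributes an additive output vector and that a deactivated head is simply zeroed out (zero-ablation). The abstract does not state which ablation variant the authors use, so treat this as one possible reading. The name `combine_heads` and the toy values are hypothetical.

```python
def combine_heads(head_outputs, disabled=frozenset()):
    """Sum per-head output vectors, skipping any head listed in `disabled`.

    Hypothetical helper illustrating zero-ablation: a disabled head
    contributes nothing to the combined output.
    """
    width = len(head_outputs[0])
    total = [0.0] * width
    for index, output in enumerate(head_outputs):
        if index in disabled:
            continue  # this head is deactivated
        for i, value in enumerate(output):
            total[i] += value
    return total


# Three toy heads with 2-dimensional outputs; head 2 plays the shortcut head.
heads = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
full = combine_heads(heads)
without_shortcut = combine_heads(heads, disabled={2})
```

Comparing the model's prediction with and without the suspected heads is one way to check whether they carry the shortcut signal.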

Balancing the Scales: Reinforcement Learning for Fair Classification

Jul 15, 2024

Leon Eshuijs, Shihan Wang, Antske Fokkens

Abstract: Fairness in classification tasks has traditionally focused on bias removal from neural representations, but recent trends favor algorithmic methods that embed fairness into the training process. These methods steer models towards fair performance, preventing potential elimination of valuable information that arises from representation manipulation. Reinforcement Learning (RL), with its capacity for learning through interaction and adjusting reward functions to encourage desired behaviors, emerges as a promising tool in this domain. In this paper, we explore the usage of RL to address bias in imbalanced classification by scaling the reward function to mitigate bias. We employ the contextual multi-armed bandit framework and adapt three popular RL algorithms to suit our objectives, demonstrating a novel approach to mitigating bias.
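
To make the reward-scaling idea concrete, here is a minimal sketch of classification framed as a contextual bandit: each example is a context, the predicted label is the action, and the reward arrives immediately. The inverse-frequency scaling rule below is an assumption chosen for illustration, since the abstract does not specify the scaling the authors use. The names `reward_scales` and `scaled_reward` are hypothetical.

```python
from collections import Counter


def reward_scales(labels):
    """Assumed rule: scale each class by (largest class count / its own count)."""
    counts = Counter(labels)
    largest = max(counts.values())
    return {label: largest / n for label, n in counts.items()}


def scaled_reward(action, true_label, scales):
    """Return +scale for a correct prediction and -scale for a wrong one.

    The scale is looked up by the true label, so mistakes and successes
    on rare classes weigh more than those on frequent classes.
    """
    scale = scales[true_label]
    return scale if action == true_label else -scale


# Toy imbalanced dataset: 8 majority examples and 2 minority examples.
labels = ["majority"] * 8 + ["minority"] * 2
scales = reward_scales(labels)
```

With this rule a correct prediction on a minority example earns four times the reward of a correct prediction on a majority example, which pushes the learned policy away from always choosing the frequent class.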
