Antske Fokkens

Improving Causal Interventions in Amnesic Probing with Mean Projection or LEACE

Jun 13, 2025

Alicja Dobrzeniecka, Antske Fokkens, Pia Sommerauer

Abstract:Amnesic probing is a technique used to examine the influence of specific linguistic information on the behaviour of a model. This involves identifying and removing the relevant information and then assessing whether the model's performance on the main task changes. If the removed information is relevant, the model's performance should decline. The difficulty with this approach lies in removing only the target information while leaving other information unchanged. It has been shown that Iterative Nullspace Projection (INLP), a widely used removal technique, introduces random modifications to representations when eliminating target information. We demonstrate that Mean Projection (MP) and LEACE, two proposed alternatives, remove information in a more targeted manner, thereby enhancing the potential for obtaining behavioural explanations through Amnesic Probing.
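
The removal step described in the abstract can be illustrated with a small sketch. It assumes that Mean Projection, for a binary property, amounts to projecting out the single direction that joins the two class means; the abstract does not spell out the procedure, so this is an illustration under that assumption and not the authors' implementation. The function names and toy data are hypothetical.

```python
import math

def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def mean_projection(reps, labels):
    """Remove the direction joining the two class means from every vector.

    Assumption: binary labels (0/1) and a single projection step.
    """
    pos = [r for r, y in zip(reps, labels) if y == 1]
    neg = [r for r, y in zip(reps, labels) if y == 0]
    direction = [a - b for a, b in zip(mean(pos), mean(neg))]
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        # Class means already coincide, so there is nothing to remove.
        return [list(r) for r in reps]
    unit = [d / norm for d in direction]
    cleaned = []
    for r in reps:
        coeff = sum(x * u for x, u in zip(r, unit))
        cleaned.append([x - coeff * u for x, u in zip(r, unit)])
    return cleaned

# Toy data: the first coordinate encodes the property, the second does not.
reps = [[2.0, 1.0], [4.0, 3.0], [-2.0, 1.0], [-4.0, 3.0]]
labels = [1, 1, 0, 0]
cleaned = mean_projection(reps, labels)
```

In this toy case the two class means coincide after the projection, while the second coordinate, which carries no information about the property, is left unchanged. That is the sense in which a removal is "targeted".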

Short-circuiting Shortcuts: Mechanistic Investigation of Shortcuts in Text Classification

May 09, 2025

Leon Eshuijs, Shihan Wang, Antske Fokkens

Abstract:Reliance on spurious correlations (shortcuts) has been shown to underlie many of the successes of language models. Previous work focused on identifying the input elements that impact prediction. We investigate how shortcuts are actually processed within the model's decision-making mechanism. We use actor names in movie reviews as controllable shortcuts with known impact on the outcome. We use mechanistic interpretability methods and identify specific attention heads that focus on shortcuts. These heads gear the model towards a label before processing the complete input, effectively making premature decisions that bypass contextual analysis. Based on these findings, we introduce Head-based Token Attribution (HTA), which traces intermediate decisions back to input tokens. We show that HTA is effective in detecting shortcuts in LLMs and enables targeted mitigation by selectively deactivating shortcut-related attention heads.

DefVerify: Do Hate Speech Models Reflect Their Dataset's Definition?

Oct 21, 2024

Urja Khurana, Eric Nalisnick, Antske Fokkens

Abstract:When building a predictive model, it is often difficult to ensure that domain-specific requirements are encoded by the model that will eventually be deployed. Consider researchers working on hate speech detection. They will have an idea of what is considered hate speech, but building a model that reflects their view accurately requires preserving those ideals throughout the workflow of data set construction and model training. Complications such as sampling bias, annotation bias, and model misspecification almost always arise, possibly resulting in a gap between the domain specification and the model's actual behavior upon deployment. To address this issue for hate speech detection, we propose DefVerify: a 3-step procedure that (i) encodes a user-specified definition of hate speech, (ii) quantifies to what extent the model reflects the intended definition, and (iii) tries to identify the point of failure in the workflow. We use DefVerify to find gaps between definition and model behavior when applied to six popular hate speech benchmark datasets.

* Preprint

Crowd-Calibrator: Can Annotator Disagreement Inform Calibration in Subjective Tasks?

Aug 26, 2024

Urja Khurana, Eric Nalisnick, Antske Fokkens, Swabha Swayamdipta

Abstract:Subjective tasks in NLP have been mostly relegated to objective standards, where the gold label is decided by taking the majority vote. This obfuscates annotator disagreement and the inherent uncertainty of the label. We argue that subjectivity should factor into model decisions and play a direct role via calibration under a selective prediction setting. Specifically, instead of calibrating confidence purely from the model's perspective, we calibrate models for subjective tasks based on crowd worker agreement. Our method, Crowd-Calibrator, models the distance between the distribution of crowd worker labels and the model's own distribution over labels to inform whether the model should abstain from a decision. On two highly subjective tasks, hate speech detection and natural language inference, our experiments show Crowd-Calibrator either outperforms or achieves competitive performance with existing selective prediction baselines. Our findings highlight the value of bringing human decision-making into model predictions.

* Accepted at COLM 2024
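
The abstention idea in the abstract can be sketched as follows. The choice of total variation as the distance, the threshold value, and the assumption that a crowd label distribution (or an estimate of it) is available at decision time are all illustrative; the abstract does not specify them, and the function names are hypothetical.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def predict_or_abstain(model_probs, crowd_probs, threshold=0.3):
    """Return (label, distance); label is None when the model abstains.

    Assumption: the model abstains when its label distribution is far
    from the crowd's distribution, as measured by total variation.
    """
    distance = total_variation(model_probs, crowd_probs)
    if distance > threshold:
        return None, distance
    label = max(range(len(model_probs)), key=lambda i: model_probs[i])
    return label, distance

# The model is confident but the crowd is split: abstain.
split_label, split_dist = predict_or_abstain([0.9, 0.1], [0.5, 0.5])
# The model and the crowd roughly agree: predict the top label.
agree_label, agree_dist = predict_or_abstain([0.8, 0.2], [0.7, 0.3])
```

With the hypothetical threshold of 0.3, the first call abstains and the second returns label 0.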

Balancing the Scales: Reinforcement Learning for Fair Classification

Jul 15, 2024

Leon Eshuijs, Shihan Wang, Antske Fokkens

Abstract:Fairness in classification tasks has traditionally focused on bias removal from neural representations, but recent trends favor algorithmic methods that embed fairness into the training process. These methods steer models towards fair performance, preventing potential elimination of valuable information that arises from representation manipulation. Reinforcement Learning (RL), with its capacity for learning through interaction and adjusting reward functions to encourage desired behaviors, emerges as a promising tool in this domain. In this paper, we explore the usage of RL to address bias in imbalanced classification by scaling the reward function to mitigate bias. We employ the contextual multi-armed bandit framework and adapt three popular RL algorithms to suit our objectives, demonstrating a novel approach to mitigating bias.

ARM: Efficient Guided Decoding with Autoregressive Reward Models

Jul 05, 2024

Sergey Troshin, Vlad Niculae, Antske Fokkens

Abstract:Language models trained on large amounts of data require careful tuning to be safely deployed in the real world. We revisit the guided decoding paradigm, where the goal is to augment the logits of the base language model using the scores from a task-specific reward model. We propose a simple but efficient parameterization of the autoregressive reward model enabling fast and effective guided decoding. On detoxification and sentiment control tasks, we show that our efficient parameterization performs on par with RAD, a strong but less efficient guided decoding approach.
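
A single guided decoding step, as described in the abstract, might look like the sketch below. It assumes the reward scores are added to the base logits with a weight `beta`; the additive form, the `beta` parameter, and the function names are assumptions for illustration, and the sketch says nothing about how the paper parameterizes the reward model itself.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def guided_next_token_probs(base_logits, reward_scores, beta=1.0):
    """Augment base logits with per-token reward scores, then normalize.

    Assumption: additive combination, with beta controlling how strongly
    the reward model steers the base model.
    """
    augmented = [l + beta * r for l, r in zip(base_logits, reward_scores)]
    return softmax(augmented)

base_logits = [2.0, 1.0, 0.0]      # base model prefers token 0
reward_scores = [-3.0, 0.0, 0.0]   # reward model penalizes token 0

unguided = guided_next_token_probs(base_logits, reward_scores, beta=0.0)
guided = guided_next_token_probs(base_logits, reward_scores, beta=1.0)
```

With `beta=0.0` the base model's preference for token 0 is unchanged; with `beta=1.0` the penalty moves the most probable token to index 1.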

Investigating the Robustness of Modelling Decisions for Few-Shot Cross-Topic Stance Detection: A Preregistered Study

Apr 05, 2024

Myrthe Reuver, Suzan Verberne, Antske Fokkens

Abstract:For a viewpoint-diverse news recommender, identifying whether two news articles express the same viewpoint is essential. One way to determine "same or different" viewpoint is stance detection. In this paper, we investigate the robustness of operationalization choices for few-shot stance detection, with special attention to modelling stance across different topics. Our experiments test pre-registered hypotheses on stance detection. Specifically, we compare two stance task definitions (Pro/Con versus Same Side Stance), two LLM architectures (bi-encoding versus cross-encoding), and adding Natural Language Inference knowledge, with pre-trained RoBERTa models trained with shots of 100 examples from 7 different stance detection datasets. Some of our hypotheses and claims from earlier work can be confirmed, while others give more inconsistent results. The effect of the Same Side Stance definition on performance differs per dataset and is influenced by other modelling choices. We found no relationship between the number of training topics in the training shots and performance. In general, cross-encoding outperforms bi-encoding, and adding NLI training to our models gives considerable improvement, but these results are not consistent across all datasets. Our results indicate that it is essential to include multiple datasets and systematic modelling experiments when aiming to find robust modelling choices for the concept 'stance'.

* Accepted at LREC-COLING 2024: cite the published version when available

The Role of Syntactic Span Preferences in Post-Hoc Explanation Disagreement

Mar 28, 2024

Jonathan Kamp, Lisa Beinborn, Antske Fokkens

Abstract:Post-hoc explanation methods are an important tool for increasing model transparency for users. Unfortunately, the currently used methods for attributing token importance often yield diverging patterns. In this work, we study potential sources of disagreement across methods from a linguistic perspective. We find that different methods systematically select different classes of words and that methods that agree most with other methods and with humans display similar linguistic preferences. Token-level differences between methods are smoothed out if we compare them on the syntactic span level. We also find higher agreement across methods by estimating the most important spans dynamically instead of relying on a fixed subset of size $k$. We systematically investigate the interaction between $k$ and spans and propose an improved configuration for selecting important tokens.

* Long paper accepted to LREC-COLING 2024 main conference. Please cite the conference proceedings version when available

Dynamic Top-k Estimation Consolidates Disagreement between Feature Attribution Methods

Oct 09, 2023

Jonathan Kamp, Lisa Beinborn, Antske Fokkens

Abstract:Feature attribution scores are used for explaining the prediction of a text classifier to users by highlighting k tokens. In this work, we propose a way to determine the optimal number k of tokens that should be displayed from sequential properties of the attribution scores. Our approach is dynamic across sentences, method-agnostic, and deals with sentence length bias. We compare agreement between multiple methods and humans on an NLI task, using fixed k and dynamic k. We find that perturbation-based methods and Vanilla Gradient exhibit the highest agreement on most method-method and method-human agreement metrics with a static k. Their advantage over other methods disappears with dynamic ks, which mainly improve Integrated Gradient and GradientXInput. To our knowledge, this is the first evidence that sequential properties of attribution scores are informative for consolidating attribution signals for human interpretation.

* Short paper accepted to EMNLP 2023 main conference

Improving and Evaluating the Detection of Fragmentation in News Recommendations with the Clustering of News Story Chains

Sep 18, 2023

Alessandra Polimeno, Myrthe Reuver, Sanne Vrijenhoek, Antske Fokkens

Abstract:News recommender systems play an increasingly influential role in shaping information access within democratic societies. However, tailoring recommendations to users' specific interests can result in the divergence of information streams. Fragmented access to information poses challenges to the integrity of the public sphere, thereby influencing democracy and public discourse. The Fragmentation metric quantifies the degree of fragmentation of information streams in news recommendations. Accurate measurement of this metric requires the application of Natural Language Processing (NLP) to identify distinct news events, stories, or timelines. This paper presents an extensive investigation of various approaches for quantifying Fragmentation in news recommendations. These approaches are evaluated both intrinsically, by measuring performance on news story clustering, and extrinsically, by assessing the Fragmentation scores of different simulated news recommender scenarios. Our findings demonstrate that agglomerative hierarchical clustering coupled with SentenceBERT text representation is substantially better at detecting Fragmentation than earlier implementations. Additionally, the analysis of simulated scenarios yields valuable insights and recommendations for stakeholders concerning the measurement and interpretation of Fragmentation.

* NORMalize 2023: The First Workshop on the Normative Design and Evaluation of Recommender Systems, September 19, 2023, co-located with the ACM Conference on Recommender Systems 2023 (RecSys 2023), Singapore
* Cite published version: Polimeno et al., Improving and Evaluating the Detection of Fragmentation in News Recommendations with the Clustering of News Story Chains, NORMalize 2023: The First Workshop on the Normative Design and Evaluation of Recommender Systems, September 19, 2023, co-located with the ACM Conference on Recommender Systems 2023 (RecSys 2023), Singapore
