Mohammad Taher Pilehvar

Cardiff University

Exploring State Tracking Capabilities of Large Language Models

Nov 13, 2025

Kiamehr Rezaee, Jose Camacho-Collados, Mohammad Taher Pilehvar

Abstract:Large Language Models (LLMs) have demonstrated impressive capabilities in solving complex tasks, including those requiring a certain level of reasoning. In this paper, we focus on state tracking, a problem where models need to keep track of the state governing a number of entities. To isolate the state tracking component from other factors, we propose a benchmark based on three well-defined state tracking tasks and analyse the performance of LLMs in different scenarios. The results indicate that the recent generation of LLMs (specifically, GPT-4 and Llama3) are capable of tracking state, especially when integrated with mechanisms such as Chain of Thought. However, models from the former generation, while understanding the task and being able to solve it at the initial stages, often fail at this task after a certain number of steps.
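
To make the notion of state tracking concrete, here is a minimal sketch of a toy task in which entities swap items and the ground-truth state is recomputed after every step. The swap task, the entity names, and the helpers `simulate_swaps`, `make_episode`, and `accuracy_by_step` are illustrative assumptions; they are not the three tasks or the evaluation code of the benchmark described in the abstract.

```python
import random


def simulate_swaps(initial, swaps):
    """Return the ground-truth state after applying each swap in order."""
    state = dict(initial)
    for a, b in swaps:
        state[a], state[b] = state[b], state[a]
    return state


def make_episode(entities, items, num_steps, seed=0):
    """Build a hypothetical episode: an initial assignment plus random swaps."""
    rng = random.Random(seed)
    initial = dict(zip(entities, items))
    swaps = [tuple(rng.sample(entities, 2)) for _ in range(num_steps)]
    return initial, swaps


def accuracy_by_step(initial, swaps, predict):
    """Check a predictor against the true state after every prefix of swaps."""
    hits = []
    for t in range(1, len(swaps) + 1):
        truth = simulate_swaps(initial, swaps[:t])
        hits.append(predict(initial, swaps[:t]) == truth)
    return hits


# Hand-written example (all names are made up for illustration).
initial = {"Alice": "red", "Bob": "blue", "Carol": "green"}
swaps = [("Alice", "Bob"), ("Bob", "Carol")]
final_state = simulate_swaps(initial, swaps)

# A stand-in "model" that stops updating after the first swap, mimicking
# the failure mode of losing track after a certain number of steps.
forgetful = lambda init, sw: simulate_swaps(init, sw[:1])
per_step = accuracy_by_step(initial, swaps, forgetful)
```

Scoring after every prefix, rather than only at the end, shows the step at which a predictor starts to lose track of the state.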

MORABLES: A Benchmark for Assessing Abstract Moral Reasoning in LLMs with Fables

Sep 15, 2025

Matteo Marcuzzo, Alessandro Zangari, Andrea Albarelli, Jose Camacho-Collados, Mohammad Taher Pilehvar

Abstract:As LLMs excel on standard reading comprehension benchmarks, attention is shifting toward evaluating their capacity for complex abstract reasoning and inference. Literature-based benchmarks, with their rich narrative and moral depth, provide a compelling framework for evaluating such deeper comprehension skills. Here, we present MORABLES, a human-verified benchmark built from fables and short stories drawn from historical literature. The main task is structured as multiple-choice questions targeting moral inference, with carefully crafted distractors that challenge models to go beyond shallow, extractive question answering. To further stress-test model robustness, we introduce adversarial variants designed to surface LLM vulnerabilities and shortcuts due to issues such as data contamination. Our findings show that, while larger models outperform smaller ones, they remain susceptible to adversarial manipulation and often rely on superficial patterns rather than true moral reasoning. This brittleness results in significant self-contradiction, with the best models refuting their own answers in roughly 20% of cases depending on the framing of the moral choice. Interestingly, reasoning-enhanced models fail to bridge this gap, suggesting that scale - not reasoning ability - is the primary driver of performance.

* Accepted to EMNLP 2025 Main Conference

Graph of Attacks: Improved Black-Box and Interpretable Jailbreaks for LLMs

Apr 26, 2025

Mohammad Akbar-Tajari, Mohammad Taher Pilehvar, Mohammad Mahmoody

Abstract:The challenge of ensuring Large Language Models (LLMs) align with societal standards is of increasing interest, as these models are still prone to adversarial jailbreaks that bypass their safety mechanisms. Identifying these vulnerabilities is crucial for enhancing the robustness of LLMs against such exploits. We propose Graph of ATtacks (GoAT), a method for generating adversarial prompts to test the robustness of LLM alignment using the Graph of Thoughts framework [Besta et al., 2024]. GoAT excels at generating highly effective jailbreak prompts with fewer queries to the victim model than state-of-the-art attacks, achieving up to five times better jailbreak success rate against robust models like Llama. Notably, GoAT creates high-quality, human-readable prompts without requiring access to the targeted model's parameters, making it a black-box attack. Unlike approaches constrained by tree-based reasoning, GoAT's reasoning is based on a more intricate graph structure. By making simultaneous attack paths aware of each other's progress, this dynamic framework allows a deeper integration and refinement of reasoning paths, significantly enhancing the collaborative exploration of adversarial vulnerabilities in LLMs. At a technical level, GoAT starts with a graph structure and iteratively refines it by combining and improving thoughts, enabling synergy between different thought paths. The code for our implementation can be found at: https://github.com/GoAT-pydev/Graph_of_Attacks.

* 19 pages, 1 figure, 6 tables

Gender Encoding Patterns in Pretrained Language Model Representations

Mar 09, 2025

Mahdi Zakizadeh, Mohammad Taher Pilehvar

Abstract:Gender bias in pretrained language models (PLMs) poses significant social and ethical challenges. Despite growing awareness, there is a lack of comprehensive investigation into how different models internally represent and propagate such biases. This study adopts an information-theoretic approach to analyze how gender biases are encoded within various encoder-based architectures. We focus on three key aspects: identifying how models encode gender information and biases, examining the impact of bias mitigation techniques and fine-tuning on the encoded biases and their effectiveness, and exploring how model design differences influence the encoding of biases. Through rigorous and systematic investigation, our findings reveal a consistent pattern of gender encoding across diverse models. Surprisingly, debiasing techniques often exhibit limited efficacy, sometimes inadvertently increasing the encoded bias in internal representations while reducing bias in model output distributions. This highlights a disconnect between mitigating bias in output distributions and addressing its internal representations. This work provides valuable guidance for advancing bias mitigation strategies and fostering the development of more equitable language models.

* Proceedings of the 5th Workshop on Trustworthy Natural Language Processing (TrustNLP 2025)

PerCul: A Story-Driven Cultural Evaluation of LLMs in Persian

Feb 11, 2025

Erfan Moosavi Monazzah, Vahid Rahimzadeh, Yadollah Yaghoobzadeh, Azadeh Shakery, Mohammad Taher Pilehvar

Abstract:Large language models predominantly reflect Western cultures, largely due to the dominance of English-centric training data. This imbalance presents a significant challenge, as LLMs are increasingly used across diverse contexts without adequate evaluation of their cultural competence in non-English languages, including Persian. To address this gap, we introduce PerCul, a carefully constructed dataset designed to assess the sensitivity of LLMs toward Persian culture. PerCul features story-based, multiple-choice questions that capture culturally nuanced scenarios. Unlike existing benchmarks, PerCul is curated with input from native Persian annotators to ensure authenticity and to prevent the use of translation as a shortcut. We evaluate several state-of-the-art multilingual and Persian-specific LLMs, establishing a foundation for future research in cross-cultural NLP evaluation. Our experiments demonstrate an 11.3% gap between the best closed-source model and the layperson baseline, while the gap increases to 21.3% with the best open-weight model. The dataset is available at https://huggingface.co/datasets/teias-ai/percul

* Accepted at NAACL 2025 Main Conference, the dataset is available on HuggingFace (see https://huggingface.co/datasets/teias-ai/percul)

Blind Men and the Elephant: Diverse Perspectives on Gender Stereotypes in Benchmark Datasets

Jan 02, 2025

Mahdi Zakizadeh, Mohammad Taher Pilehvar

Abstract:The multifaceted challenge of accurately measuring gender stereotypical bias in language models is akin to discerning different segments of a broader, unseen entity. This short paper primarily focuses on intrinsic bias mitigation and measurement strategies for language models, building on prior research that demonstrates a lack of correlation between intrinsic and extrinsic approaches. We delve deeper into intrinsic measurements, identifying inconsistencies and suggesting that these benchmarks may reflect different facets of gender stereotype. Our methodology involves analyzing data distributions across datasets and integrating gender stereotype components informed by social psychology. By adjusting the distribution of two datasets, we achieve a better alignment of outcomes. Our findings underscore the complexity of gender stereotyping in language models and point to new directions for developing more refined techniques to detect and reduce bias.

FarExStance: Explainable Stance Detection for Farsi

Dec 18, 2024

Majid Zarharan, Maryam Hashemi, Malika Behroozrazegh, Sauleh Eetemadi, Mohammad Taher Pilehvar, Jennifer Foster

Abstract:We introduce FarExStance, a new dataset for explainable stance detection in Farsi. Each instance in this dataset contains a claim, the stance of an article or social media post towards that claim, and an extractive explanation which provides evidence for the stance label. We compare the performance of a fine-tuned multilingual RoBERTa model to several large language models in zero-shot, few-shot, and parameter-efficient fine-tuned settings on our new dataset. On stance detection, the most accurate models are the fine-tuned RoBERTa model, the LLM Aya-23-8B which has been fine-tuned using parameter-efficient fine-tuning, and few-shot Claude-3.5-Sonnet. Regarding the quality of the explanations, our automatic evaluation metrics indicate that few-shot GPT-4o generates the most coherent explanations, while our human evaluation reveals that the best Overall Explanation Score (OES) belongs to few-shot Claude-3.5-Sonnet. The fine-tuned Aya-23-8B model produced explanations most closely aligned with the reference explanations.

* Accepted at COLING 2025

NormXLogit: The Head-on-Top Never Lies

Nov 25, 2024

Sina Abbasi, Mohammad Reza Modarres, Mohammad Taher Pilehvar

Abstract:The Transformer architecture has emerged as the dominant choice for building large language models (LLMs). However, with new LLMs emerging on a frequent basis, it is important to consider the potential value of architecture-agnostic approaches that can provide interpretability across a variety of architectures. Despite recent successes in the interpretability of LLMs, many existing approaches rely on complex methods that are often tied to a specific model design and come with a significant computational cost. To address these limitations, we propose a novel technique, called NormXLogit, for assessing the significance of individual input tokens. This method operates based on the input and output representations associated with each token. First, we demonstrate that during the pre-training of LLMs, the norms of word embeddings capture the importance of input tokens. Second, we reveal a significant relationship between a token's importance and the extent to which its representation can resemble the model's final prediction. Through extensive analysis, we show that our approach consistently outperforms existing gradient-based methods in terms of faithfulness. Additionally, our method achieves better performance in layer-wise explanations compared to the most prominent architecture-specific methods.

LibraGrad: Balancing Gradient Flow for Universally Better Vision Transformer Attributions

Nov 24, 2024

Faridoun Mehri, Mahdieh Soleymani Baghshah, Mohammad Taher Pilehvar

Abstract:Why do gradient-based explanations struggle with Transformers, and how can we improve them? We identify gradient flow imbalances in Transformers that violate FullGrad-completeness, a critical property for attribution faithfulness that CNNs naturally possess. To address this issue, we introduce LibraGrad -- a theoretically grounded post-hoc approach that corrects gradient imbalances through pruning and scaling of backward paths, without changing the forward pass or adding computational overhead. We evaluate LibraGrad using three metric families: Faithfulness, which quantifies prediction changes under perturbations of the most and least relevant features; Completeness Error, which measures attribution conservation relative to model outputs; and Segmentation AP, which assesses alignment with human perception. Extensive experiments across 8 architectures, 4 model sizes, and 4 datasets show that LibraGrad universally enhances gradient-based methods, outperforming existing white-box methods -- including Transformer-specific approaches -- across all metrics. We demonstrate superior qualitative results through two complementary evaluations: precise text-prompted region highlighting on CLIP models and accurate class discrimination between co-occurring animals on ImageNet-finetuned models -- two settings on which existing methods often struggle. LibraGrad is effective even on the attention-free MLP-Mixer architecture, indicating potential for extension to other modern architectures. Our code is freely available at https://github.com/NightMachinery/LibraGrad.
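
The Faithfulness family of metrics mentioned in the abstract can be illustrated with a generic perturbation check: replace the features an attribution ranks as most (or least) relevant and measure how much the prediction changes. The sketch below uses a toy linear model with gradient-times-input attributions; the model, the zero baseline, and the helper names `linear_model` and `perturbation_drop` are assumptions made for illustration, not LibraGrad's method or its evaluation protocol.

```python
def linear_model(weights, x):
    """Toy scalar model: a weighted sum of the input features."""
    return sum(w * v for w, v in zip(weights, x))


def perturbation_drop(model, x, attributions, k, most_relevant=True, baseline=0.0):
    """Replace k features with a baseline value and return the output change.

    Features are ranked by attribution score: the top k when most_relevant
    is True, otherwise the bottom k.
    """
    order = sorted(range(len(x)), key=lambda i: attributions[i], reverse=most_relevant)
    perturbed = list(x)
    for i in order[:k]:
        perturbed[i] = baseline
    return model(x) - model(perturbed)


weights = [3.0, 1.0, 0.5, 0.0]
x = [1.0, 1.0, 1.0, 1.0]
model = lambda inp: linear_model(weights, inp)

# Gradient-times-input attribution, which is exact for a linear model.
attributions = [w * v for w, v in zip(weights, x)]

drop_most = perturbation_drop(model, x, attributions, k=1, most_relevant=True)
drop_least = perturbation_drop(model, x, attributions, k=1, most_relevant=False)
```

Under this check, a faithful attribution yields a large change when the most relevant feature is removed and a small one when the least relevant feature is removed.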

RepMatch: Quantifying Cross-Instance Similarities in Representation Space

Oct 12, 2024

Mohammad Reza Modarres, Sina Abbasi, Mohammad Taher Pilehvar

Abstract:Advances in dataset analysis techniques have enabled more sophisticated approaches to analyzing and characterizing training data instances, often categorizing data based on attributes such as "difficulty". In this work, we introduce RepMatch, a novel method that characterizes data through the lens of similarity. RepMatch quantifies the similarity between subsets of training instances by comparing the knowledge encoded in models trained on them, overcoming the limitations of existing analysis methods that focus solely on individual instances and are restricted to within-dataset analysis. Our framework allows for a broader evaluation, enabling similarity comparisons across arbitrary subsets of instances, supporting both dataset-to-dataset and instance-to-dataset analyses. We validate the effectiveness of RepMatch across multiple NLP tasks, datasets, and models. Through extensive experimentation, we demonstrate that RepMatch can effectively compare datasets, identify more representative subsets of a dataset (that lead to better performance than randomly selected subsets of equivalent size), and uncover heuristics underlying the construction of some challenge datasets.
