Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Gabriele Dominici

Neuron-Anchored Rule Extraction for Large Language Models via Contrastive Hierarchical Ablation

May 04, 2026

Francesco Sovrano, Gabriele Dominici, Marc Langheinrich

Abstract:A key goal of explainable AI (XAI) is to express the decision logic of large language models (LLMs) in symbolic form and link it to internal mechanisms. Global rule-extraction methods typically learn symbolic surrogates without grounding rules in model circuitry, while mechanistic interpretability can connect behaviors to neuron sets but often depends on hand-crafted hypotheses and expensive neuron-level interventions. We introduce MechaRule, a pipeline that grounds rule extraction in LLM circuits by efficiently localizing sparse neurons called agonists, whose activation neutralization disrupts rule-related behaviors. MechaRule rests on two empirical observations. First, within a fixed baseline/flip regime, sparse agonist effects can be approximately monotone and saturating: a few dominant neuron activations can overtop weaker ones at coarse scales, while overlapping neurons flip many of the same examples. This motivates viewing localization as adaptive group testing driven by a regime-conditional strength predicate with confidence-guided conservative pruning, yielding Theta(k log(N/k) + k) interventions over N candidates when k << N neurons are agonists under the monotone-overtopping abstraction. Second, agonists emerge more reliably when ablations are verified through data splits aligned with close-to-faithful rule behavior; spectral splits remain a useful rule-free fallback, while unfaithful splits degrade localization. Empirically, overtopping appears mainly in learned, task-aligned regimes: on arithmetic and jailbreak tasks across Qwen2 and GPT-J, MechaRule recalls 96.8% of high-effect brute-force agonists in completed comparisons, and suppressing localized agonists reduces arithmetic accuracy and jailbreak success by up to 71.1% and 8.8%, respectively.

Via

Access Paper or Ask Questions

Federated Concept-Based Models: Interpretable models with distributed supervision

Feb 04, 2026

Dario Fenoglio, Arianna Casanova, Francesco De Santis, Mohan Li, Gabriele Dominici, Johannes Schneider, Martin Gjoreski, Marc Langheinrich, Pietro Barbiero, Giovanni De Felice

Abstract:Concept-based models (CMs) enhance interpretability in deep learning by grounding predictions in human-understandable concepts. However, concept annotations are expensive to obtain and rarely available at scale within a single data source. Federated learning (FL) could alleviate this limitation by enabling cross-institutional training that leverages concept annotations distributed across multiple data owners. Yet, FL lacks interpretable modeling paradigms. Integrating CMs with FL is non-trivial: CMs assume a fixed concept space and a predefined model architecture, whereas real-world FL is heterogeneous and non-stationary, with institutions joining over time and bringing new supervision. In this work, we propose Federated Concept-based Models (F-CMs), a new methodology for deploying CMs in evolving FL settings. F-CMs aggregate concept-level information across institutions and efficiently adapt the model architecture in response to changes in the available concept supervision, while preserving institutional privacy. Empirically, F-CMs preserve the accuracy and intervention effectiveness of training settings with full concept supervision, while outperforming non-adaptive federated baselines. Notably, F-CMs enable interpretable inference on concepts not available to a given institution, a key novelty with respect to existing approaches.

Via

Access Paper or Ask Questions

Avoiding Leakage Poisoning: Concept Interventions Under Distribution Shifts

Apr 24, 2025

Mateo Espinosa Zarlenga, Gabriele Dominici, Pietro Barbiero, Zohreh Shams, Mateja Jamnik

Figure 1 for Avoiding Leakage Poisoning: Concept Interventions Under Distribution Shifts

Figure 2 for Avoiding Leakage Poisoning: Concept Interventions Under Distribution Shifts

Figure 3 for Avoiding Leakage Poisoning: Concept Interventions Under Distribution Shifts

Figure 4 for Avoiding Leakage Poisoning: Concept Interventions Under Distribution Shifts

Abstract:In this paper, we investigate how concept-based models (CMs) respond to out-of-distribution (OOD) inputs. CMs are interpretable neural architectures that first predict a set of high-level concepts (e.g., stripes, black) and then predict a task label from those concepts. In particular, we study the impact of concept interventions (i.e., operations where a human expert corrects a CM's mispredicted concepts at test time) on CMs' task predictions when inputs are OOD. Our analysis reveals a weakness in current state-of-the-art CMs, which we term leakage poisoning, that prevents them from properly improving their accuracy when intervened on for OOD inputs. To address this, we introduce MixCEM, a new CM that learns to dynamically exploit leaked information missing from its concepts only when this information is in-distribution. Our results across tasks with and without complete sets of concept annotations demonstrate that MixCEMs outperform strong baselines by significantly improving their accuracy for both in-distribution and OOD samples in the presence and absence of concept interventions.

Via

Access Paper or Ask Questions

Deferring Concept Bottleneck Models: Learning to Defer Interventions to Inaccurate Experts

Mar 20, 2025

Andrea Pugnana, Riccardo Massidda, Francesco Giannini, Pietro Barbiero, Mateo Espinosa Zarlenga, Roberto Pellungrini, Gabriele Dominici, Fosca Giannotti, Davide Bacciu

Abstract:Concept Bottleneck Models (CBMs) are machine learning models that improve interpretability by grounding their predictions on human-understandable concepts, allowing for targeted interventions in their decision-making process. However, when intervened on, CBMs assume the availability of humans that can identify the need to intervene and always provide correct interventions. Both assumptions are unrealistic and impractical, considering labor costs and human error-proneness. In contrast, Learning to Defer (L2D) extends supervised learning by allowing machine learning models to identify cases where a human is more likely to be correct than the model, thus leading to deferring systems with improved performance. In this work, we gain inspiration from L2D and propose Deferring CBMs (DCBMs), a novel framework that allows CBMs to learn when an intervention is needed. To this end, we model DCBMs as a composition of deferring systems and derive a consistent L2D loss to train them. Moreover, by relying on a CBM architecture, DCBMs can explain why defer occurs on the final task. Our results show that DCBMs achieve high predictive performance and interpretability at the cost of deferring more to humans.

Via

Access Paper or Ask Questions

Evaluating Explanations Through LLMs: Beyond Traditional User Studies

Oct 23, 2024

Francesco Bombassei De Bona, Gabriele Dominici, Tim Miller, Marc Langheinrich, Martin Gjoreski

Figure 1 for Evaluating Explanations Through LLMs: Beyond Traditional User Studies

Figure 2 for Evaluating Explanations Through LLMs: Beyond Traditional User Studies

Figure 3 for Evaluating Explanations Through LLMs: Beyond Traditional User Studies

Figure 4 for Evaluating Explanations Through LLMs: Beyond Traditional User Studies

Abstract:As AI becomes fundamental in sectors like healthcare, explainable AI (XAI) tools are essential for trust and transparency. However, traditional user studies used to evaluate these tools are often costly, time consuming, and difficult to scale. In this paper, we explore the use of Large Language Models (LLMs) to replicate human participants to help streamline XAI evaluation. We reproduce a user study comparing counterfactual and causal explanations, replicating human participants with seven LLMs under various settings. Our results show that (i) LLMs can replicate most conclusions from the original study, (ii) different LLMs yield varying levels of alignment in the results, and (iii) experimental factors such as LLM memory and output variability affect alignment with human responses. These initial findings suggest that LLMs could provide a scalable and cost-effective way to simplify qualitative XAI evaluation.

Via

Access Paper or Ask Questions

DC3DO: Diffusion Classifier for 3D Objects

Aug 13, 2024

Nursena Koprucu, Meher Shashwat Nigam, Shicheng Xu, Biruk Abere, Gabriele Dominici, Andrew Rodriguez, Sharvaree Vadgam, Berfin Inal, Alberto Tono

Figure 1 for DC3DO: Diffusion Classifier for 3D Objects

Figure 2 for DC3DO: Diffusion Classifier for 3D Objects

Figure 3 for DC3DO: Diffusion Classifier for 3D Objects

Figure 4 for DC3DO: Diffusion Classifier for 3D Objects

Abstract:Inspired by Geoffrey Hinton emphasis on generative modeling, To recognize shapes, first learn to generate them, we explore the use of 3D diffusion models for object classification. Leveraging the density estimates from these models, our approach, the Diffusion Classifier for 3D Objects (DC3DO), enables zero-shot classification of 3D shapes without additional training. On average, our method achieves a 12.5 percent improvement compared to its multiview counterparts, demonstrating superior multimodal reasoning over discriminative approaches. DC3DO employs a class-conditional diffusion model trained on ShapeNet, and we run inferences on point clouds of chairs and cars. This work highlights the potential of generative models in 3D object classification.

Via

Access Paper or Ask Questions

Causal Concept Embedding Models: Beyond Causal Opacity in Deep Learning

May 28, 2024

Gabriele Dominici, Pietro Barbiero, Mateo Espinosa Zarlenga, Alberto Termine, Martin Gjoreski, Giuseppe Marra, Marc Langheinrich

Figure 1 for Causal Concept Embedding Models: Beyond Causal Opacity in Deep Learning

Figure 2 for Causal Concept Embedding Models: Beyond Causal Opacity in Deep Learning

Figure 3 for Causal Concept Embedding Models: Beyond Causal Opacity in Deep Learning

Figure 4 for Causal Concept Embedding Models: Beyond Causal Opacity in Deep Learning

Abstract:Causal opacity denotes the difficulty in understanding the "hidden" causal structure underlying a deep neural network's (DNN) reasoning. This leads to the inability to rely on and verify state-of-the-art DNN-based systems especially in high-stakes scenarios. For this reason, causal opacity represents a key open challenge at the intersection of deep learning, interpretability, and causality. This work addresses this gap by introducing Causal Concept Embedding Models (Causal CEMs), a class of interpretable models whose decision-making process is causally transparent by design. The results of our experiments show that Causal CEMs can: (i) match the generalization performance of causally-opaque models, (ii) support the analysis of interventional and counterfactual scenarios, thereby improving the model's causal interpretability and supporting the effective verification of its reliability and fairness, and (iii) enable human-in-the-loop corrections to mispredicted intermediate reasoning steps, boosting not just downstream accuracy after corrections but also accuracy of the explanation provided for a specific instance.

Via

Access Paper or Ask Questions

AnyCBMs: How to Turn Any Black Box into a Concept Bottleneck Model

May 26, 2024

Gabriele Dominici, Pietro Barbiero, Francesco Giannini, Martin Gjoreski, Marc Langhenirich

Figure 1 for AnyCBMs: How to Turn Any Black Box into a Concept Bottleneck Model

Figure 2 for AnyCBMs: How to Turn Any Black Box into a Concept Bottleneck Model

Figure 3 for AnyCBMs: How to Turn Any Black Box into a Concept Bottleneck Model

Figure 4 for AnyCBMs: How to Turn Any Black Box into a Concept Bottleneck Model

Abstract:Interpretable deep learning aims at developing neural architectures whose decision-making processes could be understood by their users. Among these techniqes, Concept Bottleneck Models enhance the interpretability of neural networks by integrating a layer of human-understandable concepts. These models, however, necessitate training a new model from the beginning, consuming significant resources and failing to utilize already trained large models. To address this issue, we introduce "AnyCBM", a method that transforms any existing trained model into a Concept Bottleneck Model with minimal impact on computational resources. We provide both theoretical and experimental insights showing the effectiveness of AnyCBMs in terms of classification performances and effectivenss of concept-based interventions on downstream tasks.

Via

Access Paper or Ask Questions

Federated Behavioural Planes: Explaining the Evolution of Client Behaviour in Federated Learning

May 24, 2024

Dario Fenoglio, Gabriele Dominici, Pietro Barbiero, Alberto Tonda, Martin Gjoreski, Marc Langheinrich

Abstract:Federated Learning (FL), a privacy-aware approach in distributed deep learning environments, enables many clients to collaboratively train a model without sharing sensitive data, thereby reducing privacy risks. However, enabling human trust and control over FL systems requires understanding the evolving behaviour of clients, whether beneficial or detrimental for the training, which still represents a key challenge in the current literature. To address this challenge, we introduce Federated Behavioural Planes (FBPs), a novel method to analyse, visualise, and explain the dynamics of FL systems, showing how clients behave under two different lenses: predictive performance (error behavioural space) and decision-making processes (counterfactual behavioural space). Our experiments demonstrate that FBPs provide informative trajectories describing the evolving states of clients and their contributions to the global model, thereby enabling the identification of clusters of clients with similar behaviours. Leveraging the patterns identified by FBPs, we propose a robust aggregation technique named Federated Behavioural Shields to detect malicious or noisy client models, thereby enhancing security and surpassing the efficacy of existing state-of-the-art FL defense mechanisms.

* [v1] Preprint (24 pages)

Via

Access Paper or Ask Questions

Climbing the Ladder of Interpretability with Counterfactual Concept Bottleneck Models

Feb 02, 2024

Gabriele Dominici, Pietro Barbiero, Francesco Giannini, Martin Gjoreski, Giuseppe Marra, Marc Langheinrich

Figure 1 for Climbing the Ladder of Interpretability with Counterfactual Concept Bottleneck Models

Figure 2 for Climbing the Ladder of Interpretability with Counterfactual Concept Bottleneck Models

Figure 3 for Climbing the Ladder of Interpretability with Counterfactual Concept Bottleneck Models

Figure 4 for Climbing the Ladder of Interpretability with Counterfactual Concept Bottleneck Models

Abstract:Current deep learning models are not designed to simultaneously address three fundamental questions: predict class labels to solve a given classification task (the "What?"), explain task predictions (the "Why?"), and imagine alternative scenarios that could result in different predictions (the "What if?"). The inability to answer these questions represents a crucial gap in deploying reliable AI agents, calibrating human trust, and deepening human-machine interaction. To bridge this gap, we introduce CounterFactual Concept Bottleneck Models (CF-CBMs), a class of models designed to efficiently address the above queries all at once without the need to run post-hoc searches. Our results show that CF-CBMs produce: accurate predictions (the "What?"), simple explanations for task predictions (the "Why?"), and interpretable counterfactuals (the "What if?"). CF-CBMs can also sample or estimate the most probable counterfactual to: (i) explain the effect of concept interventions on tasks, (ii) show users how to get a desired class label, and (iii) propose concept interventions via "task-driven" interventions.

Via

Access Paper or Ask Questions