Edoardo Mosca
IFAN: An Explainability-Focused Interaction Framework for Humans and NLP Models

Mar 06, 2023
Edoardo Mosca, Daryna Dementieva, Tohid Ebrahim Ajdari, Maximilian Kummeth, Kirill Gringauz, Georg Groh

Interpretability and human oversight are fundamental pillars of deploying complex NLP models in real-world applications. However, applying explainability and human-in-the-loop methods requires technical proficiency. Despite existing toolkits for model understanding and analysis, options for integrating human feedback remain limited. We propose IFAN, a framework for real-time explanation-based interaction with NLP models. Through IFAN's interface, users can provide feedback on selected model explanations, which is then integrated through adapter layers to align the model with human rationale. We show the system to be effective in debiasing a hate speech classifier with minimal performance loss. IFAN also offers a visual admin system and API to manage models (and datasets) as well as control access rights. A demo is live at https://ifan.ml/
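The abstract describes feedback being integrated through adapter layers while the base model stays put. Below is a minimal PyTorch sketch of that general idea, assuming a standard bottleneck-adapter design; the `Adapter` and `AdaptedClassifier` classes and the placeholder encoder are illustrative stand-ins, not IFAN's actual implementation.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual."""

    def __init__(self, hidden_size: int, bottleneck: int = 32):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.ReLU()
        # Near-identity initialization: the adapted model behaves like the
        # frozen base model until feedback-driven training moves the weights.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

class AdaptedClassifier(nn.Module):
    """Frozen encoder + trainable adapter + classification head."""

    def __init__(self, encoder: nn.Module, hidden_size: int, num_labels: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():  # base model stays frozen
            p.requires_grad = False
        self.adapter = Adapter(hidden_size)
        self.head = nn.Linear(hidden_size, num_labels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.adapter(self.encoder(x)))

# Toy usage: only the adapter and head receive gradients, so feedback
# updates are cheap and leave the original model weights untouched.
encoder = nn.Sequential(nn.Linear(128, 128), nn.Tanh())  # placeholder encoder
model = AdaptedClassifier(encoder, hidden_size=128, num_labels=2)
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)
```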

* ACL Demo 2023 Submission 

"That Is a Suspicious Reaction!": Interpreting Logits Variation to Detect NLP Adversarial Attacks

Apr 10, 2022
Edoardo Mosca, Shreyash Agarwal, Javier Rando-Ramirez, Georg Groh

Adversarial attacks are a major challenge for current machine learning research. These purposely crafted inputs fool even the most advanced models, precluding their deployment in safety-critical applications. Extensive research in computer vision has been carried out to develop reliable defense strategies, but the issue remains far less explored in natural language processing. Our work presents a model-agnostic detector of adversarial text examples. The approach identifies patterns in the logits of the target classifier when the input text is perturbed. The proposed detector improves on the current state-of-the-art performance in recognizing adversarial inputs and exhibits strong generalization across different NLP models, datasets, and word-level attacks.
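The detector reads how the target classifier's logits react when the input text is perturbed word by word. The sketch below is one plausible feature extractor for that idea, assuming a hypothetical `classify` helper that maps a string to a 1-D tensor of class logits; it is illustrative rather than the authors' exact pipeline.

```python
import torch

def logit_variation_features(text: str, classify, mask_token: str = "[UNK]",
                             top_k: int = 10) -> list[float]:
    """Mask each word in turn and record how much the predicted class's
    logit margin drops relative to the unperturbed input."""
    words = text.split()
    base_logits = classify(text)        # 1-D tensor of class logits (assumed)
    pred = int(base_logits.argmax())
    reactions = []
    for i in range(len(words)):
        perturbed = " ".join(words[:i] + [mask_token] + words[i + 1:])
        logits = classify(perturbed)
        # Margin of the originally predicted class over the runner-up class.
        others = torch.cat([logits[:pred], logits[pred + 1:]])
        reactions.append((logits[pred] - others.max()).item())
    # Fixed-length feature vector: smallest margins (strongest reactions)
    # first, zero-padded for short inputs.
    reactions.sort()
    return (reactions + [0.0] * top_k)[:top_k]
```

These fixed-length vectors can then be used to train any off-the-shelf classifier (e.g., a random forest) to separate clean from adversarial inputs, which is what makes such a detector agnostic to the target model's architecture.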

* ACL 2022 