Piyawat Lertvittayakumjorn

Label-Aware Automatic Verbalizer for Few-Shot Text Classification

Oct 19, 2023
Thanakorn Thaminkaew, Piyawat Lertvittayakumjorn, Peerapon Vateekul

Prompt-based learning has shown its effectiveness in few-shot text classification. One important factor in its success is a verbalizer, which translates output from a language model into a predicted class. Notably, the simplest and most widely acknowledged verbalizer employs manual labels to represent the classes. However, manual selection does not guarantee the optimality of the selected words when conditioned on the chosen language model. Therefore, we propose Label-Aware Automatic Verbalizer (LAAV), which effectively augments the manual labels to achieve better few-shot classification results. Specifically, we use the manual labels along with the conjunction "and" to induce the model to generate more effective words for the verbalizer. The experimental results on five datasets across five languages demonstrate that LAAV significantly outperforms existing verbalizers. Furthermore, our analysis reveals that LAAV suggests more relevant words compared to similar approaches, especially in mid-to-low resource languages.
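
A minimal sketch of the label-aware prompting idea described in the abstract, assuming a fill-mask model from Hugging Face Transformers; the template wording, model choice, and score aggregation are illustrative assumptions, not the paper's exact setup:

```python
# Sketch: collect candidate verbalizer words by asking an MLM to complete
# "<manual label> and [MASK]" in context. Template and aggregation are assumed.
from collections import Counter
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")
MASK = fill_mask.tokenizer.mask_token

def candidate_words(train_texts, manual_label, top_k=10):
    """Return words the model associates with the manual label via 'and'."""
    votes = Counter()
    for text in train_texts:
        prompt = f"{text} This text is about {manual_label} and {MASK}."
        for pred in fill_mask(prompt, top_k=top_k):
            votes[pred["token_str"].strip()] += pred["score"]
    return [word for word, _ in votes.most_common(top_k)]

# Example: expand the manual label "sports" from a handful of labeled texts.
print(candidate_words(["The striker scored twice in the final."], "sports"))
```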

Towards Explainable Evaluation Metrics for Machine Translation

Jun 22, 2023
Christoph Leiter, Piyawat Lertvittayakumjorn, Marina Fomicheva, Wei Zhao, Yang Gao, Steffen Eger

Unlike classical lexical overlap metrics such as BLEU, most current evaluation metrics for machine translation (for example, COMET or BERTScore) are based on black-box large language models. They often achieve strong correlations with human judgments, but recent research indicates that the lower-quality classical metrics remain dominant, one of the potential reasons being that their decision processes are more transparent. To foster more widespread acceptance of novel high-quality metrics, explainability thus becomes crucial. In this concept paper, we identify key properties as well as key goals of explainable machine translation metrics and provide a comprehensive synthesis of recent techniques, relating them to our established goals and properties. In this context, we also discuss the latest state-of-the-art approaches to explainable metrics based on generative models such as ChatGPT and GPT-4. Finally, we contribute a vision of next-generation approaches, including natural language explanations. We hope that our work can help catalyze and guide future research on explainable evaluation metrics and, indirectly, also contribute to better and more transparent machine translation systems.

* Preprint. We published an earlier version of this paper (arXiv:2203.11131) under a different title. Both versions consider the conceptualization of explainable metrics and are overall similar. However, the new version puts a stronger emphasis on the survey of approaches for explaining MT metrics, including the latest LLM-based approaches.

Argumentative Explanations for Pattern-Based Text Classifiers

May 22, 2022
Piyawat Lertvittayakumjorn, Francesca Toni

Recent works in Explainable AI mostly address the transparency issue of black-box models or create explanations for any kind of models (i.e., they are model-agnostic), while leaving explanations of interpretable models largely underexplored. In this paper, we fill this gap by focusing on explanations for a specific interpretable model, namely pattern-based logistic regression (PLR) for binary text classification. We do so because, albeit interpretable, PLR is challenging when it comes to explanations. In particular, we found that a standard way to extract explanations from this model does not consider relations among the features, making the explanations hardly plausible to humans. Hence, we propose AXPLR, a novel explanation method using (forms of) computational argumentation to generate explanations (for outputs computed by PLR) which unearth model agreements and disagreements among the features. Specifically, we use computational argumentation as follows: we see features (patterns) in PLR as arguments in a form of quantified bipolar argumentation frameworks (QBAFs) and extract attacks and supports between arguments based on specificity of the arguments; we understand logistic regression as a gradual semantics for these QBAFs, used to determine the arguments' dialectic strength; and we study standard properties of gradual semantics for QBAFs in the context of our argumentative re-interpretation of PLR, sanctioning its suitability for explanatory purposes. We then show how to extract intuitive explanations (for outputs computed by PLR) from the constructed QBAFs. Finally, we conduct an empirical evaluation and two experiments in the context of human-AI collaboration to demonstrate the advantages of our resulting AXPLR method.
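
As a rough, simplified sketch of the argumentative reading described above, assuming patterns are represented as token sets, base scores are the PLR coefficients, and specificity is approximated by set inclusion (AXPLR's actual QBAF construction and gradual semantics are richer than this):

```python
# Simplified sketch: PLR patterns become arguments; a more specific pattern
# supports a more general one when their weights agree in sign and attacks it
# otherwise. Strength propagation (the gradual semantics) is omitted here.
from dataclasses import dataclass, field

@dataclass
class Argument:
    pattern: frozenset                  # tokens the pattern requires
    weight: float                       # PLR coefficient, used as the base score
    attackers: list = field(default_factory=list)
    supporters: list = field(default_factory=list)

def build_qbaf(patterns_with_weights):
    args = [Argument(frozenset(p), w) for p, w in patterns_with_weights]
    for general in args:
        for specific in args:
            if general is not specific and general.pattern < specific.pattern:
                agree = general.weight * specific.weight > 0
                (general.supporters if agree else general.attackers).append(specific)
    return args

qbaf = build_qbaf([({"not"}, -0.4), ({"bad"}, -0.7), ({"not", "bad"}, 0.9)])
for arg in qbaf:
    print(sorted(arg.pattern),
          "supporters:", [sorted(s.pattern) for s in arg.supporters],
          "attackers:", [sorted(a.pattern) for a in arg.attackers])
```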

Towards Explainable Evaluation Metrics for Natural Language Generation

Mar 21, 2022
Christoph Leiter, Piyawat Lertvittayakumjorn, Marina Fomicheva, Wei Zhao, Yang Gao, Steffen Eger

Unlike classical lexical overlap metrics such as BLEU, most current evaluation metrics (such as BERTScore or MoverScore) are based on black-box language models such as BERT or XLM-R. They often achieve strong correlations with human judgments, but recent research indicates that the lower-quality classical metrics remain dominant, one of the potential reasons being that their decision processes are transparent. To foster more widespread acceptance of the novel high-quality metrics, explainability thus becomes crucial. In this concept paper, we identify key properties and propose key goals of explainable machine translation evaluation metrics. We also provide a synthesizing overview of recent approaches for explainable machine translation metrics and discuss how they relate to those goals and properties. Further, we conduct our own novel experiments, which (among other findings) show that current adversarial NLP techniques are unsuitable for automatically identifying limitations of high-quality black-box evaluation metrics, as they are not meaning-preserving. Finally, we provide a vision of future approaches to explainable evaluation metrics and their evaluation. We hope that our work can help catalyze and guide future research on explainable evaluation metrics and, indirectly, also contribute to better and more transparent text generation systems.

The Eval4NLP Shared Task on Explainable Quality Estimation: Overview and Results

Oct 08, 2021
Marina Fomicheva, Piyawat Lertvittayakumjorn, Wei Zhao, Steffen Eger, Yang Gao

In this paper, we introduce the Eval4NLP-2021 shared task on explainable quality estimation. Given a source-translation pair, this shared task requires participating systems not only to provide a sentence-level score indicating the overall quality of the translation, but also to explain this score by identifying the words that negatively impact translation quality. We present the data, annotation guidelines and evaluation setup of the shared task, describe the six participating systems, and analyze the results. To the best of our knowledge, this is the first shared task on explainable NLP evaluation metrics. Datasets and results are available at https://github.com/eval4nlp/SharedTask2021.
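
To make the task input/output concrete, here is an illustrative sketch of a prediction for one source-translation pair; the field names and the toy scoring heuristic are assumptions, and the official data format and evaluation scripts are in the linked repository:

```python
# Illustrative only: a sentence-level quality score plus one score per target
# word marking words suspected to hurt translation quality. Field names and the
# toy heuristic are assumptions, not the official shared-task format.
from dataclasses import dataclass
from typing import List

@dataclass
class ExplainableQEPrediction:
    sentence_score: float            # overall quality estimate for the translation
    target_word_scores: List[float]  # higher = more likely to contribute to an error

def predict(source: str, translation: str) -> ExplainableQEPrediction:
    # Toy stand-in for a real QE model: flag target words copied verbatim from
    # the source as potential untranslated-word errors.
    source_tokens = {tok.lower() for tok in source.split()}
    word_scores = [1.0 if tok.lower() in source_tokens else 0.0
                   for tok in translation.split()]
    quality = 1.0 - sum(word_scores) / max(len(word_scores), 1)
    return ExplainableQEPrediction(sentence_score=quality, target_word_scores=word_scores)

print(predict("Das ist ein kleiner Test .", "This is a kleiner test ."))
```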

Explanation-Based Human Debugging of NLP Models: A Survey

Apr 30, 2021
Piyawat Lertvittayakumjorn, Francesca Toni

To fix a bug in a program, we need to locate where the bug is, understand why it causes the problem, and patch the code accordingly. This process becomes harder when the program is a trained machine learning model and even harder for opaque deep learning models. In this survey, we review papers that exploit explanations to enable humans to debug NLP models. We call this problem explanation-based human debugging (EBHD). In particular, we categorize and discuss existing works along three main dimensions of EBHD (the bug context, the workflow, and the experimental setting), compile findings on how EBHD components affect human debuggers, and highlight open problems that could be future research directions.

GrASP: A Library for Extracting and Exploring Human-Interpretable Textual Patterns

Apr 08, 2021
Piyawat Lertvittayakumjorn, Leshem Choshen, Eyal Shnarch, Francesca Toni

Data exploration is an important step of every data science and machine learning project, including those involving textual data. We provide a Python library for GrASP, an existing algorithm for drawing patterns from textual data. The library is equipped with a web-based interface empowering human users to conveniently explore the data and the extracted patterns. We also demonstrate the use of the library in two settings (spam detection and argument mining) and discuss future deployments of the library, e.g., beyond textual data exploration.

* 4 pages, 2 figures 
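
As a rough illustration of what a human-interpretable textual pattern of this kind can look like, here is a toy matcher; it does not use the library's actual API, and the attribute names are hypothetical:

```python
# Hypothetical illustration (not the GrASP library API): a pattern is a short
# sequence of attribute sets, and a text matches if some window of consecutive
# tokens satisfies each attribute set in order.
def token_attributes(token: str) -> set:
    attrs = {f"TEXT:{token.lower()}"}
    if token.istitle():
        attrs.add("SHAPE:capitalized")
    if token.isdigit():
        attrs.add("SHAPE:digit")
    return attrs

def matches(pattern, text: str) -> bool:
    tokens = [token_attributes(tok) for tok in text.split()]
    window = len(pattern)
    return any(all(required <= tokens[i + j] for j, required in enumerate(pattern))
               for i in range(len(tokens) - window + 1))

# Spam-detection flavoured pattern: the word "win" followed by a capitalized token.
pattern = [{"TEXT:win"}, {"SHAPE:capitalized"}]
print(matches(pattern, "Click here to win Amazing prizes"))  # True
print(matches(pattern, "They did not win the game"))         # False
```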

DAX: Deep Argumentative eXplanation for Neural Networks

Dec 10, 2020
Emanuele Albini, Piyawat Lertvittayakumjorn, Antonio Rago, Francesca Toni

Despite the rapid growth in attention on eXplainable AI (XAI) of late, explanations in the literature provide little insight into the actual functioning of Neural Networks (NNs), significantly limiting their transparency. We propose a methodology for explaining NNs, providing transparency about their inner workings, by utilising computational argumentation (a form of symbolic AI offering reasoning abstractions for a variety of settings where opinions matter) as the scaffolding underpinning Deep Argumentative eXplanations (DAXs). We define three DAX instantiations (for various neural architectures and tasks) and evaluate them empirically in terms of stability, computational cost, and importance of depth. We also conduct human experiments with DAXs for text classification models, indicating that they are comprehensible to humans and align with their judgement, while also being competitive, in terms of user acceptance, with existing approaches to XAI that also have an argumentative spirit.

* 19 pages, 15 figures 

FIND: Human-in-the-Loop Debugging Deep Text Classifiers

Oct 10, 2020
Piyawat Lertvittayakumjorn, Lucia Specia, Francesca Toni

Since obtaining a perfect training dataset (i.e., a dataset which is considerably large, unbiased, and well-representative of unseen cases) is hardly possible, many real-world text classifiers are trained on the available, yet imperfect, datasets. These classifiers are thus likely to have undesirable properties. For instance, they may have biases against some sub-populations or may not work effectively in the wild due to overfitting. In this paper, we propose FIND -- a framework which enables humans to debug deep learning text classifiers by disabling irrelevant hidden features. Experiments show that by using FIND, humans can improve CNN text classifiers which were trained under different types of imperfect datasets (including datasets with biases and datasets with dissimilar train-test distributions).

* 17 pages including appendices; To appear at EMNLP 2020 
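
A minimal PyTorch sketch of the disabling idea, assuming a CNN classifier whose final-layer features can be switched off by a human-set mask; the architecture details below are assumptions, and FIND additionally provides explanations (e.g., per-feature word clouds) to help humans decide which features to disable:

```python
# Sketch: a multiplicative 0/1 mask over hidden features lets a human switch off
# features judged irrelevant or biased; the model can then be kept or fine-tuned
# with the mask fixed. Architecture and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class MaskedTextCNN(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int = 100,
                 n_filters: int = 64, n_classes: int = 2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size=3, padding=1)
        self.out = nn.Linear(n_filters, n_classes)
        # One on/off switch per hidden feature; a buffer, not a trained parameter.
        self.register_buffer("mask", torch.ones(n_filters))

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.emb(token_ids).transpose(1, 2)                # (batch, emb_dim, seq_len)
        features = torch.relu(self.conv(x)).max(dim=2).values  # (batch, n_filters)
        return self.out(features * self.mask)                  # disabled features contribute nothing

model = MaskedTextCNN(vocab_size=10_000)
model.mask[[3, 17]] = 0.0   # a human judged hidden features 3 and 17 irrelevant
logits = model(torch.randint(0, 10_000, (2, 20)))
print(logits.shape)          # torch.Size([2, 2])
```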

Human-grounded Evaluations of Explanation Methods for Text Classification

Aug 29, 2019
Piyawat Lertvittayakumjorn, Francesca Toni

Due to the black-box nature of deep learning models, methods for explaining the models' results are crucial to gain trust from humans and support collaboration between AIs and humans. In this paper, we consider several model-agnostic and model-specific explanation methods for CNNs for text classification and conduct three human-grounded evaluations, focusing on different purposes of explanations: (1) revealing model behavior, (2) justifying model predictions, and (3) helping humans investigate uncertain predictions. The results highlight dissimilar qualities of the various explanation methods we consider and show the degree to which these methods could serve for each purpose.

* 17 pages including appendices; accepted to appear at EMNLP-IJCNLP 2019 