Jey Han Lau

Interaction Matters: An Evaluation Framework for Interactive Dialogue Assessment on English Second Language Conversations

Jul 09, 2024

Rena Gao, Carsten Roever, Jey Han Lau

Abstract: We present an evaluation framework for interactive dialogue assessment in the context of English as a Second Language (ESL) speakers. Our framework collects dialogue-level interactivity labels (e.g., topic management; 4 labels in total) and micro-level span features (e.g., backchannels; 17 features in total). Given our annotated data, we study how the micro-level features influence the (higher-level) interactivity quality of ESL dialogues by constructing various machine learning-based models. Our results demonstrate that certain micro-level features, such as reference words (e.g., she, her, he), strongly correlate with interactivity quality, revealing new insights about the interaction between higher-level dialogue quality and lower-level linguistic signals. Our framework also provides a means to assess ESL communication, which is useful for language assessment.

Factual Dialogue Summarization via Learning from Large Language Models

Jun 20, 2024

Rongxin Zhu, Jey Han Lau, Jianzhong Qi

Abstract: Factual consistency is an important quality in dialogue summarization. Large language model (LLM)-based automatic text summarization models generate more factually consistent summaries compared to those by smaller pretrained language models, but they face deployment challenges in real-world applications due to privacy or resource constraints. In this paper, we investigate the use of symbolic knowledge distillation to improve the factual consistency of smaller pretrained models for dialogue summarization. We employ zero-shot learning to extract symbolic knowledge from LLMs, generating both factually consistent (positive) and inconsistent (negative) summaries. We then apply two contrastive learning objectives on these summaries to enhance smaller summarization models. Experiments with BART, PEGASUS, and Flan-T5 indicate that our approach surpasses strong baselines that rely on complex data augmentation strategies. Our approach achieves better factual consistency while maintaining coherence, fluency, and relevance, as confirmed by various automatic evaluation metrics. We also provide access to the data and code to facilitate future research.

Evaluating Transparency of Machine Generated Fact Checking Explanations

Jun 18, 2024

Rui Xing, Timothy Baldwin, Jey Han Lau

Abstract: An important factor when it comes to generating fact-checking explanations is the selection of evidence: intuitively, high-quality explanations can only be generated given the right evidence. In this work, we investigate the impact of human-curated vs. machine-selected evidence for explanation generation using large language models. To assess the quality of explanations, we focus on transparency (whether an explanation cites sources properly) and utility (whether an explanation is helpful in clarifying a claim). Surprisingly, we found that large language models generate similar or higher quality explanations using machine-selected evidence, suggesting carefully curated evidence (by humans) may not be necessary. That said, even with the best model, the generated explanations are not always faithful to the sources, suggesting further room for improvement in explanation generation for fact-checking.

Exploring Multi-Document Information Consolidation for Scientific Sentiment Summarization

Feb 28, 2024

Miao Li, Jey Han Lau, Eduard Hovy

Abstract: Modern natural language generation systems with LLMs exhibit the capability to generate a plausible summary of multiple documents; however, it is uncertain if models truly possess the ability of information consolidation to generate summaries, especially on those source documents with opinionated information. To make scientific sentiment summarization more grounded, we hypothesize that in peer review human meta-reviewers follow a three-layer framework of sentiment consolidation to write meta-reviews and it represents the logic of summarizing scientific sentiments in meta-review generation. The framework is validated via human annotation. Based on the framework, we propose evaluation metrics to assess the quality of generated meta-reviews, and we find that the hypothesis of the sentiment consolidation framework works out empirically when we incorporate it as prompts for LLMs to generate meta-reviews in extensive experiments.

* 18 pages

CMA-R: Causal Mediation Analysis for Explaining Rumour Detection

Feb 13, 2024

Lin Tian, Xiuzhen Zhang, Jey Han Lau

Abstract: We apply causal mediation analysis to explain the decision-making process of neural models for rumour detection on Twitter. Interventions at the input and network level reveal the causal impacts of tweets and words in the model output. We find that our approach CMA-R -- Causal Mediation Analysis for Rumour detection -- identifies salient tweets that explain model predictions and shows strong agreement with human judgements for critical tweets determining the truthfulness of stories. CMA-R can further highlight causally impactful words in the salient tweets, providing another layer of interpretability and transparency into these blackbox rumour detection systems. Code is available at: https://github.com/ltian678/cma-r.

* 9 pages, 7 figures, Accepted by EACL 2024 Findings

Unsupervised Lexical Simplification with Context Augmentation

Nov 01, 2023

Takashi Wada, Timothy Baldwin, Jey Han Lau

Abstract: We propose a new unsupervised lexical simplification method that uses only monolingual data and pre-trained language models. Given a target word and its context, our method generates substitutes based on the target context and also additional contexts sampled from monolingual data. We conduct experiments in English, Portuguese, and Spanish on the TSAR-2022 shared task, and show that our model substantially outperforms other unsupervised systems across all languages. We also establish a new state-of-the-art by ensembling our model with GPT-3.5. Lastly, we evaluate our model on the SWORDS lexical substitution data set, achieving a state-of-the-art result.

* 12 pages; accepted for the Findings of EMNLP 2023

Unsupervised Paraphrasing of Multiword Expressions

Jun 02, 2023

Takashi Wada, Yuji Matsumoto, Timothy Baldwin, Jey Han Lau

Abstract: We propose an unsupervised approach to paraphrasing multiword expressions (MWEs) in context. Our model employs only monolingual corpus data and pre-trained language models (without fine-tuning), and does not make use of any external resources such as dictionaries. We evaluate our method on the SemEval 2022 idiomatic semantic text similarity task, and show that it outperforms all unsupervised systems and rivals supervised systems.

* 13 pages; accepted for Findings of ACL 2023

Annotating and Detecting Fine-grained Factual Errors for Dialogue Summarization

May 26, 2023

Rongxin Zhu, Jianzhong Qi, Jey Han Lau

Abstract: A series of datasets and models have been proposed for summaries generated for well-formatted documents such as news articles. Dialogue summaries, however, have been underexplored. In this paper, we present the first dataset with fine-grained factual error annotations, named DIASUMFACT. We define fine-grained factual error detection as a sentence-level multi-label classification problem, and we evaluate two state-of-the-art (SOTA) models on our dataset. Both models yield sub-optimal results, with a macro-averaged F1 score of around 0.25 over 6 error classes. We further propose an unsupervised model, ENDERANKER, via candidate ranking using pretrained encoder-decoder models. Our model performs on par with the SOTA models while requiring fewer resources. These observations confirm the challenges in detecting factual errors from dialogue summaries, which call for further studies, for which our dataset and results offer a solid foundation.

* Accepted in ACL 2023
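
For readers unfamiliar with the metric quoted in the abstract above, the following is a minimal sketch of macro-averaged F1 for sentence-level multi-label classification. It is not the authors' evaluation code. The class names `A`, `B`, `C` are placeholders (the abstract does not list the six error classes), and scoring a class with no gold and no predicted instances as 0 is an assumption made here.

```python
from typing import List, Set


def macro_f1(gold: List[Set[str]], pred: List[Set[str]], classes: List[str]) -> float:
    """Macro-averaged F1: compute F1 per class, then take the unweighted mean."""
    f1_scores = []
    for c in classes:
        # Count per-class true positives, false positives, false negatives
        # over all sentences; each sentence carries a set of labels.
        tp = sum(1 for g, p in zip(gold, pred) if c in g and c in p)
        fp = sum(1 for g, p in zip(gold, pred) if c not in g and c in p)
        fn = sum(1 for g, p in zip(gold, pred) if c in g and c not in p)
        denom = 2 * tp + fp + fn
        # Assumption: a class with no gold and no predicted instances scores 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(classes)


# Placeholder labels; four sentences, each with a (possibly empty) label set.
classes = ["A", "B", "C"]
gold = [{"A"}, {"B"}, {"A", "B"}, set()]
pred = [{"A"}, {"A"}, {"B"}, {"C"}]
score = macro_f1(gold, pred, classes)
```

Because every class contributes equally to the mean, rare error classes weigh as much as frequent ones, which is one reason a macro-averaged score can be low even when frequent classes are detected well.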

Towards Summarizing Multiple Documents with Hierarchical Relationships

May 02, 2023

Miao Li, Eduard Hovy, Jey Han Lau

Abstract: Most existing multi-document summarization (MDS) datasets lack human-generated and genuine (i.e., not synthetic) summaries or source documents with explicit inter-document relationships that a summary must capture. To enhance the capabilities of MDS systems we present PeerSum, a novel dataset for generating meta-reviews of scientific papers, where the meta-reviews are highly abstractive and genuine summaries of reviews and corresponding discussions. These source documents have rich inter-document relationships of an explicit hierarchical structure with cross-references and often feature conflicts. As there is a scarcity of research that incorporates hierarchical relationships into MDS systems through attention manipulation on pre-trained language models, we additionally present Rammer (Relationship-aware Multi-task Meta-review Generator), a meta-review generation model that uses sparse attention based on the hierarchical relationships and a multi-task objective that predicts several metadata features in addition to the standard text generation objective. Our experimental results show that PeerSum is a challenging dataset, and Rammer outperforms other strong baseline MDS models under various evaluation metrics.

* 10 pages

DeltaScore: Evaluating Story Generation with Differentiating Perturbations

Mar 15, 2023

Zhuohan Xie, Miao Li, Trevor Cohn, Jey Han Lau

Abstract: Various evaluation metrics exist for natural language generation tasks, but they have limited utility for story generation since they generally do not correlate well with human judgments and do not measure fine-grained story aspects, such as fluency versus relatedness, as they are intended to assess overall generation quality. In this paper, we propose DeltaScore, an approach that utilizes perturbation to evaluate fine-grained story aspects. Our core idea is based on the hypothesis that the better the story performs in a specific aspect (e.g., fluency), the more it will be affected by a particular perturbation (e.g., introducing typos). To measure the impact, we calculate the likelihood difference between the pre- and post-perturbation stories using a language model. We evaluate DeltaScore against state-of-the-art model-based and traditional similarity-based metrics across multiple story domains, and investigate its correlation with human judgments on five fine-grained story aspects: fluency, coherence, relatedness, logicality, and interestingness. Our results demonstrate that DeltaScore performs impressively in evaluating fine-grained story aspects, and we discovered a striking outcome where a specific perturbation appears to be highly effective in measuring most aspects.
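
The likelihood-difference idea in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: a toy add-one-smoothed unigram model stands in for the language model, and `swap_typo` is a hypothetical typo perturbation invented for this example. Only the overall shape (score the story, perturb it, score it again, take the difference) follows the abstract.

```python
import math
from collections import Counter
from typing import Callable, List


def train_unigram(corpus: List[str]) -> Callable[[str], float]:
    """Toy stand-in for a language model: add-one-smoothed unigram log-likelihood."""
    counts = Counter(tok for line in corpus for tok in line.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen tokens

    def log_likelihood(text: str) -> float:
        return sum(math.log((counts[tok] + 1) / (total + vocab)) for tok in text.split())

    return log_likelihood


def swap_typo(text: str) -> str:
    """Hypothetical perturbation: swap the first two letters of words longer than 3."""
    return " ".join(w[1] + w[0] + w[2:] if len(w) > 3 else w for w in text.split())


def delta_score(story: str, perturb: Callable[[str], str],
                log_likelihood: Callable[[str], float]) -> float:
    """Likelihood difference between the original and the perturbed story."""
    return log_likelihood(story) - log_likelihood(perturb(story))


corpus = ["the knight rode to the castle", "the dragon slept in the castle"]
log_likelihood = train_unigram(corpus)

fluent = "the knight rode to the castle"
noisy = "xqzv blorp the"  # words the toy model has never seen

d_fluent = delta_score(fluent, swap_typo, log_likelihood)
d_noisy = delta_score(noisy, swap_typo, log_likelihood)
```

In this toy setting the fluent story loses likelihood when typos are introduced, while the already noisy text is unaffected, which mirrors the hypothesis stated in the abstract. A real setup would replace `train_unigram` with a pretrained language model.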
