Abstract: Large Language Models (LLMs) are increasingly used as automated evaluators in natural language generation, yet it remains unclear whether they can accurately replicate human judgments of error severity. In this study, we systematically compare human and LLM assessments of image descriptions containing controlled semantic errors. We extend the experimental framework of van Miltenburg et al. (2020) to both unimodal (text-only) and multimodal (text + image) settings, evaluating four error types: age, gender, clothing type, and clothing colour. Our findings reveal that humans assign varying levels of severity to different error types, with visual context significantly amplifying perceived severity for colour and type errors. Notably, most LLMs assign low scores to gender errors but disproportionately high scores to colour errors, unlike humans, who judge both as highly severe but for different reasons. This suggests that these models may have internalised social norms influencing gender judgments but lack the perceptual grounding to emulate human sensitivity to colour, which is shaped by distinct neural mechanisms. Only one of the evaluated LLMs, Doubao, replicates the human-like ranking of error severity, but it fails to distinguish between error types as clearly as humans. Surprisingly, DeepSeek-V3, a unimodal LLM, achieves the highest alignment with human judgments across both unimodal and multimodal conditions, outperforming even state-of-the-art multimodal models.
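A minimal sketch of how such human-LLM alignment can be quantified. The abstract does not name a metric, so the use of Spearman rank correlation and the ratings below are illustrative assumptions only:

from scipy.stats import spearmanr

# Mean severity ratings per error type: age, gender, clothing type, colour.
# These numbers are invented placeholders, not data from the study.
human_severity = [2.1, 4.5, 3.0, 4.4]  # hypothetical human means
llm_severity = [2.0, 1.8, 3.2, 4.9]    # hypothetical LLM means

rho, p_value = spearmanr(human_severity, llm_severity)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")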
Abstract: Emotional Support Conversations (ESC) are crucial for providing empathy, validation, and actionable guidance to individuals in distress. However, existing definitions of the ESC task oversimplify the structure of supportive responses, typically modelling them as single strategy-utterance pairs. Through a detailed corpus analysis of the ESConv dataset, we identify a common yet previously overlooked phenomenon: emotional supporters often employ multiple strategies consecutively within a single turn. We formally redefine the ESC task to account for this, proposing a revised formulation that requires generating the full sequence of strategy-utterance pairs given a dialogue history. To support this redefined task, we introduce several modelling approaches, including supervised deep learning models and large language models. Our experiments show that, under this new formulation, state-of-the-art LLMs outperform both supervised models and human supporters. Notably, contrary to some earlier findings, we observe that LLMs frequently ask questions and provide suggestions, demonstrating more holistic support capabilities.
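To make the redefined task concrete, here is a sketch of its input/output contract; the type names and strategy labels are hypothetical, loosely following ESConv-style annotations:

from dataclasses import dataclass

@dataclass
class StrategyUtterance:
    strategy: str   # e.g., "Question", "Reflection of Feelings", "Providing Suggestions"
    utterance: str  # the supporter's text realising that strategy

def generate_support_turn(dialogue_history: list[str]) -> list[StrategyUtterance]:
    """Stub for a model that emits a full supporter turn as a strategy sequence."""
    # A real system would condition a generator on dialogue_history;
    # this placeholder only shows the expected output shape.
    return [
        StrategyUtterance("Reflection of Feelings", "That sounds really stressful."),
        StrategyUtterance("Providing Suggestions", "Could you talk to your manager about the workload?"),
    ]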
Abstract: We present the system developed by the Central China Normal University (CCNU) team for the Mu-SHROOM shared task, which focuses on identifying hallucinations in question-answering systems across 14 different languages. Our approach leverages multiple Large Language Models (LLMs) with distinct areas of expertise, employing them in parallel to annotate hallucinations, effectively simulating a crowdsourcing annotation process. Furthermore, each LLM-based annotator integrates both internal and external knowledge related to the input during the annotation process. Using the open-source LLM DeepSeek-V3, our system achieves the top ranking (\#1) for Hindi data and secures a Top-5 position in seven other languages. In this paper, we also discuss unsuccessful approaches explored during our development process and share key insights gained from participating in this shared task.
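A minimal sketch of the parallel "LLM-as-crowd" annotation idea; query_llm is a hypothetical wrapper around each model's API, and the majority-vote aggregation is an assumption rather than the team's actual rule:

from collections import Counter
from concurrent.futures import ThreadPoolExecutor

ANNOTATORS = ["deepseek-v3", "model-b", "model-c"]  # placeholder model identifiers

def query_llm(model: str, question: str, answer: str) -> list[str]:
    """Hypothetical API wrapper: returns the answer spans the model flags as hallucinated."""
    raise NotImplementedError

def annotate(question: str, answer: str) -> list[str]:
    # Query all annotator models in parallel, simulating a crowd of annotators.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda m: query_llm(m, question, answer), ANNOTATORS))
    votes = Counter(span for spans in results for span in spans)
    # Keep only the spans flagged by a strict majority of annotators.
    return [span for span, n in votes.items() if n > len(ANNOTATORS) / 2]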
Abstract: Detection of correlation in a pair of random graphs is a fundamental statistical and computational problem that has been extensively studied in recent years. In this work, we consider a pair of correlated (sparse) stochastic block models $\mathcal{S}(n,\tfrac{\lambda}{n};k,\epsilon;s)$ that are subsampled from a common parent stochastic block model $\mathcal{S}(n,\tfrac{\lambda}{n};k,\epsilon)$ with $k=O(1)$ symmetric communities, average degree $\lambda=O(1)$, divergence parameter $\epsilon$, and subsampling probability $s$. For the detection problem of distinguishing this model from a pair of independent Erd\H{o}s-R\'enyi graphs with the same edge density $\mathcal{G}(n,\tfrac{\lambda s}{n})$, we focus on tests based on \emph{low-degree polynomials} of the entries of the adjacency matrices, and we determine the threshold that separates the easy and hard regimes. More precisely, we show that this class of tests can distinguish these two models if and only if $s> \min \{ \sqrt{\alpha}, \frac{1}{\lambda \epsilon^2} \}$, where $\alpha\approx 0.338$ is Otter's constant and $\frac{1}{\lambda \epsilon^2}$ is the Kesten-Stigum threshold. Our proof of low-degree hardness is based on a conditional variant of the low-degree likelihood calculation.
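As a quick numerical illustration of this threshold (the parameter values are arbitrary, not taken from the paper): with $\lambda = 4$ and $\epsilon = \tfrac{1}{2}$,
\[
\frac{1}{\lambda\epsilon^2} = 1, \qquad \min\Bigl\{\sqrt{\alpha},\, \frac{1}{\lambda\epsilon^2}\Bigr\} = \sqrt{0.338} \approx 0.581,
\]
so low-degree tests succeed precisely when the subsampling probability satisfies $s > 0.581$. For larger $\lambda\epsilon^2$, i.e., a stronger community signal, the Kesten-Stigum term $\frac{1}{\lambda \epsilon^2}$ drops below $\sqrt{\alpha}$ and becomes the binding constraint instead.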
Abstract: Theoretical linguists have suggested that some languages (e.g., Chinese and Japanese) are "cooler" than other languages, based on the observation that the intended meaning of phrases in these languages depends more on their contexts. As a result, many expressions in these languages are shortened, and their meaning is inferred from the context. In this paper, we focus on the omission of plurality and definiteness markers in Chinese noun phrases (NPs) to investigate how predictable their intended meaning is given the context. To this end, we built a corpus of Chinese NPs, each accompanied by its corresponding context and by labels indicating its singularity/plurality and definiteness/indefiniteness. Our corpus assessments and analyses suggest that Chinese speakers indeed drop plurality and definiteness markers very frequently. Building on the corpus, we train a bank of computational models, ranging from classic machine learning classifiers to state-of-the-art pre-trained language models, to predict the plurality and definiteness of each NP. We report on the performance of these models and analyse their behaviours.
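As a sketch of the classic-machine-learning end of such a model bank, the following trains a character n-gram classifier for the plurality label; the two training examples are invented placeholders, not corpus data:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

contexts = [
    "他买了一本书",    # "He bought a book" -> singular NP
    "他们都带了行李",  # "They all brought luggage" -> plural NP
]
labels = ["singular", "plural"]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 2)),  # character n-grams suit unsegmented Chinese
    LogisticRegression(),
)
clf.fit(contexts, labels)
print(clf.predict(["他拿了一个苹果"]))  # hypothetical test context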
Abstract: Recently, a human evaluation study of Referring Expression Generation (REG) models reached an unexpected conclusion: on \textsc{webnlg}, Referring Expressions (REs) generated by state-of-the-art neural models were indistinguishable not only from the REs in \textsc{webnlg} but also from REs generated by a simple rule-based system. Here, we argue that this finding could stem from the use of a purely ratings-based human evaluation (a common practice in Natural Language Generation). To investigate this, we propose an intrinsic task-based evaluation for REG models in which, in addition to rating the quality of REs, participants were asked to accomplish two meta-level tasks. One of these tasks concerned the referential success of each RE; the other asked participants to suggest a better alternative for each RE. The outcomes suggest that, compared to previous evaluations, the new protocol assesses the performance of each REG model more comprehensively and makes the participants' ratings more reliable and better able to discriminate between models.
Abstract: Semantic processing is a fundamental research domain in computational linguistics. In the era of powerful pre-trained language models and large language models, progress in this domain appears to be decelerating. However, the study of semantics is multi-dimensional in linguistics, and the depth and breadth of computational semantic processing research can be greatly improved with new technologies. In this survey, we analyze five semantic processing tasks, namely word sense disambiguation, anaphora resolution, named entity recognition, concept extraction, and subjectivity detection. We review relevant theoretical research in these fields, advanced methods, and downstream applications. We connect the surveyed tasks with downstream applications because this may inspire future scholars to fuse these low-level semantic processing tasks with high-level natural language processing tasks. The review of theoretical research may also inspire new tasks and technologies in the semantic processing domain. Finally, we compare the different semantic processing techniques and summarize their technical trends, application trends, and future directions.
Abstract: The emergence of ChatGPT has generated much speculation in the press about its potential to disrupt social and economic systems. Its astonishing language ability has aroused strong curiosity among scholars about its performance in different domains. Many studies have evaluated the abilities of ChatGPT and GPT-4 across tasks and disciplines, but a comprehensive review summarizing the collective assessment findings is lacking. The objective of this survey is to thoroughly analyze prior assessments of ChatGPT and GPT-4, focusing on their language and reasoning abilities, scientific knowledge, and ethical considerations. Furthermore, we examine existing evaluation methods and offer several recommendations for future research on evaluating large language models.
Abstract: In recent years, many NLP studies have focused solely on performance improvement. In this work, we focus on the linguistic and scientific aspects of NLP. We use the task of generating referring expressions in context (REG-in-context) as a case study and start our analysis from GREC, a comprehensive set of shared tasks in English that addressed this topic over a decade ago. We ask how the models would perform if assessed (1) on more realistic datasets, and (2) using more advanced methods. We test the models using different evaluation metrics and feature selection experiments. We conclude that GREC can no longer be regarded as offering a reliable assessment of models' ability to mimic human reference production, because the results are strongly affected by the choice of corpus and evaluation metrics. Our results also suggest that pre-trained language models are less dependent on the choice of corpus than classic Machine Learning models, and therefore make more robust class predictions.
Abstract: Previous work on Neural Referring Expression Generation (REG) has relied exclusively on WebNLG, an English dataset that has been shown to reflect a very limited range of referring expression (RE) use. To tackle this issue, we build a dataset based on the OntoNotes corpus that contains a broader range of RE use in both English and Chinese (a language that uses zero pronouns). We build neural Referential Form Selection (RFS) models accordingly, assess them on the dataset, and conduct probing experiments. The experiments suggest that, compared to WebNLG, OntoNotes is better suited for assessing REG/RFS models. We compare English and Chinese RFS and confirm that, in line with linguistic theories, Chinese RFS depends more on discourse context than English RFS.
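A minimal sketch of one common form of probing experiment consistent with this description: fit a lightweight classifier on frozen model representations to test whether they encode a property. The random features and labels below are placeholders, not data from the study:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 768))   # stand-in for frozen RFS model embeddings
y = rng.integers(0, 2, size=200)  # stand-in labels, e.g., pronoun vs. full NP

probe = LogisticRegression(max_iter=1000)
scores = cross_val_score(probe, X, y, cv=5)
print(f"probe accuracy: {scores.mean():.2f}")  # ~0.5 here, since the features are random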