Leonhard Hennig

Order in the Evaluation Court: A Critical Analysis of NLG Evaluation Trends

Jan 12, 2026

Jing Yang, Nils Feldhus, Salar Mohtaj, Leonhard Hennig, Qianli Wang, Eleni Metheniti, Sherzod Hakimov, Charlott Jakob, Veronika Solopova, Konrad Rieck(+3 more)

Abstract:Despite advances in Natural Language Generation (NLG), evaluation remains challenging. Although various new metrics and LLM-as-a-judge (LaaJ) methods are proposed, human judgment persists as the gold standard. To systematically review how NLG evaluation has evolved, we employ an automatic information extraction scheme to gather key information from NLG papers, focusing on different evaluation methods (metrics, LaaJ and human evaluation). With extracted metadata from 14,171 papers across four major conferences (ACL, EMNLP, NAACL, and INLG) over the past six years, we reveal several critical findings: (1) Task Divergence: While Dialogue Generation demonstrates a rapid shift toward LaaJ (>40% in 2025), Machine Translation remains locked into n-gram metrics, and Question Answering exhibits a substantial decline in the proportion of studies conducting human evaluation. (2) Metric Inertia: Despite the development of semantic metrics, general-purpose metrics (e.g., BLEU, ROUGE) continue to be widely used across tasks without empirical justification, often lacking the discriminative power to distinguish between specific quality criteria. (3) Human-LaaJ Divergence: Our association analysis challenges the assumption that LLMs act as mere proxies for humans; LaaJ and human evaluations prioritize very different signals, and explicit validation is scarce (<8% of papers comparing the two), with only moderate to low correlation. Based on these observations, we derive practical recommendations to improve the rigor of future NLG evaluation.

* 8 pages

Entity Linking using LLMs for Automated Product Carbon Footprint Estimation

Feb 11, 2025

Steffen Castle, Julian Moreno Schneider, Leonhard Hennig, Georg Rehm

Abstract:Growing concerns about climate change and sustainability are driving manufacturers to take significant steps toward reducing their carbon footprints. For these manufacturers, a first step towards this goal is to identify the environmental impact of the individual components of their products. We propose a system leveraging large language models (LLMs) to automatically map components from manufacturer Bills of Materials (BOMs) to Life Cycle Assessment (LCA) database entries by using LLMs to expand on available component information. Our approach reduces the need for manual data processing, paving the way for more accessible sustainability practices.

* Proceedings of The 1st Workshop on Ecology, Environment, and Natural Language Processing (2025)

Reverse Probing: Evaluating Knowledge Transfer via Finetuned Task Embeddings for Coreference Resolution

Jan 31, 2025

Tatiana Anikina, Arne Binder, David Harbecke, Stalin Varanasi, Leonhard Hennig, Simon Ostermann, Sebastian Möller, Josef van Genabith

Abstract:In this work, we reimagine classical probing to evaluate knowledge transfer from simple source to more complex target tasks. Instead of probing frozen representations from a complex source task on diverse simple target probing tasks (as usually done in probing), we explore the effectiveness of embeddings from multiple simple source tasks on a single target task. We select coreference resolution, a linguistically complex problem requiring contextual understanding, as the focus target task, and test the usefulness of embeddings from comparably simpler tasks such as paraphrase detection, named entity recognition, and relation extraction. Through systematic experiments, we evaluate the impact of individual and combined task embeddings. Our findings reveal that task embeddings vary significantly in utility for coreference resolution, with semantic similarity tasks (e.g., paraphrase detection) proving most beneficial. Additionally, representations from intermediate layers of fine-tuned models often outperform those from final layers. Combining embeddings from multiple tasks consistently improves performance, with attention-based aggregation yielding substantial gains. These insights shed light on relationships between task-specific representations and their adaptability to complex downstream tasks, encouraging further exploration of embedding-level task transfer.
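
The attention-based aggregation of task embeddings mentioned in this abstract can be pictured with a minimal sketch. The abstract does not give the aggregation formula, so everything here is an assumption for illustration: the function name `attention_pool`, the single learned `query` vector, and the scaled dot-product scoring are hypothetical and not taken from the paper.

```python
import math

def attention_pool(task_embeddings, query):
    """Combine several task embeddings into one vector (illustrative sketch).

    task_embeddings: list of equal-length float lists, one per source task.
    query: float list of the same length (assumed to be learned elsewhere).
    Returns (pooled_vector, attention_weights).
    """
    d = len(query)
    # Score each task embedding by its scaled dot product with the query.
    scores = [
        sum(e_k * q_k for e_k, q_k in zip(emb, query)) / math.sqrt(d)
        for emb in task_embeddings
    ]
    # Numerically stable softmax over the task scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    # Weighted sum of the task embeddings, dimension by dimension.
    pooled = [
        sum(w * emb[k] for w, emb in zip(weights, task_embeddings))
        for k in range(d)
    ]
    return pooled, weights
```

With an all-zero query every task receives the same weight, so the pooled vector reduces to the plain average of the task embeddings; a non-zero query shifts weight toward the embeddings it aligns with.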

Symmetric Dot-Product Attention for Efficient Training of BERT Language Models

Jun 10, 2024

Martin Courtois, Malte Ostendorff, Leonhard Hennig, Georg Rehm

Abstract:Initially introduced as a machine translation model, the Transformer architecture has now become the foundation for modern deep learning architectures, with applications in a wide range of fields, from computer vision to natural language processing. Nowadays, to tackle increasingly complex tasks, Transformer-based models are stretched to enormous sizes, requiring increasingly large training datasets and an unsustainable amount of compute resources. The ubiquitous nature of the Transformer and its core component, the attention mechanism, are thus prime targets for efficiency research. In this work, we propose an alternative compatibility function for the self-attention mechanism introduced by the Transformer architecture. This compatibility function exploits an overlap in the learned representation of the traditional scaled dot-product attention, leading to a symmetric, pairwise-coefficient dot-product attention. When applied to the pre-training of BERT-like models, this new symmetric attention mechanism reaches a score of 79.36 on the GLUE benchmark against 78.74 for the traditional implementation, leads to a reduction of 6% in the number of trainable parameters, and reduces the number of training steps required before convergence by half.

* to be published in Findings of the Association for Computational Linguistics: ACL 2024
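
One way to picture a symmetric compatibility function is sketched below. The abstract does not spell out the formulation, so this rests on stated assumptions: queries and keys share a single projection matrix `w`, and the "pairwise coefficients" are modeled as a per-dimension vector `coeff` (equivalent to a diagonal matrix). The helper names are hypothetical and this is not the authors' implementation.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def symmetric_scores(x, w, coeff):
    """Scaled scores s_ij = sum_k p_ik * coeff_k * p_jk / sqrt(d), with p = x @ w.

    Because the same projection serves as both query and key (assumption),
    the score matrix is symmetric: s_ij == s_ji up to rounding.
    """
    p = matmul(x, w)
    d = len(coeff)
    scale = math.sqrt(d)
    return [
        [sum(pi[k] * coeff[k] * pj[k] for k in range(d)) / scale for pj in p]
        for pi in p
    ]

def symmetric_attention(x, w, coeff, w_v):
    """Single-head attention output using the symmetric scores above."""
    weights = [softmax(row) for row in symmetric_scores(x, w, coeff)]
    return matmul(weights, matmul(x, w_v))
```

Sharing one projection removes the separate key matrix, which is the kind of change that would reduce the trainable parameter count; note that the softmax is still applied row by row, so the attention weights themselves are generally not symmetric even though the raw scores are.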

LLMCheckup: Conversational Examination of Large Language Models via Interpretability Tools

Jan 23, 2024

Qianli Wang, Tatiana Anikina, Nils Feldhus, Josef van Genabith, Leonhard Hennig, Sebastian Möller

Abstract:Interpretability tools that offer explanations in the form of a dialogue have demonstrated their efficacy in enhancing users' understanding, as one-off explanations may occasionally fall short in providing sufficient information to the user. Current solutions for dialogue-based explanations, however, require many dependencies and are not easily transferable to tasks they were not designed for. With LLMCheckup, we present an easily accessible tool that allows users to chat with any state-of-the-art large language model (LLM) about its behavior. We enable LLMs to generate all explanations by themselves and take care of intent recognition without fine-tuning, by connecting them with a broad spectrum of Explainable AI (XAI) tools, e.g. feature attributions, embedding-based similarity, and prompting strategies for counterfactual and rationale generation. LLM (self-)explanations are presented as an interactive dialogue that supports follow-up questions and generates suggestions. LLMCheckup provides tutorials for operations available in the system, catering to individuals with varying levels of expertise in XAI, and supports multiple input modalities. We introduce a new parsing strategy called multi-prompt parsing, which substantially enhances the parsing accuracy of LLMs. Finally, we showcase the tasks of fact checking and commonsense question answering.

Factuality Detection using Machine Translation -- a Use Case for German Clinical Text

Aug 17, 2023

Mohammed Bin Sumait, Aleksandra Gabryszak, Leonhard Hennig, Roland Roller

Abstract:Factuality can play an important role when automatically processing clinical text, as it makes a difference if particular symptoms are explicitly not present, possibly present, not mentioned, or affirmed. In most cases, a sufficient number of examples is necessary to handle such phenomena in a supervised machine learning setting. However, as clinical text might contain sensitive information, data cannot be easily shared. In the context of factuality detection, this work presents a simple solution using machine translation to translate English data to German to train a transformer-based factuality detection model.

* Accepted at KONVENS 2023

MultiTACRED: A Multilingual Version of the TAC Relation Extraction Dataset

May 15, 2023

Leonhard Hennig, Philippe Thomas, Sebastian Möller

Abstract:Relation extraction (RE) is a fundamental task in information extraction, whose extension to multilingual settings has been hindered by the lack of supervised resources comparable in size to large English datasets such as TACRED (Zhang et al., 2017). To address this gap, we introduce the MultiTACRED dataset, covering 12 typologically diverse languages from 9 language families, which is created by machine-translating TACRED instances and automatically projecting their entity annotations. We analyze translation and annotation projection quality, identify error categories, and experimentally evaluate fine-tuned pretrained mono- and multilingual language models in common transfer learning scenarios. Our analyses show that machine translation is a viable strategy to transfer RE instances, with native speakers judging more than 83% of the translated instances to be linguistically and semantically acceptable. We find monolingual RE model performance to be comparable to the English original for many of the target languages, and that multilingual models trained on a combination of English and target language data can outperform their monolingual counterparts. However, we also observe a variety of translation and annotation projection errors, both due to the MT systems and linguistic features of the target languages, such as pronoun-dropping, compounding and inflection, that degrade dataset quality and RE model performance.

* Accepted at ACL 2023

Multilingual Relation Classification via Efficient and Effective Prompting

Oct 26, 2022

Yuxuan Chen, David Harbecke, Leonhard Hennig

Abstract:Prompting pre-trained language models has achieved impressive performance on various NLP tasks, especially in low data regimes. Despite the success of prompting in monolingual settings, applying prompt-based methods in multilingual scenarios has been limited to a narrow set of tasks, due to the high cost of handcrafting multilingual prompts. In this paper, we present the first work on prompt-based multilingual relation classification (RC), by introducing an efficient and effective method that constructs prompts from relation triples and involves only minimal translation for the class labels. We evaluate its performance in fully supervised, few-shot and zero-shot scenarios, and analyze its effectiveness across 14 languages, prompt variants, and English-task training in cross-lingual settings. We find that in both fully supervised and few-shot scenarios, our prompt method beats competitive baselines: fine-tuning XLM-R_EM and null prompts. It also outperforms the random baseline by a large margin in zero-shot experiments. Our method requires little in-language knowledge and can be used as a strong baseline for similar multilingual classification tasks.

* EMNLP 2022
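
The idea of building a prompt from a relation triple, with only the class label translated, can be sketched as follows. The abstract does not give the template, so the layout (input text, then head entity, a label slot, and tail entity), the `<mask>` placeholder, the relation name `born_in`, and the `VERBALIZERS` table are all hypothetical choices made for illustration, not the paper's actual prompts.

```python
# Hypothetical label verbalizations: only these short phrases need translating.
VERBALIZERS = {
    "en": {"born_in": "was born in", "works_for": "works for"},
    "de": {"born_in": "wurde geboren in", "works_for": "arbeitet für"},
}

def build_prompt(text, head, tail, mask_token="<mask>"):
    """Append a 'head <slot> tail' triple to the input text.

    With the default mask_token the slot is left open for a model to fill.
    """
    return f"{text} {head} {mask_token} {tail}"

def build_labeled_prompt(text, head, tail, label, lang):
    """Fill the slot with the verbalized class label for the given language."""
    return build_prompt(text, head, tail, mask_token=VERBALIZERS[lang][label])

def candidate_prompts(text, head, tail, lang):
    """One filled prompt per class label, e.g. for scoring candidates."""
    return {
        label: build_labeled_prompt(text, head, tail, label, lang)
        for label in VERBALIZERS[lang]
    }
```

Because the entities are copied from the input and only the label phrases come from the lookup table, adding a language in this sketch means translating one short phrase per class rather than handcrafting full prompts.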

Full-Text Argumentation Mining on Scientific Publications

Oct 24, 2022

Arne Binder, Bhuvanesh Verma, Leonhard Hennig

Abstract:Scholarly Argumentation Mining (SAM) has recently gained attention due to its potential to help scholars with the rapid growth of published scientific literature. It comprises two subtasks: argumentative discourse unit recognition (ADUR) and argumentative relation extraction (ARE), both of which are challenging since they require e.g. the integration of domain knowledge, the detection of implicit statements, and the disambiguation of argument structure. While previous work focused on dataset construction and baseline methods for specific document sections, such as abstract or results, full-text scholarly argumentation mining has seen little progress. In this work, we introduce a sequential pipeline model combining ADUR and ARE for full-text SAM, and provide a first analysis of the performance of pretrained language models (PLMs) on both subtasks. We establish a new SotA for ADUR on the Sci-Arg corpus, outperforming the previous best reported result by a large margin (+7% F1). We also present the first results for ARE, and thus for the full AM pipeline, on this benchmark dataset. Our detailed error analysis reveals that non-contiguous ADUs as well as the interpretation of discourse connectors pose major challenges and that data annotation needs to be more consistent.

Confidence estimation of classification based on the distribution of the neural network output layer

Oct 18, 2022

Abdel Aziz Taha, Leonhard Hennig, Petr Knoth

Abstract:One of the most common problems preventing the application of prediction models in the real world is lack of generalization: the accuracy of models measured on a benchmark does not repeat itself on future data, e.g. in real business settings. Relatively few methods exist that estimate the confidence of prediction models. In this paper, we propose novel methods that, given a neural network classification model, estimate the uncertainty of particular predictions generated by this model. Furthermore, we propose a method that, given a model and a confidence level, calculates a threshold that separates the predictions generated by this model into two subsets, one of which meets the given confidence level. In contrast to other methods, the proposed methods do not require any changes to existing neural networks, because they simply build on the output logit layer of a common neural network. In particular, the methods infer the confidence of a particular prediction based on the distribution of the logit values corresponding to this prediction. The proposed methods constitute a tool that is recommended for filtering predictions in the process of knowledge extraction, e.g. based on web scraping, where prediction subsets are identified that maximize precision at the cost of recall, which is less important due to the availability of data. The methods have been tested on different tasks, including relation extraction, named entity recognition and image classification, to show the significant increase in accuracy achieved.

* Draft
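
The thresholding step described in this abstract can be illustrated with a small sketch. The abstract does not say which statistic of the logit distribution is used, so two things are assumed here and are not taken from the paper: the confidence score is the margin between the two largest logits, and the threshold is chosen on a labeled held-out set so that the retained predictions reach a target precision. The function names are hypothetical.

```python
def margin_score(logits):
    """Confidence proxy (assumption): gap between the two largest logits."""
    top, second = sorted(logits, reverse=True)[:2]
    return top - second

def pick_threshold(scores, correct, target_precision):
    """Smallest score threshold t such that the predictions with score >= t
    reach the target precision on held-out data; None if no threshold does.

    scores: confidence score per held-out prediction.
    correct: 1/0 (or True/False) per prediction, whether it was right.
    """
    pairs = sorted(zip(scores, correct), key=lambda p: p[0], reverse=True)
    best = None
    kept = hits = 0
    i = 0
    while i < len(pairs):
        current = pairs[i][0]
        # Take all predictions tied at this score together.
        while i < len(pairs) and pairs[i][0] == current:
            kept += 1
            hits += pairs[i][1]
            i += 1
        if hits / kept >= target_precision:
            best = current
    return best

def keep_confident(all_logits, threshold):
    """Indices of predictions whose margin score meets the threshold."""
    return [i for i, lg in enumerate(all_logits) if margin_score(lg) >= threshold]
```

Lowering the threshold keeps more predictions and raises recall, while raising it trades recall for precision, which matches the filtering use case the abstract describes; the model itself is left unchanged because only its output logits are read.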
