Abstract: Prompting large language models (LLMs) to evaluate generated text, known as LLM-as-a-judge, has become a standard evaluation approach in natural language generation (NLG), but it is primarily used as a quantitative tool, i.e., with numerical scores as the main output. In this work, we propose LLM-as-a-qualitative-judge, an LLM-based evaluation approach whose main output is a structured report of common issue types in the NLG system outputs. Our approach aims to provide developers with meaningful insights into how a given NLG system can be improved, and consists of two main steps, namely open-ended per-instance issue analysis and clustering of the discovered issues using an intuitive cumulative algorithm. We also introduce a strategy for evaluating the proposed approach, coupled with ~300 annotations of issues in instances from 12 NLG datasets. Our results show that LLM-as-a-qualitative-judge correctly recognizes instance-specific issues in 2/3 of cases and is capable of producing error-type reports resembling those composed by human annotators. Our code and data are publicly available at https://github.com/tunde-ajayi/llm-as-a-qualitative-judge.
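The abstract above outlines a two-step pipeline: open-ended per-instance issue analysis followed by cumulative clustering of the discovered issues. A minimal sketch of that pipeline is given below; the `query_llm` helper and the prompt wording are illustrative assumptions, not the authors' exact implementation (see the linked repository for the real code).

```python
def query_llm(prompt: str) -> str:
    """Placeholder for a call to any chat-completion LLM API (assumption)."""
    raise NotImplementedError

def analyze_instance(source: str, output: str) -> str:
    """Step 1: open-ended per-instance issue analysis."""
    return query_llm(
        f"Input:\n{source}\n\nSystem output:\n{output}\n\n"
        "Describe the main issue in this output in one sentence, "
        "or answer 'no issue'."
    )

def cluster_issues(issues: list[str]) -> dict[str, list[str]]:
    """Step 2: cumulative clustering - each new issue is either merged
    into an existing issue type or starts a new one."""
    clusters: dict[str, list[str]] = {}
    for issue in issues:
        if issue.lower().startswith("no issue"):
            continue
        match = query_llm(
            "Existing issue types:\n" + "\n".join(clusters) +
            f"\n\nNew issue: {issue}\n"
            "Reply with the matching issue type, or 'new' if none matches."
        ).strip()
        if match in clusters:
            clusters[match].append(issue)
        else:
            clusters[issue] = [issue]  # the issue description seeds a new cluster
    return clusters
```

The final report would then summarize each cluster together with its frequency, which is the structured output the abstract refers to.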
Abstract: This paper presents the development of a lexicon centered on emerging concepts, focusing on non-technological innovation. It introduces a four-step methodology that combines human expertise, statistical analysis, and machine learning techniques to establish a model that can be generalized across multiple domains. This process includes the creation of a thematic corpus, the development of a Gold Standard Lexicon, the annotation and preparation of a training corpus, and finally, the implementation of learning models to identify new terms. The results demonstrate the robustness and relevance of our approach, highlighting its adaptability to various contexts and its contribution to lexical research. The developed methodology promises applicability to other conceptual fields.
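As an illustration only, the four steps named in the abstract above could be arranged as the skeleton below. All function names, file formats, and the scikit-learn classifier are assumptions made for this sketch; the abstract does not specify these implementation details.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def build_thematic_corpus(sources: list[str]) -> list[str]:
    """Step 1: collect domain documents on non-technological innovation."""
    return [doc for doc in sources if doc.strip()]

def load_gold_standard_lexicon(path: str) -> set[str]:
    """Step 2: expert-curated lexicon of known terms (one term per line, assumed format)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def annotate_candidates(corpus: list[str], lexicon: set[str]):
    """Step 3: label candidate tokens against the gold lexicon to form a training corpus."""
    X, y = [], []
    for doc in corpus:
        for token in doc.lower().split():
            X.append(token)
            y.append(1 if token in lexicon else 0)
    return X, y

def train_term_detector(X: list[str], y: list[int]):
    """Step 4: train a simple model to flag new candidate terms."""
    model = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
        LogisticRegression(max_iter=1000),
    )
    model.fit(X, y)
    return model
```

Any comparable candidate-generation and classification setup would fit the same four-step structure; the choice of learner here is purely for concreteness.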