Shorouq Zahra

Climate-Eval: A Comprehensive Benchmark for NLP Tasks Related to Climate Change

May 24, 2025

Murathan Kurfalı, Shorouq Zahra, Joakim Nivre, Gabriele Messori

Abstract: Climate-Eval is a comprehensive benchmark designed to evaluate natural language processing models across a broad range of tasks related to climate change. Climate-Eval aggregates existing datasets along with a newly developed news classification dataset, created specifically for this release. This results in a benchmark of 25 tasks based on 13 datasets, covering key aspects of climate discourse, including text classification, question answering, and information extraction. Our benchmark provides a standardized evaluation suite for systematically assessing the performance of large language models (LLMs) on these tasks. Additionally, we conduct an extensive evaluation of open-source LLMs (ranging from 2B to 70B parameters) in both zero-shot and few-shot settings, analyzing their strengths and limitations in the domain of climate change.

* Accepted to ClimateNLP 2025@ACL

Can LLMs Detect Intrinsic Hallucinations in Paraphrasing and Machine Translation?

Apr 29, 2025

Evangelia Gogoulou, Shorouq Zahra, Liane Guillou, Luise Dürlich, Joakim Nivre

Abstract: A frequently observed problem with LLMs is their tendency to generate output that is nonsensical, illogical, or factually incorrect, often referred to broadly as hallucination. Building on the recently proposed HalluciGen task for hallucination detection and generation, we evaluate a suite of open-access LLMs on their ability to detect intrinsic hallucinations in two conditional generation tasks: translation and paraphrasing. We study how model performance varies across tasks and languages, and we investigate the impact of model size, instruction tuning, and prompt choice. We find that performance varies across models but is consistent across prompts. Finally, we find that NLI models perform comparably well, suggesting that LLM-based detectors are not the only viable option for this specific task.
