Chuyuan Li

BeDiscovER: The Benchmark of Discourse Understanding in the Era of Reasoning Language Models

Nov 17, 2025

Chuyuan Li, Giuseppe Carenini

Abstract: We introduce BeDiscovER (Benchmark of Discourse Understanding in the Era of Reasoning Language Models), an up-to-date, comprehensive suite for evaluating the discourse-level knowledge of modern LLMs. BeDiscovER compiles 5 publicly available discourse tasks across the discourse lexicon, (multi-)sentential, and document levels, totaling 52 individual datasets. It covers extensively studied tasks such as discourse parsing and temporal relation extraction, some novel challenges such as discourse particle disambiguation (e.g., "just"), and a shared task on Discourse Relation Parsing and Treebanking for multilingual and multi-framework discourse relation classification. We evaluate open-source LLMs (the Qwen3 series and DeepSeek-R1) and frontier models such as GPT-5-mini on BeDiscovER, and find that state-of-the-art models perform strongly on the arithmetic aspect of temporal reasoning but struggle with full-document reasoning and with some subtle semantic and discourse phenomena, such as rhetorical relation recognition.

Confabulations from ACL Publications (CAP): A Dataset for Scientific Hallucination Detection

Oct 25, 2025

Federica Gamba, Aman Sinha, Timothee Mickus, Raul Vazquez, Patanjali Bhamidipati, Claudio Savelli, Ahana Chattopadhyay, Laura A. Zanella, Yash Kankanampati, Binesh Arakkal Remesh (+5 more)

Abstract: We introduce the CAP (Confabulations from ACL Publications) dataset, a multilingual resource for studying hallucinations in large language models (LLMs) within scientific text generation. CAP focuses on the scientific domain, where hallucinations can distort factual knowledge, as they frequently do. In this domain, however, specialized terminology, statistical reasoning, and context-dependent interpretations further exacerbate these distortions, particularly given LLMs' lack of true comprehension, limited contextual understanding, and bias toward surface-level generalization. CAP operates in a cross-lingual setting covering five high-resource languages (English, French, Hindi, Italian, and Spanish) and four low-resource languages (Bengali, Gujarati, Malayalam, and Telugu). The dataset comprises 900 curated scientific questions and over 7,000 LLM-generated answers from 16 publicly available models, provided as question-answer pairs along with token sequences and corresponding logits. Each instance is annotated with a binary label indicating the presence of a scientific hallucination, denoted as a factuality error, and a fluency label capturing issues in the linguistic quality or naturalness of the text. CAP is publicly released to facilitate advanced research on hallucination detection, multilingual evaluation of LLMs, and the development of more reliable scientific NLP systems.

Multi$^2$: Multi-Agent Test-Time Scalable Framework for Multi-Document Processing

Feb 27, 2025

Juntai Cao, Xiang Zhang, Raymond Li, Chuyuan Li, Shafiq Joty, Giuseppe Carenini

Abstract: Recent advances in test-time scaling have shown promising results in improving the performance of Large Language Models (LLMs) through strategic computation allocation during inference. While this approach has demonstrated strong performance improvements in logical and mathematical reasoning tasks, its application to natural language generation (NLG), especially summarization, has yet to be explored. Multi-Document Summarization (MDS) is a challenging task that focuses on extracting and synthesizing useful information from multiple lengthy documents. Unlike reasoning tasks, MDS requires a more nuanced approach to prompt design and ensembling, as there is no "best" prompt to satisfy diverse summarization requirements. To address this, we propose a novel framework that leverages inference-time scaling for this task. Specifically, we take a prompt ensemble approach: we use various prompts to first generate candidate summaries and then combine them with an aggregator to produce a refined summary. We also introduce two new evaluation metrics, the Consistency-Aware Preference (CAP) score and the LLM Atom-Content-Unit (ACU) score, to enhance the LLM's contextual understanding while mitigating its positional bias. Extensive experiments demonstrate the effectiveness of our approach in improving summary quality while identifying and analyzing the scaling boundaries in summarization tasks.
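
The generate-then-aggregate idea described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `prompt_ensemble_summarize`, the `generate` callable (standing in for any LLM call), and the wording of the aggregation prompt are all assumptions made for this example.

```python
from typing import Callable, List


def prompt_ensemble_summarize(
    documents: List[str],
    prompts: List[str],
    generate: Callable[[str], str],
    aggregate_prompt: str = (
        "Combine the candidate summaries below into one refined summary."
    ),
) -> str:
    """Hypothetical prompt-ensemble summarizer.

    `generate` is an assumed text-in, text-out model call supplied by the
    caller; nothing here depends on a specific LLM library.
    """
    source = "\n\n".join(documents)

    # Step 1: one candidate summary per prompt, all over the same documents.
    candidates = [generate(f"{prompt}\n\n{source}") for prompt in prompts]

    # Step 2: a final aggregator call sees every candidate and merges them.
    numbered = "\n".join(
        f"[{i + 1}] {candidate}" for i, candidate in enumerate(candidates)
    )
    return generate(f"{aggregate_prompt}\n\n{numbered}")
```

In this sketch, inference cost grows with the number of prompts: `len(prompts)` generation calls plus one aggregation call per summary.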

Discourse Structure Extraction from Pre-Trained and Fine-Tuned Language Models in Dialogues

Feb 12, 2023

Chuyuan Li, Patrick Huber, Wen Xiao, Maxime Amblard, Chloé Braud, Giuseppe Carenini

Abstract: Discourse processing suffers from data sparsity, especially for dialogues. As a result, we explore approaches to build discourse structures for dialogues, based on attention matrices from Pre-trained Language Models (PLMs). We investigate multiple tasks for fine-tuning and show that the dialogue-tailored Sentence Ordering task performs best. To locate and exploit discourse information in PLMs, we propose an unsupervised and a semi-supervised method. Our proposals achieve encouraging results on the STAC corpus, with F1 scores of 57.2 and 59.3 for the unsupervised and semi-supervised methods, respectively. When restricted to projective trees, our scores improve to 63.3 and 68.1.
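
To make the idea of reading a structure off an attention matrix concrete, here is a small sketch. It is not the method from the paper: the abstract does not specify the decoding procedure, so this example assumes a simple greedy rule in which each utterance attaches to the earlier utterance it scores most highly. The function name `greedy_backward_tree` and the input layout (`scores[i][j]` is the score from utterance `i` to utterance `j`) are assumptions for illustration.

```python
from typing import List


def greedy_backward_tree(scores: List[List[float]]) -> List[int]:
    """Assumed decoding rule: attach each utterance to its best earlier one.

    Returns a list `heads` where heads[i] is the index of the utterance that
    utterance i attaches to, and heads[0] == -1 marks the root. Because every
    link points backward in the dialogue, the result is always a tree.
    """
    n = len(scores)
    heads = [-1] * n
    for i in range(1, n):
        # Only earlier utterances (j < i) are candidate heads.
        heads[i] = max(range(i), key=lambda j: scores[i][j])
    return heads


# Toy 3-utterance dialogue with made-up scores.
toy_scores = [
    [0.0, 0.0, 0.0],
    [0.9, 0.0, 0.0],
    [0.2, 0.7, 0.0],
]
print(greedy_backward_tree(toy_scores))  # [-1, 0, 1]
```

A greedy rule like this ignores global constraints such as projectivity; a constrained decoder would be needed to produce the projective trees the abstract mentions.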