Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Florian Boudin

SciClaimEval: Cross-modal Claim Verification in Scientific Papers

Feb 07, 2026

Xanh Ho, Yun-Ang Wu, Sunisth Kumar, Tian Cheng Xia, Florian Boudin, Andre Greiner-Petter, Akiko Aizawa

Abstract:We present SciClaimEval, a new scientific dataset for the claim verification task. Unlike existing resources, SciClaimEval features authentic claims, including refuted ones, directly extracted from published papers. To create refuted claims, we introduce a novel approach that modifies the supporting evidence (figures and tables), rather than altering the claims or relying on large language models (LLMs) to fabricate contradictions. The dataset provides cross-modal evidence with diverse representations: figures are available as images, while tables are provided in multiple formats, including images, LaTeX source, HTML, and JSON. SciClaimEval contains 1,664 annotated samples from 180 papers across three domains, machine learning, natural language processing, and medicine, validated through expert annotation. We benchmark 11 multimodal foundation models, both open-source and proprietary, across the dataset. Results show that figure-based verification remains particularly challenging for all models, as a substantial performance gap remains between the best system and human baseline.

* 12 pages; data is available at https://sciclaimeval.github.io/

Via

Access Paper or Ask Questions

FC-CONAN: An Exhaustively Paired Dataset for Robust Evaluation of Retrieval Systems

Jan 04, 2026

Juan Junqueras, Florian Boudin, May-Myo Zin, Ha-Thanh Nguyen, Wachara Fungwacharakorn, Damián Ariel Furman, Akiko Aizawa, Ken Satoh

Abstract:Hate speech (HS) is a critical issue in online discourse, and one promising strategy to counter it is through the use of counter-narratives (CNs). Datasets linking HS with CNs are essential for advancing counterspeech research. However, even flagship resources like CONAN (Chung et al., 2019) annotate only a sparse subset of all possible HS-CN pairs, limiting evaluation. We introduce FC-CONAN (Fully Connected CONAN), the first dataset created by exhaustively considering all combinations of 45 English HS messages and 129 CNs. A two-stage annotation process involving nine annotators and four validators produces four partitions-Diamond, Gold, Silver, and Bronze-that balance reliability and scale. None of the labeled pairs overlap with CONAN, uncovering hundreds of previously unlabelled positives. FC-CONAN enables more faithful evaluation of counterspeech retrieval systems and facilitates detailed error analysis. The dataset is publicly available.

* Presented at NeLaMKRR@KR, 2025 (arXiv:2511.09575)

Via

Access Paper or Ask Questions

Format Matters: The Robustness of Multimodal LLMs in Reviewing Evidence from Tables and Charts

Nov 13, 2025

Xanh Ho, Yun-Ang Wu, Sunisth Kumar, Florian Boudin, Atsuhiro Takasu, Akiko Aizawa

Figure 1 for Format Matters: The Robustness of Multimodal LLMs in Reviewing Evidence from Tables and Charts

Figure 2 for Format Matters: The Robustness of Multimodal LLMs in Reviewing Evidence from Tables and Charts

Figure 3 for Format Matters: The Robustness of Multimodal LLMs in Reviewing Evidence from Tables and Charts

Figure 4 for Format Matters: The Robustness of Multimodal LLMs in Reviewing Evidence from Tables and Charts

Abstract:With the growing number of submitted scientific papers, there is an increasing demand for systems that can assist reviewers in evaluating research claims. Experimental results are a core component of scientific work, often presented in varying formats such as tables or charts. Understanding how robust current multimodal large language models (multimodal LLMs) are at verifying scientific claims across different evidence formats remains an important and underexplored challenge. In this paper, we design and conduct a series of experiments to assess the ability of multimodal LLMs to verify scientific claims using both tables and charts as evidence. To enable this evaluation, we adapt two existing datasets of scientific papers by incorporating annotations and structures necessary for a multimodal claim verification task. Using this adapted dataset, we evaluate 12 multimodal LLMs and find that current models perform better with table-based evidence while struggling with chart-based evidence. We further conduct human evaluations and observe that humans maintain strong performance across both formats, unlike the models. Our analysis also reveals that smaller multimodal LLMs (under 8B) show weak correlation in performance between table-based and chart-based tasks, indicating limited cross-modal generalization. These findings highlight a critical gap in current models' multimodal reasoning capabilities. We suggest that future multimodal LLMs should place greater emphasis on improving chart understanding to better support scientific claim verification.

* Accepted at AAAI 2026

Via

Access Paper or Ask Questions

An Analysis of Datasets, Metrics and Models in Keyphrase Generation

Jun 12, 2025

Florian Boudin, Akiko Aizawa

Figure 1 for An Analysis of Datasets, Metrics and Models in Keyphrase Generation

Figure 2 for An Analysis of Datasets, Metrics and Models in Keyphrase Generation

Figure 3 for An Analysis of Datasets, Metrics and Models in Keyphrase Generation

Figure 4 for An Analysis of Datasets, Metrics and Models in Keyphrase Generation

Abstract:Keyphrase generation refers to the task of producing a set of words or phrases that summarises the content of a document. Continuous efforts have been dedicated to this task over the past few years, spreading across multiple lines of research, such as model architectures, data resources, and use-case scenarios. Yet, the current state of keyphrase generation remains unknown as there has been no attempt to review and analyse previous work. In this paper, we bridge this gap by presenting an analysis of over 50 research papers on keyphrase generation, offering a comprehensive overview of recent progress, limitations, and open challenges. Our findings highlight several critical issues in current evaluation practices, such as the concerning similarity among commonly-used benchmark datasets and inconsistencies in metric calculations leading to overestimated performances. Additionally, we address the limited availability of pre-trained models by releasing a strong PLM-based model for keyphrase generation as an effort to facilitate future research.

* GEM^2 paper @ ACL 2025

Via

Access Paper or Ask Questions

Table-Text Alignment: Explaining Claim Verification Against Tables in Scientific Papers

Jun 12, 2025

Xanh Ho, Sunisth Kumar, Yun-Ang Wu, Florian Boudin, Atsuhiro Takasu, Akiko Aizawa

Figure 1 for Table-Text Alignment: Explaining Claim Verification Against Tables in Scientific Papers

Figure 2 for Table-Text Alignment: Explaining Claim Verification Against Tables in Scientific Papers

Figure 3 for Table-Text Alignment: Explaining Claim Verification Against Tables in Scientific Papers

Figure 4 for Table-Text Alignment: Explaining Claim Verification Against Tables in Scientific Papers

Abstract:Scientific claim verification against tables typically requires predicting whether a claim is supported or refuted given a table. However, we argue that predicting the final label alone is insufficient: it reveals little about the model's reasoning and offers limited interpretability. To address this, we reframe table-text alignment as an explanation task, requiring models to identify the table cells essential for claim verification. We build a new dataset by extending the SciTab benchmark with human-annotated cell-level rationales. Annotators verify the claim label and highlight the minimal set of cells needed to support their decision. After the annotation process, we utilize the collected information and propose a taxonomy for handling ambiguous cases. Our experiments show that (i) incorporating table alignment information improves claim verification performance, and (ii) most LLMs, while often predicting correct labels, fail to recover human-aligned rationales, suggesting that their predictions do not stem from faithful reasoning.

* 8 pages; code and data are available at https://github.com/Alab-NII/SciTabAlign

Via

Access Paper or Ask Questions

No Stupid Questions: An Analysis of Question Query Generation for Citation Recommendation

Jun 09, 2025

Brian D. Zimmerman, Julien Aubert-Béduchaud, Florian Boudin, Akiko Aizawa, Olga Vechtomova

Abstract:Existing techniques for citation recommendation are constrained by their adherence to article contents and metadata. We leverage GPT-4o-mini's latent expertise as an inquisitive assistant by instructing it to ask questions which, when answered, could expose new insights about an excerpt from a scientific article. We evaluate the utility of these questions as retrieval queries, measuring their effectiveness in retrieving and ranking masked target documents. In some cases, generated questions ended up being better queries than extractive keyword queries generated by the same model. We additionally propose MMR-RBO, a variation of Maximal Marginal Relevance (MMR) using Rank-Biased Overlap (RBO) to identify which questions will perform competitively with the keyword baseline. As all question queries yield unique result sets, we contend that there are no stupid questions.

* 6 pages, 5 figures, 2 tables

Via

Access Paper or Ask Questions

Identifying Reliable Evaluation Metrics for Scientific Text Revision

Jun 06, 2025

Léane Jourdan, Florian Boudin, Richard Dufour, Nicolas Hernandez

Abstract:Evaluating text revision in scientific writing remains a challenge, as traditional metrics such as ROUGE and BERTScore primarily focus on similarity rather than capturing meaningful improvements. In this work, we analyse and identify the limitations of these metrics and explore alternative evaluation methods that better align with human judgments. We first conduct a manual annotation study to assess the quality of different revisions. Then, we investigate reference-free evaluation metrics from related NLP domains. Additionally, we examine LLM-as-a-judge approaches, analysing their ability to assess revisions with and without a gold reference. Our results show that LLMs effectively assess instruction-following but struggle with correctness, while domain-specific metrics provide complementary insights. We find that a hybrid approach combining LLM-as-a-judge evaluation and task-specific metrics offers the most reliable assessment of revision quality.

* V1 contains only the English version, accepted to ACL 2025 main (26 pages). V2 contains both English (ACL 2025) and French (TALN 2025) versions (58 pages)

Via

Access Paper or Ask Questions

LLM-as-a-Judge: Reassessing the Performance of LLMs in Extractive QA

Apr 16, 2025

Xanh Ho, Jiahao Huang, Florian Boudin, Akiko Aizawa

Figure 1 for LLM-as-a-Judge: Reassessing the Performance of LLMs in Extractive QA

Figure 2 for LLM-as-a-Judge: Reassessing the Performance of LLMs in Extractive QA

Figure 3 for LLM-as-a-Judge: Reassessing the Performance of LLMs in Extractive QA

Figure 4 for LLM-as-a-Judge: Reassessing the Performance of LLMs in Extractive QA

Abstract:Extractive reading comprehension question answering (QA) datasets are typically evaluated using Exact Match (EM) and F1-score, but these metrics often fail to fully capture model performance. With the success of large language models (LLMs), they have been employed in various tasks, including serving as judges (LLM-as-a-judge). In this paper, we reassess the performance of QA models using LLM-as-a-judge across four reading comprehension QA datasets. We examine different families of LLMs and various answer types to evaluate the effectiveness of LLM-as-a-judge in these tasks. Our results show that LLM-as-a-judge is highly correlated with human judgments and can replace traditional EM/F1 metrics. By using LLM-as-a-judge, the correlation with human judgments improves significantly, from 0.17 (EM) and 0.36 (F1-score) to 0.85. These findings confirm that EM and F1 metrics underestimate the true performance of the QA models. While LLM-as-a-judge is not perfect for more difficult answer types (e.g., job), it still outperforms EM/F1, and we observe no bias issues, such as self-preference, when the same model is used for both the QA and judgment tasks.

* 17 pages; code and data are available at https://github.com/Alab-NII/llm-judge-extract-qa

Via

Access Paper or Ask Questions

ParaRev: Building a dataset for Scientific Paragraph Revision annotated with revision instruction

Jan 09, 2025

Léane Jourdan, Nicolas Hernandez, Richard Dufour, Florian Boudin, Akiko Aizawa

Abstract:Revision is a crucial step in scientific writing, where authors refine their work to improve clarity, structure, and academic quality. Existing approaches to automated writing assistance often focus on sentence-level revisions, which fail to capture the broader context needed for effective modification. In this paper, we explore the impact of shifting from sentence-level to paragraph-level scope for the task of scientific text revision. The paragraph level definition of the task allows for more meaningful changes, and is guided by detailed revision instructions rather than general ones. To support this task, we introduce ParaRev, the first dataset of revised scientific paragraphs with an evaluation subset manually annotated with revision instructions. Our experiments demonstrate that using detailed instructions significantly improves the quality of automated revisions compared to general approaches, no matter the model or the metric considered.

* Accepted at the WRAICogs 1 workoshop (co-located with Coling 2025)

Via

Access Paper or Ask Questions

Self-Compositional Data Augmentation for Scientific Keyphrase Generation

Nov 05, 2024

Mael Houbre, Florian Boudin, Beatrice Daille, Akiko Aizawa

Figure 1 for Self-Compositional Data Augmentation for Scientific Keyphrase Generation

Figure 2 for Self-Compositional Data Augmentation for Scientific Keyphrase Generation

Figure 3 for Self-Compositional Data Augmentation for Scientific Keyphrase Generation

Figure 4 for Self-Compositional Data Augmentation for Scientific Keyphrase Generation

Abstract:State-of-the-art models for keyphrase generation require large amounts of training data to achieve good performance. However, obtaining keyphrase-labeled documents can be challenging and costly. To address this issue, we present a self-compositional data augmentation method. More specifically, we measure the relatedness of training documents based on their shared keyphrases, and combine similar documents to generate synthetic samples. The advantage of our method lies in its ability to create additional training samples that keep domain coherence, without relying on external data or resources. Our results on multiple datasets spanning three different domains, demonstrate that our method consistently improves keyphrase generation. A qualitative analysis of the generated keyphrases for the Computer Science domain confirms this improvement towards their representativity property.

* Accepted to JCDL 2024 This version is not the final camera ready version

Via

Access Paper or Ask Questions