Vinay Setty

Multilingual Fact-Checking at Scale: Fine-Tuned Compact Models vs LLMs

Jun 07, 2026

Pratuat Amatya, Vinay Setty

Abstract:We present a multilingual fact-checking system deployed at Factiverse, designed for high-throughput and low-latency operation across diverse languages. The system follows a modular pipeline with three stages: claim detection, evidence retrieval and re-ranking, and veracity prediction. We fine-tune XLM-RoBERTa-Large for claim detection, mmBERT-base for three-label stance classification (Supports/Refutes/Mixed), and a SetFit-based multilingual re-ranker for claim-evidence matching. We compare these components against strong LLM baselines, including GPT-5.2, Claude Opus 4.6, and Qwen3-8b. Experiments on production data spanning 114 languages for claim detection and 28 languages for veracity prediction show that task-specific fine-tuning provides strong and stable multilingual performance, while the fine-tuned retrieval model remains competitive with modern proprietary embeddings. Same-hardware latency measurements further show large efficiency gains for encoder-based components, supporting their use in production deployments with tight cost and privacy constraints. Overall, compact fine-tuned, self-hosted models remain a practical and effective foundation for multilingual fact-checking at scale. Code and data used for this study are available at https://github.com/factiverse/factcheck-editor.
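
The three-stage pipeline described in this abstract (claim detection, evidence retrieval and re-ranking, veracity prediction) can be pictured with a minimal Python sketch. It is not the Factiverse implementation: the function `fact_check`, the `Verdict` record, and every `toy_*` component are hypothetical stand-ins for the fine-tuned models, and the word-overlap heuristics exist only to make the example runnable.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Verdict:
    claim: str
    evidence: List[str]
    labels: List[str]  # one stance label per evidence snippet


def fact_check(text: str,
               detect_claims: Callable[[str], List[str]],
               retrieve: Callable[[str], List[str]],
               rerank: Callable[[str, List[str]], List[str]],
               classify_stance: Callable[[str, str], str],
               top_k: int = 3) -> List[Verdict]:
    """Run the three stages in order; each stage is an injected component."""
    results = []
    for claim in detect_claims(text):                 # stage 1: claim detection
        candidates = retrieve(claim)                  # stage 2a: evidence retrieval
        ranked = rerank(claim, candidates)[:top_k]    # stage 2b: re-ranking
        labels = [classify_stance(claim, ev) for ev in ranked]  # stage 3: stance
        results.append(Verdict(claim, ranked, labels))
    return results


# Toy components (assumptions for illustration only, not the paper's models).
CORPUS = ["Norway has about 5 million inhabitants",
          "Oslo is the capital of Norway"]


def overlap(a: str, b: str) -> int:
    """Number of lowercase words shared by two strings."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def toy_detect(text: str) -> List[str]:
    # Treat any sentence containing a digit as a check-worthy claim.
    return [s.strip() for s in text.split(".") if any(c.isdigit() for c in s)]


def toy_retrieve(claim: str) -> List[str]:
    return list(CORPUS)


def toy_rerank(claim: str, candidates: List[str]) -> List[str]:
    return sorted(candidates, key=lambda ev: overlap(claim, ev), reverse=True)


def toy_stance(claim: str, evidence: str) -> str:
    # The real label set is Supports/Refutes/Mixed; this toy emits only two of them.
    return "Supports" if overlap(claim, evidence) >= 3 else "Mixed"
```

Passing the stages in as callables mirrors the modular design the abstract mentions: any stage could be swapped between a compact encoder and an LLM baseline without touching the others.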

The CLEF-2026 CheckThat! Lab: Advancing Multilingual Fact-Checking

Feb 10, 2026

Julia Maria Struß, Sebastian Schellhammer, Stefan Dietze, Venktesh V, Vinay Setty, Tanmoy Chakraborty, Preslav Nakov, Avishek Anand, Primakov Chungkham, Salim Hafid(+2 more)

Abstract:The CheckThat! lab aims to advance the development of innovative technologies combating disinformation and manipulation efforts in online communication across a multitude of languages and platforms. While in early editions the focus has been on core tasks of the verification pipeline (check-worthiness, evidence retrieval, and verification), in the past three editions, the lab added additional tasks linked to the verification process. In this year's edition, the verification pipeline is at the center again with the following tasks: Task 1 on source retrieval for scientific web claims (a follow-up of the 2025 edition), Task 2 on fact-checking numerical and temporal claims, which adds a reasoning component to the 2025 edition, and Task 3, which expands the verification pipeline with generation of full fact-checking articles. These tasks represent challenging classification and retrieval problems as well as generation challenges at the document and span level, including multilingual settings.

* misinformation, disinformation, fact-checking, claim source retrieval, generating fact-checking articles

One LLM to Train Them All: Multi-Task Learning Framework for Fact-Checking

Jan 16, 2026

Malin Astrid Larsson, Harald Fosen Grunnaleite, Vinay Setty

Abstract:Large language models (LLMs) are reshaping automated fact-checking (AFC) by enabling unified, end-to-end verification pipelines rather than isolated components. While large proprietary models achieve strong performance, their closed weights, complexity, and high costs limit sustainability. Fine-tuning smaller open-weight models for individual AFC tasks can help but requires multiple specialized models, resulting in high costs. We propose multi-task learning (MTL) as a more efficient alternative that fine-tunes a single model to perform claim detection, evidence ranking, and stance detection jointly. Using small decoder-only LLMs (e.g., Qwen3-4b), we explore three MTL strategies: classification heads, causal language modeling heads, and instruction-tuning, and evaluate them across model sizes, task orders, and standard non-LLM baselines. While multitask models do not universally surpass single-task baselines, they yield substantial improvements, achieving up to 44%, 54%, and 31% relative gains for claim detection, evidence re-ranking, and stance detection, respectively, over zero-/few-shot settings. Finally, we also provide practical, empirically grounded guidelines to help practitioners apply MTL with LLMs for automated fact-checking.

* Accepted version in ECIR 2026

NumPert: Numerical Perturbations to Probe Language Models for Veracity Prediction

Nov 13, 2025

Peter Røysland Aarnes, Vinay Setty

Abstract:Large language models show strong performance on knowledge-intensive tasks such as fact-checking and question answering, yet they often struggle with numerical reasoning. We present a systematic evaluation of state-of-the-art models for veracity prediction on numerical claims and evidence pairs using controlled perturbations, including label-flipping probes, to test robustness. Our results indicate that even leading proprietary systems experience accuracy drops of up to 62% under certain perturbations. No model proves to be robust across all conditions. We further find that increasing context length generally reduces accuracy, but when extended context is enriched with perturbed demonstrations, most models substantially recover. These findings highlight critical limitations in numerical fact-checking and suggest that robustness remains an open challenge for current language models.

* Accepted at IJCNLP-AACL SRW
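
As a rough illustration of what a controlled numerical perturbation might look like, the sketch below rescales every number in a claim by a fixed factor. The abstract does not specify the perturbation operators used in NumPert, so the function `perturb_numbers` and its scaling rule are purely hypothetical.

```python
import re

# Matches integers and simple decimals such as "120" or "3.5".
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def perturb_numbers(text: str, factor: float) -> str:
    """Multiply every number found in `text` by `factor` (hypothetical probe)."""
    def replace(match: re.Match) -> str:
        value = float(match.group()) * factor
        # Print whole values without a trailing ".0" so the claim stays natural.
        return str(int(value)) if value.is_integer() else f"{value:g}"
    return NUMBER.sub(replace, text)
```

A model that is robust in the sense discussed above would be expected to change its verdict when the perturbed claim no longer matches the evidence, and to keep it when the evidence is perturbed in the same way.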

Think Right, Not More: Test-Time Scaling for Numerical Claim Verification

Sep 26, 2025

Primakov Chungkham, V Venktesh, Vinay Setty, Avishek Anand

Abstract:Fact-checking real-world claims, particularly numerical claims, is inherently complex and requires multistep and numerical reasoning to verify diverse aspects of the claim. Although large language models (LLMs), including reasoning models, have made tremendous advances, they still fall short on fact-checking real-world claims that require a combination of compositional and numerical reasoning. They are unable to understand the nuances of numerical aspects and are also susceptible to the reasoning drift issue, where the model is unable to contextualize diverse information, resulting in misinterpretation and backtracking in the reasoning process. In this work, we systematically explore scaling test-time compute (TTS) for LLMs on the task of fact-checking complex numerical claims, which entails eliciting multiple reasoning paths from an LLM. We train a verifier model (VERIFIERFC) to navigate this space of possible reasoning paths and select one that could lead to the correct verdict. We observe that TTS helps mitigate the reasoning drift issue, leading to significant performance gains for fact-checking numerical claims. To improve compute efficiency in TTS, we introduce an adaptive mechanism that performs TTS selectively based on the perceived complexity of the claim. This approach achieves 1.8x higher efficiency than standard TTS, while delivering a notable 18.8% performance improvement over single-shot claim verification methods. Our code and data can be found at https://github.com/VenkteshV/VerifierFC

* Accepted to EMNLP 2025, 19 pages
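
The adaptive test-time scaling idea in this abstract (sample several reasoning paths only for claims judged complex, then let a verifier pick one) can be sketched as follows. This is a minimal illustration, not the VERIFIERFC code: `adaptive_tts`, the threshold, and all `toy_*` helpers are hypothetical, and a real system would call an LLM and a trained verifier where the toys use fixed rules.

```python
from itertools import count
from typing import Callable, Tuple

Path = Tuple[str, str]  # (reasoning text, verdict label)


def adaptive_tts(claim: str,
                 sample_path: Callable[[str], Path],
                 score_path: Callable[[str, Path], float],
                 estimate_complexity: Callable[[str], float],
                 n_paths: int = 4,
                 threshold: float = 0.5) -> Tuple[str, int]:
    """Return (verdict, number of reasoning paths sampled)."""
    # Claims judged simple get a single pass, which saves compute.
    if estimate_complexity(claim) < threshold:
        return sample_path(claim)[1], 1
    # Complex claims get several paths; the verifier's top-scored path wins.
    paths = [sample_path(claim) for _ in range(n_paths)]
    best = max(paths, key=lambda path: score_path(claim, path))
    return best[1], n_paths


# Toy stand-ins (assumptions for illustration only).
def make_toy_sampler() -> Callable[[str], Path]:
    calls = count()

    def sample(claim: str) -> Path:
        i = next(calls)
        # Every third sampled path reaches the verdict "True".
        return (f"path {i}", "True" if i % 3 == 2 else "False")
    return sample


def toy_verifier(claim: str, path: Path) -> float:
    # Pretend the verifier prefers paths that conclude "True".
    return 1.0 if path[1] == "True" else 0.0


def toy_complexity(claim: str) -> float:
    # Treat any claim that mentions a number as complex.
    return 1.0 if any(ch.isdigit() for ch in claim) else 0.0
```

The second return value makes the efficiency trade-off visible: simple claims cost one sampled path, complex ones cost `n_paths`.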

The CLEF-2025 CheckThat! Lab: Subjectivity, Fact-Checking, Claim Normalization, and Retrieval

Mar 19, 2025

Firoj Alam, Julia Maria Struß, Tanmoy Chakraborty, Stefan Dietze, Salim Hafid, Katerina Korre, Arianna Muti, Preslav Nakov, Federico Ruggeri, Sebastian Schellhammer(+4 more)

Abstract:The CheckThat! lab aims to advance the development of innovative technologies designed to identify and counteract online disinformation and manipulation efforts across various languages and platforms. The first five editions focused on key tasks in the information verification pipeline, including check-worthiness, evidence retrieval and pairing, and verification. Since the 2023 edition, the lab has expanded its scope to address auxiliary tasks that support research and decision-making in verification. In the 2025 edition, the lab revisits core verification tasks while also considering auxiliary challenges. Task 1 focuses on the identification of subjectivity (a follow-up from CheckThat! 2024), Task 2 addresses claim normalization, Task 3 targets fact-checking numerical claims, and Task 4 explores scientific web discourse processing. These tasks present challenging classification and retrieval problems at both the document and span levels, including multilingual settings.

* misinformation, factuality, fact-checking, fact-checkers, check-worthiness, Social Media Platforms

FactIR: A Real-World Zero-shot Open-Domain Retrieval Benchmark for Fact-Checking

Feb 09, 2025

Venktesh V, Vinay Setty

Abstract:The field of automated fact-checking increasingly depends on retrieving web-based evidence to determine the veracity of claims in real-world scenarios. A significant challenge in this process is not only retrieving relevant information, but also identifying evidence that can both support and refute complex claims. Traditional retrieval methods may return documents that directly address claims or lean toward supporting them, but often struggle with more complex claims requiring indirect reasoning. While some existing benchmarks and methods target retrieval for fact-checking, a comprehensive real-world open-domain benchmark has been lacking. In this paper, we present a real-world retrieval benchmark FactIR, derived from Factiverse production logs, enhanced with human annotations. We rigorously evaluate state-of-the-art retrieval models in a zero-shot setup on FactIR and offer insights for developing practical retrieval systems for fact-checking. Code and data are available at https://github.com/factiverse/factIR.

* Accepted to WWW 2025 resource track

FlashCheck: Exploration of Efficient Evidence Retrieval for Fast Fact-Checking

Feb 09, 2025

Kevin Nanekhan, Venktesh V, Erik Martin, Henrik Vatndal, Vinay Setty, Avishek Anand

Abstract:Advances in digital tools have led to the rampant spread of misinformation. While fact-checking aims to combat this, manual fact-checking is cumbersome and not scalable. Automated fact-checking must be efficient to help combat misinformation in real time and at the source. Fact-checking pipelines primarily comprise a knowledge retrieval component, which extracts relevant knowledge to fact-check a claim from large knowledge sources like Wikipedia, and a verification component. Existing works primarily focus on the fact-verification part rather than evidence retrieval from large data collections, which often faces scalability issues in practical applications such as live fact-checking. In this study, we address this gap by exploring various methods for indexing a succinct set of factual statements from large collections like Wikipedia to enhance the retrieval phase of the fact-checking pipeline. We also explore the impact of vector quantization to further improve the efficiency of pipelines that employ dense retrieval approaches for first-stage retrieval. We study the efficiency and effectiveness of the approaches on fact-checking datasets such as HoVer and WiCE, leveraging Wikipedia as the knowledge source. We also evaluate the real-world utility of the efficient retrieval approaches by fact-checking the 2024 presidential debate, and we open-source the collection of claims with corresponding labels identified in the debate. Through a combination of indexed facts together with dense retrieval and index compression, we achieve up to a 10.0x speedup on CPUs and more than a 20.0x speedup on GPUs compared to the classical fact-checking pipelines over large collections.

* Accepted to ECIR 2024, 15 pages
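
To make the index-compression idea concrete, here is a small sketch of dense retrieval over 8-bit quantized embeddings. It uses plain scalar quantization as a simpler stand-in, which may differ from the vector quantization scheme evaluated in the paper; `quantize`, `search`, and the toy vectors are hypothetical and assume embedding values lie in [-1, 1].

```python
from typing import List


def quantize(vec: List[float], scale: int = 127) -> List[int]:
    """Map floats in [-1, 1] to integers in [-127, 127] (scalar quantization)."""
    return [max(-scale, min(scale, round(x * scale))) for x in vec]


def dot(a: List[int], b: List[int]) -> int:
    """Integer dot product used as the similarity score."""
    return sum(x * y for x, y in zip(a, b))


def search(query: List[int], index: List[List[int]], k: int) -> List[int]:
    """Return the positions of the k index vectors most similar to `query`."""
    order = sorted(range(len(index)),
                   key=lambda i: dot(query, index[i]),
                   reverse=True)
    return order[:k]


# Toy embeddings (assumptions for illustration only).
DOCS = [[0.9, 0.1, 0.0],
        [0.1, 0.9, 0.0],
        [0.0, 0.2, 0.9]]
QUERY = [0.8, 0.2, 0.1]
```

Storing each dimension in one byte instead of a 32-bit float shrinks the index roughly fourfold, at the cost of small rounding errors in the similarity scores; on these toy vectors the ranking is unchanged.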

Annotation Tool and Dataset for Fact-Checking Podcasts

Feb 03, 2025

Vinay Setty, Adam James Becker

Abstract:Podcasts are a popular medium on the web, featuring diverse and multilingual content that often includes unverified claims. Fact-checking podcasts is a challenging task, requiring transcription, annotation, and claim verification, all while preserving the contextual details of spoken content. Our tool offers a novel approach to tackle these challenges by enabling real-time annotation of podcasts during playback. This unique capability allows users to listen to the podcast and annotate key elements, such as check-worthy claims, claim spans, and contextual errors, simultaneously. By integrating advanced transcription models like OpenAI's Whisper and leveraging crowdsourced annotations, we create high-quality datasets to fine-tune multilingual transformer models such as XLM-RoBERTa for tasks like claim detection and stance classification. Furthermore, we release the annotated podcast transcripts and sample annotations with preliminary experiments.

* Accepted as resource paper in TheWebConf 2025

DISCO: DISCovering Overfittings as Causal Rules for Text Classification Models

Nov 07, 2024

Zijian Zhang, Vinay Setty, Yumeng Wang, Avishek Anand

Abstract:With the rapid advancement of neural language models, the deployment of over-parameterized models has surged, increasing the need for interpretable explanations comprehensible to human inspectors. Existing post-hoc interpretability methods, which often focus on unigram features of single input textual instances, fail to capture the models' decision-making process fully. Additionally, many methods do not differentiate between decisions based on spurious correlations and those based on a holistic understanding of the input. Our paper introduces DISCO, a novel method for discovering global, rule-based explanations by identifying causal n-gram associations with model predictions. This method employs a scalable sequence mining technique to extract relevant text spans from training data, associate them with model predictions, and conduct causality checks to distill robust rules that elucidate model behavior. These rules expose potential overfitting and provide insights into misleading feature combinations. We validate DISCO through extensive testing, demonstrating its superiority over existing methods in offering comprehensive insights into complex model behaviors. Our approach successfully identifies all shortcuts manually introduced into the training data (100% detection rate on the MultiRC dataset), resulting in an 18.8% regression in model performance -- a capability unmatched by any other method. Furthermore, DISCO supports interactive explanations, enabling human inspectors to distinguish spurious causes in the rule-based output. This alleviates the burden of abundant instance-wise explanations and helps assess the model's risk when encountering out-of-distribution (OOD) data.
