Natalia Skachkova

When Scale Meets Diversity: Evaluating Language Models on Fine-Grained Multilingual Claim Verification

Jul 28, 2025

Hanna Shcharbakova, Tatiana Anikina, Natalia Skachkova, Josef van Genabith

Abstract: The rapid spread of multilingual misinformation requires robust automated fact verification systems capable of handling fine-grained veracity assessments across diverse languages. While large language models have shown remarkable capabilities across many NLP tasks, their effectiveness for multilingual claim verification with nuanced classification schemes remains understudied. We conduct a comprehensive evaluation of five state-of-the-art language models on the X-Fact dataset, which spans 25 languages with seven distinct veracity categories. Our experiments compare small language models (encoder-based XLM-R and mT5) with recent decoder-only LLMs (Llama 3.1, Qwen 2.5, Mistral Nemo) using both prompting and fine-tuning approaches. Surprisingly, we find that XLM-R (270M parameters) substantially outperforms all tested LLMs (7-12B parameters), achieving 57.7% macro-F1 compared to the best LLM performance of 16.9%. This represents an improvement of 15.8 percentage points over the previous state-of-the-art (41.9%), establishing new performance benchmarks for multilingual fact verification. Our analysis reveals problematic patterns in LLM behavior, including systematic difficulties in leveraging evidence and pronounced biases toward frequent categories in imbalanced data settings. These findings suggest that for fine-grained multilingual fact verification, smaller specialized models may be more effective than general-purpose large models, with important implications for practical deployment of fact-checking systems.

* Published at the FEVER Workshop, ACL 2025
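
The abstract reports macro-F1 and notes a bias toward frequent categories on imbalanced data. The following is a minimal sketch, not the authors' evaluation code, of why such a bias is penalized by macro-F1: every class contributes equally to the average, so a model that always predicts the majority label scores poorly even when its accuracy looks acceptable. The label names and the toy data are hypothetical and are not taken from X-Fact.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced toy set: one frequent class, two rare ones.
gold = ["false"] * 6 + ["true"] * 2 + ["mostly_true"] * 2

# A degenerate classifier that always outputs the most frequent label.
majority_pred = ["false"] * len(gold)

accuracy = sum(g == p for g, p in zip(gold, majority_pred)) / len(gold)
score = macro_f1(gold, majority_pred)

print(f"accuracy = {accuracy:.2f}")
print(f"macro-F1 = {score:.2f}")
```

On this toy set the majority-label classifier reaches 0.60 accuracy but only 0.25 macro-F1, since two of the three classes receive an F1 of zero.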

Anaphora Resolution in Dialogue: System Description (CODI-CRAC 2022 Shared Task)

Jan 05, 2023

Tatiana Anikina, Natalia Skachkova, Joseph Renner, Priyansh Trivedi

Abstract: We describe three models submitted for the CODI-CRAC 2022 shared task. To perform identity anaphora resolution, we test several combinations of the incremental clustering approach based on the Workspace Coreference System (WCS) with other coreference models. The best result is achieved by adding the "cluster merging" version of the coref-hoi model, which brings up to 10.33% improvement over vanilla WCS clustering. Discourse deixis resolution is implemented as multi-task learning: we combine the learning objective of coref-hoi with anaphor type classification. We adapt the higher-order resolution model introduced in Joshi et al. (2019) for bridging resolution given gold mentions and anaphors.

* CODI-CRAC 2022, Oct 2022, Gyeongju, South Korea

Via

Access Paper or Ask Questions