Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Timothee Mickus

Life Cycle-Aware Evaluation of Knowledge Distillation for Machine Translation: Environmental Impact and Translation Quality Trade-offs

Feb 10, 2026

Joseph Attieh, Timothee Mickus, Anne-Laure Ligozat, Aurélie Névéol, Jörg Tiedemann

Abstract:Knowledge distillation (KD) is a tool to compress a larger system (teacher) into a smaller one (student). In machine translation, studies typically report only the translation quality of the student and omit the computational complexity of performing KD, making it difficult to select among the many available KD choices under compute-induced constraints. In this study, we evaluate representative KD methods by considering both translation quality and computational cost. We express computational cost as a carbon footprint using the machine learning life cycle assessment (MLCA) tool. This assessment accounts for runtime operational emissions and amortized hardware production costs throughout the KD model life cycle (teacher training, distillation, and inference). We find that (i) distillation overhead dominates the total footprint at small deployment volumes, (ii) inference dominates at scale, making KD beneficial only beyond a task-dependent usage threshold, and (iii) word-level distillation typically offers more favorable footprint-quality trade-offs than sequence-level distillation. Our protocol provides reproducible guidance for selecting KD methods under explicit quality and compute-induced constraints.

Via

Access Paper or Ask Questions

Unravelling the Mechanisms of Manipulating Numbers in Language Models

Oct 30, 2025

Michal Štefánik, Timothee Mickus, Marek Kadlčík, Bertram Højer, Michal Spiegel, Raúl Vázquez, Aman Sinha, Josef Kuchař, Philipp Mondorf

Figure 1 for Unravelling the Mechanisms of Manipulating Numbers in Language Models

Figure 2 for Unravelling the Mechanisms of Manipulating Numbers in Language Models

Figure 3 for Unravelling the Mechanisms of Manipulating Numbers in Language Models

Figure 4 for Unravelling the Mechanisms of Manipulating Numbers in Language Models

Abstract:Recent work has shown that different large language models (LLMs) converge to similar and accurate input embedding representations for numbers. These findings conflict with the documented propensity of LLMs to produce erroneous outputs when dealing with numeric information. In this work, we aim to explain this conflict by exploring how language models manipulate numbers and quantify the lower bounds of accuracy of these mechanisms. We find that despite surfacing errors, different language models learn interchangeable representations of numbers that are systematic, highly accurate and universal across their hidden states and the types of input contexts. This allows us to create universal probes for each LLM and to trace information -- including the causes of output errors -- to specific layers. Our results lay a fundamental understanding of how pre-trained LLMs manipulate numbers and outline the potential of more accurate probing techniques in addressed refinements of LLMs' architectures.

Via

Access Paper or Ask Questions

Confabulations from ACL Publications (CAP): A Dataset for Scientific Hallucination Detection

Oct 25, 2025

Federica Gamba, Aman Sinha, Timothee Mickus, Raul Vazquez, Patanjali Bhamidipati, Claudio Savelli, Ahana Chattopadhyay, Laura A. Zanella, Yash Kankanampati, Binesh Arakkal Remesh(+5 more)

Figure 1 for Confabulations from ACL Publications (CAP): A Dataset for Scientific Hallucination Detection

Figure 2 for Confabulations from ACL Publications (CAP): A Dataset for Scientific Hallucination Detection

Figure 3 for Confabulations from ACL Publications (CAP): A Dataset for Scientific Hallucination Detection

Figure 4 for Confabulations from ACL Publications (CAP): A Dataset for Scientific Hallucination Detection

Abstract:We introduce the CAP (Confabulations from ACL Publications) dataset, a multilingual resource for studying hallucinations in large language models (LLMs) within scientific text generation. CAP focuses on the scientific domain, where hallucinations can distort factual knowledge, as they frequently do. In this domain, however, the presence of specialized terminology, statistical reasoning, and context-dependent interpretations further exacerbates these distortions, particularly given LLMs' lack of true comprehension, limited contextual understanding, and bias toward surface-level generalization. CAP operates in a cross-lingual setting covering five high-resource languages (English, French, Hindi, Italian, and Spanish) and four low-resource languages (Bengali, Gujarati, Malayalam, and Telugu). The dataset comprises 900 curated scientific questions and over 7000 LLM-generated answers from 16 publicly available models, provided as question-answer pairs along with token sequences and corresponding logits. Each instance is annotated with a binary label indicating the presence of a scientific hallucination, denoted as a factuality error, and a fluency label, capturing issues in the linguistic quality or naturalness of the text. CAP is publicly released to facilitate advanced research on hallucination detection, multilingual evaluation of LLMs, and the development of more reliable scientific NLP systems.

Via

Access Paper or Ask Questions

Pre-trained Language Models Learn Remarkably Accurate Representations of Numbers

Jun 10, 2025

Marek Kadlčík, Michal Štefánik, Timothee Mickus, Michal Spiegel, Josef Kuchař

Figure 1 for Pre-trained Language Models Learn Remarkably Accurate Representations of Numbers

Figure 2 for Pre-trained Language Models Learn Remarkably Accurate Representations of Numbers

Figure 3 for Pre-trained Language Models Learn Remarkably Accurate Representations of Numbers

Figure 4 for Pre-trained Language Models Learn Remarkably Accurate Representations of Numbers

Abstract:Pretrained language models (LMs) are prone to arithmetic errors. Existing work showed limited success in probing numeric values from models' representations, indicating that these errors can be attributed to the inherent unreliability of distributionally learned embeddings in representing exact quantities. However, we observe that previous probing methods are inadequate for the emergent structure of learned number embeddings with sinusoidal patterns. In response, we propose a novel probing technique that decodes numeric values from input embeddings with near-perfect accuracy across a range of open-source LMs. This proves that after the sole pre-training, LMs represent numbers with remarkable precision. Finally, we find that the embeddings' preciseness judged by our probe's accuracy explains a large portion of LM's errors in elementary arithmetic, and show that aligning the embeddings with the pattern discovered by our probe can mitigate these errors.

Via

Access Paper or Ask Questions

SemEval-2025 Task 3: Mu-SHROOM, the Multilingual Shared Task on Hallucinations and Related Observable Overgeneration Mistakes

Apr 16, 2025

Raúl Vázquez, Timothee Mickus, Elaine Zosa, Teemu Vahtola, Jörg Tiedemann, Aman Sinha, Vincent Segonne, Fernando Sánchez-Vega, Alessandro Raganato, Jindřich Libovický(+8 more)

Abstract:We present the Mu-SHROOM shared task which is focused on detecting hallucinations and other overgeneration mistakes in the output of instruction-tuned large language models (LLMs). Mu-SHROOM addresses general-purpose LLMs in 14 languages, and frames the hallucination detection problem as a span-labeling task. We received 2,618 submissions from 43 participating teams employing diverse methodologies. The large number of submissions underscores the interest of the community in hallucination detection. We present the results of the participating systems and conduct an empirical analysis to identify key factors contributing to strong performance in this task. We also emphasize relevant current challenges, notably the varying degree of hallucinations across languages and the high annotator disagreement when labeling hallucination spans.

* Mu-SHROOM is part of SemEval-2025 (Task 3). TBP: Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

Via

Access Paper or Ask Questions

Your Model is Overconfident, and Other Lies We Tell Ourselves

Mar 03, 2025

Timothee Mickus, Aman Sinha, Raúl Vázquez

Abstract:The difficulty intrinsic to a given example, rooted in its inherent ambiguity, is a key yet often overlooked factor in evaluating neural NLP models. We investigate the interplay and divergence among various metrics for assessing intrinsic difficulty, including annotator dissensus, training dynamics, and model confidence. Through a comprehensive analysis using 29 models on three datasets, we reveal that while correlations exist among these metrics, their relationships are neither linear nor monotonic. By disentangling these dimensions of uncertainty, we aim to refine our understanding of data complexity and its implications for evaluating and improving NLP models.

Via

Access Paper or Ask Questions

Two Stacks Are Better Than One: A Comparison of Language Modeling and Translation as Multilingual Pretraining Objectives

Jul 22, 2024

Zihao Li, Shaoxiong Ji, Timothee Mickus, Vincent Segonne, Jörg Tiedemann

Figure 1 for Two Stacks Are Better Than One: A Comparison of Language Modeling and Translation as Multilingual Pretraining Objectives

Figure 2 for Two Stacks Are Better Than One: A Comparison of Language Modeling and Translation as Multilingual Pretraining Objectives

Figure 3 for Two Stacks Are Better Than One: A Comparison of Language Modeling and Translation as Multilingual Pretraining Objectives

Figure 4 for Two Stacks Are Better Than One: A Comparison of Language Modeling and Translation as Multilingual Pretraining Objectives

Abstract:Pretrained language models (PLMs) display impressive performances and have captured the attention of the NLP community. Establishing the best practices in pretraining has therefore become a major point of focus for much of NLP research -- especially since the insights developed for monolingual English models need not carry to more complex multilingual. One significant caveat of the current state of the art is that different works are rarely comparable: they often discuss different parameter counts, training data, and evaluation methodology. This paper proposes a comparison of multilingual pretraining objectives in a controlled methodological environment. We ensure that training data and model architectures are comparable, and discuss the downstream performances across 6 languages that we observe in probing and fine-tuning scenarios. We make two key observations: (1) the architecture dictates which pretraining objective is optimal; (2) multilingual translation is a very effective pre-training objective under the right conditions. We make our code, data, and model weights available at \texttt{\url{https://github.com/Helsinki-NLP/lm-vs-mt}}.

Via

Access Paper or Ask Questions

Domain-specific or Uncertainty-aware models: Does it really make a difference for biomedical text classification?

Jul 17, 2024

Aman Sinha, Timothee Mickus, Marianne Clausel, Mathieu Constant, Xavier Coubez

Figure 1 for Domain-specific or Uncertainty-aware models: Does it really make a difference for biomedical text classification?

Figure 2 for Domain-specific or Uncertainty-aware models: Does it really make a difference for biomedical text classification?

Figure 3 for Domain-specific or Uncertainty-aware models: Does it really make a difference for biomedical text classification?

Figure 4 for Domain-specific or Uncertainty-aware models: Does it really make a difference for biomedical text classification?

Abstract:The success of pretrained language models (PLMs) across a spate of use-cases has led to significant investment from the NLP community towards building domain-specific foundational models. On the other hand, in mission critical settings such as biomedical applications, other aspects also factor in-chief of which is a model's ability to produce reasonable estimates of its own uncertainty. In the present study, we discuss these two desiderata through the lens of how they shape the entropy of a model's output probability distribution. We find that domain specificity and uncertainty awareness can often be successfully combined, but the exact task at hand weighs in much more strongly.

* BioNLP 2024

Via

Access Paper or Ask Questions

AXOLOTL'24 Shared Task on Multilingual Explainable Semantic Change Modeling

Jul 04, 2024

Mariia Fedorova, Timothee Mickus, Niko Partanen, Janine Siewert, Elena Spaziani, Andrey Kutuzov

Figure 1 for AXOLOTL'24 Shared Task on Multilingual Explainable Semantic Change Modeling

Figure 2 for AXOLOTL'24 Shared Task on Multilingual Explainable Semantic Change Modeling

Figure 3 for AXOLOTL'24 Shared Task on Multilingual Explainable Semantic Change Modeling

Figure 4 for AXOLOTL'24 Shared Task on Multilingual Explainable Semantic Change Modeling

Abstract:This paper describes the organization and findings of AXOLOTL'24, the first multilingual explainable semantic change modeling shared task. We present new sense-annotated diachronic semantic change datasets for Finnish and Russian which were employed in the shared task, along with a surprise test-only German dataset borrowed from an existing source. The setup of AXOLOTL'24 is new to the semantic change modeling field, and involves subtasks of identifying unknown (novel) senses and providing dictionary-like definitions to these senses. The methods of the winning teams are described and compared, thus paving a path towards explainability in computational approaches to historical change of meaning.

* Proceedings of the 5th Workshop on Computational Approaches to Historical Language Change (ACL'24)

Via

Access Paper or Ask Questions

I Have an Attention Bridge to Sell You: Generalization Capabilities of Modular Translation Architectures

Apr 30, 2024

Timothee Mickus, Raúl Vázquez, Joseph Attieh

Figure 1 for I Have an Attention Bridge to Sell You: Generalization Capabilities of Modular Translation Architectures

Figure 2 for I Have an Attention Bridge to Sell You: Generalization Capabilities of Modular Translation Architectures

Figure 3 for I Have an Attention Bridge to Sell You: Generalization Capabilities of Modular Translation Architectures

Figure 4 for I Have an Attention Bridge to Sell You: Generalization Capabilities of Modular Translation Architectures

Abstract:Modularity is a paradigm of machine translation with the potential of bringing forth models that are large at training time and small during inference. Within this field of study, modular approaches, and in particular attention bridges, have been argued to improve the generalization capabilities of models by fostering language-independent representations. In the present paper, we study whether modularity affects translation quality; as well as how well modular architectures generalize across different evaluation scenarios. For a given computational budget, we find non-modular architectures to be always comparable or preferable to all modular designs we study.

Via

Access Paper or Ask Questions