Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ondřej Bojar

Overview of the Sensemaking Task at the ELOQUENT 2025 Lab: LLMs as Teachers, Students and Evaluators

Jul 16, 2025

Pavel Šindelář, Ondřej Bojar

Abstract:ELOQUENT is a set of shared tasks that aims to create easily testable high-level criteria for evaluating generative language models. Sensemaking is one such shared task. In Sensemaking, we try to assess how well generative models ``make sense out of a given text'' in three steps inspired by exams in a classroom setting: (1) Teacher systems should prepare a set of questions, (2) Student systems should answer these questions, and (3) Evaluator systems should score these answers, all adhering rather strictly to a given set of input materials. We report on the 2025 edition of Sensemaking, where we had 7 sources of test materials (fact-checking analyses of statements, textbooks, transcribed recordings of a lecture, and educational videos) spanning English, German, Ukrainian, and Czech languages. This year, 4 teams participated, providing us with 2 Teacher submissions, 2 Student submissions, and 2 Evaluator submissions. We added baselines for Teacher and Student using commercial large language model systems. We devised a fully automatic evaluation procedure, which we compare to a minimalistic manual evaluation. We were able to make some interesting observations. For the first task, the creation of questions, better evaluation strategies will still have to be devised because it is difficult to discern the quality of the various candidate question sets. In the second task, question answering, the LLMs examined overall perform acceptably, but restricting their answers to the given input texts remains problematic. In the third task, evaluation of question answers, our adversarial tests reveal that systems using the LLM-as-a-Judge paradigm erroneously rate both garbled question-answer pairs and answers to mixed-up questions as acceptable.

* 30 pages, 7 figures, CLEF 2025 Conference and Labs of the Evaluation Forum

Via

Access Paper or Ask Questions

Prompting LLMs: Length Control for Isometric Machine Translation

Jun 05, 2025

Dávid Javorský, Ondřej Bojar, François Yvon

Figure 1 for Prompting LLMs: Length Control for Isometric Machine Translation

Figure 2 for Prompting LLMs: Length Control for Isometric Machine Translation

Figure 3 for Prompting LLMs: Length Control for Isometric Machine Translation

Figure 4 for Prompting LLMs: Length Control for Isometric Machine Translation

Abstract:In this study, we explore the effectiveness of isometric machine translation across multiple language pairs (En$\to$De, En$\to$Fr, and En$\to$Es) under the conditions of the IWSLT Isometric Shared Task 2022. Using eight open-source large language models (LLMs) of varying sizes, we investigate how different prompting strategies, varying numbers of few-shot examples, and demonstration selection influence translation quality and length control. We discover that the phrasing of instructions, when aligned with the properties of the provided demonstrations, plays a crucial role in controlling the output length. Our experiments show that LLMs tend to produce shorter translations only when presented with extreme examples, while isometric demonstrations often lead to the models disregarding length constraints. While few-shot prompting generally enhances translation quality, further improvements are marginal across 5, 10, and 20-shot settings. Finally, considering multiple outputs allows to notably improve overall tradeoff between the length and quality, yielding state-of-the-art performance for some language pairs.

* Accepted to IWSLT 2025

Via

Access Paper or Ask Questions

MockConf: A Student Interpretation Dataset: Analysis, Word- and Span-level Alignment and Baselines

Jun 05, 2025

Dávid Javorský, Ondřej Bojar, François Yvon

Figure 1 for MockConf: A Student Interpretation Dataset: Analysis, Word- and Span-level Alignment and Baselines

Figure 2 for MockConf: A Student Interpretation Dataset: Analysis, Word- and Span-level Alignment and Baselines

Figure 3 for MockConf: A Student Interpretation Dataset: Analysis, Word- and Span-level Alignment and Baselines

Figure 4 for MockConf: A Student Interpretation Dataset: Analysis, Word- and Span-level Alignment and Baselines

Abstract:In simultaneous interpreting, an interpreter renders a source speech into another language with a very short lag, much sooner than sentences are finished. In order to understand and later reproduce this dynamic and complex task automatically, we need dedicated datasets and tools for analysis, monitoring, and evaluation, such as parallel speech corpora, and tools for their automatic annotation. Existing parallel corpora of translated texts and associated alignment algorithms hardly fill this gap, as they fail to model long-range interactions between speech segments or specific types of divergences (e.g., shortening, simplification, functional generalization) between the original and interpreted speeches. In this work, we introduce MockConf, a student interpreting dataset that was collected from Mock Conferences run as part of the students' curriculum. This dataset contains 7 hours of recordings in 5 European languages, transcribed and aligned at the level of spans and words. We further implement and release InterAlign, a modern web-based annotation tool for parallel word and span annotations on long inputs, suitable for aligning simultaneous interpreting. We propose metrics for the evaluation and a baseline for automatic alignment. Dataset and tools are released to the community.

* Accepted to ACL 2025 Main Conference

Via

Access Paper or Ask Questions

How "Real" is Your Real-Time Simultaneous Speech-to-Text Translation System?

Dec 24, 2024

Sara Papi, Peter Polak, Ondřej Bojar, Dominik Macháček

Figure 1 for How "Real" is Your Real-Time Simultaneous Speech-to-Text Translation System?

Figure 2 for How "Real" is Your Real-Time Simultaneous Speech-to-Text Translation System?

Figure 3 for How "Real" is Your Real-Time Simultaneous Speech-to-Text Translation System?

Figure 4 for How "Real" is Your Real-Time Simultaneous Speech-to-Text Translation System?

Abstract:Simultaneous speech-to-text translation (SimulST) translates source-language speech into target-language text concurrently with the speaker's speech, ensuring low latency for better user comprehension. Despite its intended application to unbounded speech, most research has focused on human pre-segmented speech, simplifying the task and overlooking significant challenges. This narrow focus, coupled with widespread terminological inconsistencies, is limiting the applicability of research outcomes to real-world applications, ultimately hindering progress in the field. Our extensive literature review of 110 papers not only reveals these critical issues in current research but also serves as the foundation for our key contributions. We 1) define the steps and core components of a SimulST system, proposing a standardized terminology and taxonomy; 2) conduct a thorough analysis of community trends, and 3) offer concrete recommendations and future directions to bridge the gaps in existing literature, from evaluation frameworks to system architectures, for advancing the field towards more realistic and effective SimulST solutions.

* Accepted at TACL

Via

Access Paper or Ask Questions

Findings of the IWSLT 2024 Evaluation Campaign

Nov 07, 2024

Ibrahim Said Ahmad, Antonios Anastasopoulos, Ondřej Bojar, Claudia Borg, Marine Carpuat, Roldano Cattoni, Mauro Cettolo, William Chen, Qianqian Dong, Marcello Federico(+35 more)

Abstract:This paper reports on the shared tasks organized by the 21st IWSLT Conference. The shared tasks address 7 scientific challenges in spoken language translation: simultaneous and offline translation, automatic subtitling and dubbing, speech-to-speech translation, dialect and low-resource speech translation, and Indic languages. The shared tasks attracted 18 teams whose submissions are documented in 26 system papers. The growing interest towards spoken language translation is also witnessed by the constantly increasing number of shared task organizers and contributors to the overview paper, almost evenly distributed across industry and academia.

* IWSLT 2024; 59 pages

Via

Access Paper or Ask Questions

Adversarial Testing as a Tool for Interpretability: Length-based Overfitting of Elementary Functions in Transformers

Oct 17, 2024

Patrik Zavoral, Dušan Variš, Ondřej Bojar

Figure 1 for Adversarial Testing as a Tool for Interpretability: Length-based Overfitting of Elementary Functions in Transformers

Figure 2 for Adversarial Testing as a Tool for Interpretability: Length-based Overfitting of Elementary Functions in Transformers

Figure 3 for Adversarial Testing as a Tool for Interpretability: Length-based Overfitting of Elementary Functions in Transformers

Figure 4 for Adversarial Testing as a Tool for Interpretability: Length-based Overfitting of Elementary Functions in Transformers

Abstract:The Transformer model has a tendency to overfit various aspects of the training data, such as the overall sequence length. We study elementary string edit functions using a defined set of error indicators to interpret the behaviour of the sequence-to-sequence Transformer. We show that generalization to shorter sequences is often possible, but confirm that longer sequences are highly problematic, although partially correct answers are often obtained. Additionally, we find that other structural characteristics of the sequences, such as subsegment length, may be equally important. We hypothesize that the models learn algorithmic aspects of the tasks simultaneously with structural aspects but adhering to the structural aspects is unfortunately often preferred by Transformer when they come into conflict.

* 9 pages, 8 figures, 2 tables; to be published

Via

Access Paper or Ask Questions

Evaluating the IWSLT2023 Speech Translation Tasks: Human Annotations, Automatic Metrics, and Segmentation

Jun 06, 2024

Matthias Sperber, Ondřej Bojar, Barry Haddow, Dávid Javorský, Xutai Ma, Matteo Negri, Jan Niehues, Peter Polák, Elizabeth Salesky, Katsuhito Sudoh(+1 more)

Figure 1 for Evaluating the IWSLT2023 Speech Translation Tasks: Human Annotations, Automatic Metrics, and Segmentation

Figure 2 for Evaluating the IWSLT2023 Speech Translation Tasks: Human Annotations, Automatic Metrics, and Segmentation

Figure 3 for Evaluating the IWSLT2023 Speech Translation Tasks: Human Annotations, Automatic Metrics, and Segmentation

Figure 4 for Evaluating the IWSLT2023 Speech Translation Tasks: Human Annotations, Automatic Metrics, and Segmentation

Abstract:Human evaluation is a critical component in machine translation system development and has received much attention in text translation research. However, little prior work exists on the topic of human evaluation for speech translation, which adds additional challenges such as noisy data and segmentation mismatches. We take first steps to fill this gap by conducting a comprehensive human evaluation of the results of several shared tasks from the last International Workshop on Spoken Language Translation (IWSLT 2023). We propose an effective evaluation strategy based on automatic resegmentation and direct assessment with segment context. Our analysis revealed that: 1) the proposed evaluation strategy is robust and scores well-correlated with other types of human judgements; 2) automatic metrics are usually, but not always, well-correlated with direct assessment scores; and 3) COMET as a slightly stronger automatic metric than chrF, despite the segmentation noise introduced by the resegmentation step systems. We release the collected human-annotated data in order to encourage further investigation.

* Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
* LREC-COLING2024 publication (with corrections for Table 3)

Via

Access Paper or Ask Questions

Understanding the role of FFNs in driving multilingual behaviour in LLMs

Apr 22, 2024

Sunit Bhattacharya, Ondřej Bojar

Figure 1 for Understanding the role of FFNs in driving multilingual behaviour in LLMs

Figure 2 for Understanding the role of FFNs in driving multilingual behaviour in LLMs

Figure 3 for Understanding the role of FFNs in driving multilingual behaviour in LLMs

Figure 4 for Understanding the role of FFNs in driving multilingual behaviour in LLMs

Abstract:Multilingualism in Large Language Models (LLMs) is an yet under-explored area. In this paper, we conduct an in-depth analysis of the multilingual capabilities of a family of a Large Language Model, examining its architecture, activation patterns, and processing mechanisms across languages. We introduce novel metrics to probe the model's multilingual behaviour at different layers and shed light on the impact of architectural choices on multilingual processing. Our findings reveal different patterns of multilinugal processing in the sublayers of Feed-Forward Networks of the models. Furthermore, we uncover the phenomenon of "over-layerization" in certain model configurations, where increasing layer depth without corresponding adjustments to other parameters may degrade model performance. Through comparisons within and across languages, we demonstrate the interplay between model architecture, layer depth, and multilingual processing capabilities of LLMs trained on multiple languages.

* 10 pages

Via

Access Paper or Ask Questions

On Difficulties of Attention Factorization through Shared Memory

Mar 31, 2024

Uladzislau Yorsh, Martin Holeňa, Ondřej Bojar, David Herel

Abstract:Transformers have revolutionized deep learning in numerous fields, including natural language processing, computer vision, and audio processing. Their strength lies in their attention mechanism, which allows for the discovering of complex input relationships. However, this mechanism's quadratic time and memory complexity pose challenges for larger inputs. Researchers are now investigating models like Linear Unified Nested Attention (Luna) or Memory Augmented Transformer, which leverage external learnable memory to either reduce the attention computation complexity down to linear, or to propagate information between chunks in chunk-wise processing. Our findings challenge the conventional thinking on these models, revealing that interfacing with the memory directly through an attention operation is suboptimal, and that the performance may be considerably improved by filtering the input signal before communicating with memory.

* 2 pages of main content, 8 pages in total, published as a Tiny Paper at ICLR 2024

Via

Access Paper or Ask Questions

Quality and Quantity of Machine Translation References for Automated Metrics

Jan 08, 2024

Vilém Zouhar, Ondřej Bojar

Abstract:Automatic machine translation metrics often use human translations to determine the quality of system translations. Common wisdom in the field dictates that the human references should be of very high quality. However, there are no cost-benefit analyses that could be used to guide practitioners who plan to collect references for machine translation evaluation. We find that higher-quality references lead to better metric correlations with humans at the segment-level. Having up to 7 references per segment and taking their average helps all metrics. Interestingly, the references from vendors of different qualities can be mixed together and improve metric success. Higher quality references, however, cost more to create and we frame this as an optimization problem: given a specific budget, what references should be collected to maximize metric success. These findings can be used by evaluators of shared tasks when references need to be created under a certain budget.

Via

Access Paper or Ask Questions