Mikołaj Pokrywka

ConECT Dataset: Overcoming Data Scarcity in Context-Aware E-Commerce MT

Jun 05, 2025

Mikołaj Pokrywka, Wojciech Kusa, Mieszko Rutkowski, Mikołaj Koszowski

Abstract: Neural Machine Translation (NMT) has improved translation by using Transformer-based models, but it still struggles with word ambiguity and context. This problem is especially important in domain-specific applications, which often suffer from unclear sentences or poor data quality. Our research explores how adding information to models can improve translations in the context of e-commerce data. To this end, we create ConECT, a new Czech-to-Polish e-commerce product translation dataset of 11,400 sentence pairs coupled with images and product metadata. We then investigate and compare different methods that are applicable to context-aware translation. We test a vision-language model (VLM), finding that visual context aids translation quality. Additionally, we explore the incorporation of contextual information into text-to-text models, such as the product's category path or image descriptions. The results of our study demonstrate that incorporating contextual information improves machine translation quality. We make the new dataset publicly available.

* Accepted at ACL 2025 (The 63rd Annual Meeting of the Association for Computational Linguistics)
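The idea of feeding a category path or image description to a text-to-text model can be sketched as simple input prefixing. This is a minimal illustration only: the `ProductSegment` class, the `<category>`, `<image>`, and `<sep>` tags, and the example strings are all hypothetical and are not taken from the ConECT paper or dataset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProductSegment:
    """One source segment plus optional context (hypothetical schema)."""
    source: str                           # Czech source text to translate
    category_path: Optional[str] = None   # e.g. "Elektronika > Telefony"
    image_caption: Optional[str] = None   # textual description of the product image


def build_model_input(segment: ProductSegment, sep: str = " <sep> ") -> str:
    """Prepend whatever context is available to the source text.

    The tags are assumptions made for this sketch; a real system would
    use whatever context markers its model was trained with.
    """
    parts = []
    if segment.category_path:
        parts.append(f"<category> {segment.category_path}")
    if segment.image_caption:
        parts.append(f"<image> {segment.image_caption}")
    parts.append(segment.source)
    return sep.join(parts)


if __name__ == "__main__":
    seg = ProductSegment(
        source="pouzdro na mobil",
        category_path="Elektronika > Telefony",
    )
    print(build_model_input(seg))
```

With no context fields set, the function returns the source text unchanged, so the same pipeline can serve both context-aware and plain inputs.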

MultiSlav: Using Cross-Lingual Knowledge Transfer to Combat the Curse of Multilinguality

Feb 20, 2025

Artur Kot, Mikołaj Koszowski, Wojciech Chojnowski, Mieszko Rutkowski, Artur Nowakowski, Kamil Guttmann, Mikołaj Pokrywka

Abstract: Does multilingual Neural Machine Translation (NMT) lead to the Curse of Multilinguality, or does it provide Cross-lingual Knowledge Transfer within a language family? In this study, we explore multiple approaches for extending the available data regime in NMT and show cross-lingual benefits even in the zero-shot translation regime for low-resource languages. With this paper, we provide state-of-the-art open-source NMT models for translating between selected Slavic languages. We released our models on the HuggingFace Hub (https://hf.co/collections/allegro/multislav-6793d6b6419e5963e759a683) under the CC BY 4.0 license. The Slavic language family comprises morphologically rich Central and Eastern European languages. Although Slavic languages have hundreds of millions of native speakers, Slavic Neural Machine Translation is, in our opinion, under-studied. Recent NMT research focuses mostly on one of three areas: high-resource languages like English, Spanish, and German (in the WMT23 General Translation Task, 7 out of 8 task directions are from or to English); massively multilingual models covering multiple language groups; or evaluation techniques.

Chasing COMET: Leveraging Minimum Bayes Risk Decoding for Self-Improving Machine Translation

May 20, 2024

Kamil Guttmann, Mikołaj Pokrywka, Adrian Charkiewicz, Artur Nowakowski

Abstract: This paper explores Minimum Bayes Risk (MBR) decoding for self-improvement in machine translation (MT), particularly for domain adaptation and low-resource languages. We implement the self-improvement process by fine-tuning the model on its MBR-decoded forward translations. By employing COMET as the MBR utility metric, we aim to rerank translations in a way that better aligns with human preferences. The paper explores the iterative application of this approach and the potential need for language-specific MBR utility metrics. The results demonstrate significant enhancements in translation quality for all examined language pairs, including successful application to domain-adapted models and generalisation to low-resource settings. This highlights the potential of COMET-guided MBR for efficient MT self-improvement in various scenarios.

* EAMT 2024
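The selection step of MBR decoding can be sketched in a few lines: each candidate translation is scored by its average utility against all other candidates, and the highest-scoring one is kept. The sketch below is an assumption-laden illustration, not the paper's implementation. In particular, `token_f1` is a toy token-overlap stand-in for the utility metric (the paper uses COMET), and the function names are hypothetical.

```python
from collections import Counter
from typing import Callable, List


def token_f1(hypothesis: str, reference: str) -> float:
    """Toy utility: token-overlap F1. A stand-in for COMET, not COMET itself."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp or not ref:
        return 0.0
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def mbr_select(
    candidates: List[str],
    utility: Callable[[str, str], float] = token_f1,
) -> str:
    """Return the candidate with the highest mean utility against the others.

    The other candidates act as pseudo-references, which approximates
    the expected utility under the model's output distribution.
    """
    if not candidates:
        raise ValueError("candidates must not be empty")
    if len(candidates) == 1:
        return candidates[0]
    best, best_score = candidates[0], float("-inf")
    for i, hyp in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        score = sum(utility(hyp, ref) for ref in others) / len(others)
        if score > best_score:  # strict '>' keeps the first candidate on ties
            best, best_score = hyp, score
    return best


if __name__ == "__main__":
    samples = ["the cat sat down", "the cat sat", "the cat", "a dog ran"]
    print(mbr_select(samples))
```

For self-improvement as described in the abstract, the selected translations would be paired with their sources and used as fine-tuning data; that training loop is outside the scope of this sketch.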

Adam Mickiewicz University at WMT 2022: NER-Assisted and Quality-Aware Neural Machine Translation

Sep 07, 2022

Artur Nowakowski, Gabriela Pałka, Kamil Guttmann, Mikołaj Pokrywka

Abstract: This paper presents Adam Mickiewicz University's (AMU) submissions to the constrained track of the WMT 2022 General MT Task. We participated in the Ukrainian $\leftrightarrow$ Czech translation directions. The systems are a weighted ensemble of four models based on the Transformer (big) architecture. The models use source factors to utilize the information about named entities present in the input. Each of the models in the ensemble was trained using only the data provided by the shared task organizers. A noisy back-translation technique was used to augment the training corpora. One of the models in the ensemble is a document-level model, trained on parallel and synthetic longer sequences. During the sentence-level decoding process, the ensemble generated the n-best list. The n-best list was merged with the n-best list generated by a single document-level model which translated multiple sentences at a time. Finally, existing quality estimation models and minimum Bayes risk decoding were used to rerank the n-best list so that the best hypothesis was chosen according to the COMET evaluation metric. According to the automatic evaluation results, our systems rank first in both translation directions.

* WMT 2022
