Prior work on English monolingual retrieval has shown that a cross-encoder trained using a large number of relevance judgments for query-document pairs can be used as a teacher to train more efficient, but similarly effective, dual-encoder student models. Applying a similar knowledge distillation approach to training an efficient dual-encoder model for Cross-Language Information Retrieval (CLIR), where queries and documents are in different languages, is challenging due to the lack of a sufficiently large training collection when the query and document languages differ. The state of the art for CLIR thus relies on translating queries, documents, or both from the large English MS MARCO training set, an approach called Translate-Train. This paper proposes an alternative, Translate-Distill, in which knowledge distillation from either a monolingual cross-encoder or a CLIR cross-encoder is used to train a dual-encoder CLIR student model. This richer design space enables the teacher model to perform inference in an optimized setting, while training the student model directly for CLIR. Trained models and artifacts are publicly available on Huggingface.
A key stumbling block for neural cross-language information retrieval (CLIR) systems has been the paucity of training data. The appearance of the MS MARCO monolingual training set led to significant advances in the state of the art in neural monolingual retrieval. By translating the MS MARCO documents into other languages using machine translation, this resource has been made useful to the CLIR community. Yet such translation suffers from a number of problems. While MS MARCO is a large resource, it is of fixed size; its genre and domain of discourse are fixed; and the translated documents are not written in the language of a native speaker of the language, but rather in translationese. To address these problems, we introduce the JH-POLO CLIR training set creation methodology. The approach begins by selecting a pair of non-English passages. A generative large language model is then used to produce an English query for which the first passage is relevant and the second passage is not relevant. By repeating this process, collections of arbitrary size can be created in the style of MS MARCO but using naturally-occurring documents in any desired genre and domain of discourse. This paper describes the methodology in detail, shows its use in creating new CLIR training sets, and describes experiments using the newly created training data.
This is the first year of the TREC Neural CLIR (NeuCLIR) track, which aims to study the impact of neural approaches to cross-language information retrieval. The main task in this year's track was ad hoc ranked retrieval of Chinese, Persian, or Russian newswire documents using queries expressed in English. Topics were developed using standard TREC processes, except that topics developed by an annotator for one language were assessed by a different annotator when evaluating that topic on a different language. There were 172 total runs submitted by twelve teams.
* 22 pages, 13 figures, 10 tables. Part of the Thirty-First Text
REtrieval Conference (TREC 2022) Proceedings
A popular approach to creating a zero-shot cross-language retrieval model is to substitute a monolingual pretrained language model in the retrieval model with a multilingual pretrained language model such as Multilingual BERT. This multilingual model is fined-tuned to the retrieval task with monolingual data such as English MS MARCO using the same training recipe as the monolingual retrieval model used. However, such transferred models suffer from mismatches in the languages of the input text during training and inference. In this work, we propose transferring monolingual retrieval models using adapters, a parameter-efficient component for a transformer network. By adding adapters pretrained on language tasks for a specific language with task-specific adapters, prior work has shown that the adapter-enhanced models perform better than fine-tuning the entire model when transferring across languages in various NLP tasks. By constructing dense retrieval models with adapters, we show that models trained with monolingual data are more effective than fine-tuning the entire model when transferring to a Cross Language Information Retrieval (CLIR) setting. However, we found that the prior suggestion of replacing the language adapters to match the target language at inference time is suboptimal for dense retrieval models. We provide an in-depth analysis of this discrepancy between other cross-language NLP tasks and CLIR.
ColBERT-X is a dense retrieval model for Cross Language Information Retrieval (CLIR). In CLIR, documents are written in one natural language, while the queries are expressed in another. A related task is multilingual IR (MLIR) where the system creates a single ranked list of documents written in many languages. Given that ColBERT-X relies on a pretrained multilingual neural language model to rank documents, a multilingual training procedure can enable a version of ColBERT-X well-suited for MLIR. This paper describes that training procedure. An important factor for good MLIR ranking is fine-tuning XLM-R using mixed-language batches, where the same query is matched with documents in different languages in the same batch. Neural machine translations of MS MARCO passages are used to fine-tune the model.
While there are high-quality software frameworks for information retrieval experimentation, they do not explicitly support cross-language information retrieval (CLIR). To fill this gap, we have created Patapsco, a Python CLIR framework. This framework specifically addresses the complexity that comes with running experiments in multiple languages. Patapsco is designed to be extensible to many language pairs, to be scalable to large document collections, and to support reproducible experiments driven by a configuration file. We include Patapsco results on standard CLIR collections using multiple settings.
HC4 is a new suite of test collections for ad hoc Cross-Language Information Retrieval (CLIR), with Common Crawl News documents in Chinese, Persian, and Russian, topics in English and in the document languages, and graded relevance judgments. New test collections are needed because existing CLIR test collections built using pooling of traditional CLIR runs have systematic gaps in their relevance judgments when used to evaluate neural CLIR methods. The HC4 collections contain 60 topics and about half a million documents for each of Chinese and Persian, and 54 topics and five million documents for Russian. Active learning was used to determine which documents to annotate after being seeded using interactive search and judgment. Documents were judged on a three-grade relevance scale. This paper describes the design and construction of the new test collections and provides baseline results for demonstrating their utility for evaluating systems.
The advent of transformer-based models such as BERT has led to the rise of neural ranking models. These models have improved the effectiveness of retrieval systems well beyond that of lexical term matching models such as BM25. While monolingual retrieval tasks have benefited from large-scale training collections such as MS MARCO and advances in neural architectures, cross-language retrieval tasks have fallen behind these advancements. This paper introduces ColBERT-X, a generalization of the ColBERT multi-representation dense retrieval model that uses the XLM-RoBERTa (XLM-R) encoder to support cross-language information retrieval (CLIR). ColBERT-X can be trained in two ways. In zero-shot training, the system is trained on the English MS MARCO collection, relying on the XLM-R encoder for cross-language mappings. In translate-train, the system is trained on the MS MARCO English queries coupled with machine translations of the associated MS MARCO passages. Results on ad hoc document ranking tasks in several languages demonstrate substantial and statistically significant improvements of these trained dense retrieval models over traditional lexical CLIR baselines.
Entity linking -- the task of identifying references in free text to relevant knowledge base representations -- often focuses on single languages. We consider multilingual entity linking, where a single model is trained to link references to same-language knowledge bases in several languages. We propose a neural ranker architecture, which leverages multilingual transformer representations of text to be easily applied to a multilingual setting. We then explore how a neural ranker trained in one language (e.g. English) transfers to an unseen language (e.g. Chinese), and find that while there is a consistent but not large drop in performance. How can this drop in performance be alleviated? We explore adding an adversarial objective to force our model to learn language-invariant representations. We find that using this approach improves recall in several datasets, often matching the in-language performance, thus alleviating some of the performance loss occurring from zero-shot transfer.
Cross-language entity linking grounds mentions in multiple languages to a single-language knowledge base. We propose a neural ranking architecture for this task that uses multilingual BERT representations of the mention and the context in a neural network. We find that the multilingual ability of BERT leads to robust performance in monolingual and multilingual settings. Furthermore, we explore zero-shot language transfer and find surprisingly robust performance. We investigate the zero-shot degradation and find that it can be partially mitigated by a proposed auxiliary training objective, but that the remaining error can best be attributed to domain shift rather than language transfer.