Philipp Dufter

Towards a Broad Coverage Named Entity Resource: A Data-Efficient Approach for Many Diverse Languages

Jan 28, 2022
Silvia Severini, Ayyoob Imani, Philipp Dufter, Hinrich Schütze

Parallel corpora are ideal for extracting a multilingual named entity (MNE) resource, i.e., a dataset of names translated into multiple languages. Prior work on extracting MNE datasets from parallel corpora required resources such as large monolingual corpora or word aligners that are unavailable or perform poorly for underresourced languages. We present CLC-BN, a new method for creating an MNE resource, and apply it to the Parallel Bible Corpus, a corpus of more than 1000 languages. CLC-BN learns a neural transliteration model from parallel-corpus statistics, without requiring any other bilingual resources, word aligners, or seed data. Experimental results show that CLC-BN clearly outperforms prior work. We release an MNE resource for 1340 languages and demonstrate its effectiveness in two downstream tasks: knowledge graph augmentation and bilingual lexicon induction.
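
Below is a minimal, hypothetical sketch of the kind of parallel-corpus statistic such a pipeline can start from: mining candidate name pairs from verse-aligned text by pointwise mutual information, with capitalized English tokens as a crude named-entity proxy. The corpus format, filtering, and scoring are illustrative assumptions, not the CLC-BN implementation; the top-ranked pairs would then serve as noisy training data for a character-level transliteration model.

```python
# Hypothetical sketch: mining candidate name pairs from verse-aligned text via
# co-occurrence statistics. Thresholds and scoring are illustrative assumptions.
from collections import Counter
from math import log

def mine_name_pairs(verses_en, verses_tgt, min_count=3):
    """Score (English word, target word) pairs by pointwise mutual information
    over verse-level co-occurrence; capitalized English tokens serve as a
    crude named-entity proxy."""
    cooc, cnt_e, cnt_t, n = Counter(), Counter(), Counter(), 0
    for ve, vt in zip(verses_en, verses_tgt):
        n += 1
        ents = {w for w in ve.split() if w[:1].isupper()}
        tgts = set(vt.split())
        for e in ents:
            cnt_e[e] += 1
        for t in tgts:
            cnt_t[t] += 1
        for e in ents:
            for t in tgts:
                cooc[(e, t)] += 1
    pairs = []
    for (e, t), c in cooc.items():
        if c >= min_count:
            pmi = log(c * n / (cnt_e[e] * cnt_t[t]))
            pairs.append((pmi, e, t))
    return sorted(pairs, reverse=True)
```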

BERT Cannot Align Characters

Sep 20, 2021
Antonis Maronikolakis, Philipp Dufter, Hinrich Schütze

Previous work has shown that BERT can adequately align cross-lingual sentences at the word level. Here we investigate whether BERT can also operate as a character-level aligner. The languages examined are English, Fake-English, German and Greek. We show that the closer two languages are, the better BERT can align them on the character level. BERT indeed works well for English to Fake-English alignment, but this does not generalize to natural languages to the same extent. Nevertheless, the proximity of two languages does seem to be a factor: English is more closely related to German than to Greek, and this is reflected in how well BERT aligns them; English to German alignment is better than English to Greek. We examine multiple setups and show that the similarity matrices for natural languages exhibit weaker relations the further apart two languages are.
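
As a rough illustration of the setup, the sketch below reads a character alignment off a similarity matrix by mutual argmax, given character embeddings for a sentence pair (e.g., taken from an mBERT hidden layer). How the embeddings are extracted, and the paper's exact alignment procedure, are assumptions here.

```python
# Illustrative sketch: align characters of a sentence pair by mutual argmax over
# a cosine similarity matrix of character embeddings (source of embeddings assumed).
import numpy as np

def mutual_argmax_alignment(src_emb, tgt_emb):
    """src_emb: (m, d), tgt_emb: (n, d) character embeddings."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T                      # (m, n) cosine similarities
    best_tgt = sim.argmax(axis=1)          # best target char for each source char
    best_src = sim.argmax(axis=0)          # best source char for each target char
    links = [(i, j) for i, j in enumerate(best_tgt) if best_src[j] == i]
    return sim, links
```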

* Second Workshop on Insights from Negative Results, EMNLP 2021 

Locating Language-Specific Information in Contextualized Embeddings

Sep 16, 2021
Sheng Liang, Philipp Dufter, Hinrich Schütze

Multilingual pretrained language models (MPLMs) exhibit multilinguality and are well suited for transfer across languages. Most MPLMs are trained in an unsupervised fashion, and the relationship between their objective and multilinguality is unclear. More specifically, the question arises whether MPLM representations are language-agnostic or whether they simply interleave well with learned task prediction heads. In this work, we locate language-specific information in MPLMs and identify its dimensionality and the layers where this information occurs. We show that language-specific information is scattered across many dimensions but can be projected into a linear subspace. Our study contributes to a better understanding of MPLM representations, going beyond treating them as unanalyzable blobs of information.
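
One simple way to make the idea concrete (not necessarily the paper's exact procedure) is to take the span of the centered per-language mean embeddings as language-specific directions and project them out:

```python
# Hedged illustration of locating a "language subspace": span of centered
# per-language mean embeddings; the paper's exact method may differ.
import numpy as np

def language_subspace(embs, lang_ids, k=None):
    """embs: (N, d) sentence embeddings, lang_ids: length-N language labels.
    Returns an orthonormal basis (d, k) of language-specific directions."""
    labels = np.array(lang_ids)
    means = np.stack([embs[labels == l].mean(axis=0) for l in sorted(set(lang_ids))])
    means -= means.mean(axis=0)            # center the per-language means
    _, _, vt = np.linalg.svd(means, full_matrices=False)
    return vt.T if k is None else vt[:k].T

def remove_language_info(embs, basis):
    """Project embeddings onto the orthogonal complement of the subspace."""
    return embs - (embs @ basis) @ basis.T
```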

Graph Algorithms for Multiparallel Word Alignment

Sep 13, 2021
Ayyoob Imani, Masoud Jalili Sabet, Lütfi Kerem Şenel, Philipp Dufter, François Yvon, Hinrich Schütze

With the advent of end-to-end deep learning approaches in machine translation, interest in word alignments initially decreased; more recently, however, they have again become a focus of research. Alignments are useful for typological research, for transferring formatting such as markup to translated texts, and for use in the decoding of machine translation systems. At the same time, massively multilingual processing is becoming an important NLP scenario, and truly multilingual pretrained language and machine translation models are being proposed. However, most alignment algorithms rely on bitexts only and do not leverage the fact that many parallel corpora are multiparallel. In this work, we exploit the multiparallelity of corpora by representing an initial set of bilingual alignments as a graph and then predicting additional edges in the graph. We present two graph algorithms for edge prediction: one inspired by recommender systems and one based on network link prediction. Our experimental results show absolute improvements in $F_1$ of up to 28% over the baseline bilingual word aligner on different datasets.
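
The following sketch illustrates the general idea with a simple common-neighbours link-prediction score: words of all translations are nodes, initial bilingual alignments are edges, and new edges between a source and a target language are proposed when two words share enough aligned "pivot" words in other languages. The paper's two algorithms are more refined; this is an assumed simplification.

```python
# Sketch: propose new alignment edges between two languages from shared neighbours
# in the multiparallel alignment graph (simple common-neighbours link prediction).
from collections import defaultdict

def predict_edges(edges, lang_of, src_lang, tgt_lang, min_shared=2):
    """edges: iterable of (node_u, node_v) alignment links; lang_of: dict node -> language."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    scores = {}
    src_nodes = [n for n in nbrs if lang_of[n] == src_lang]
    tgt_nodes = [n for n in nbrs if lang_of[n] == tgt_lang]
    for u in src_nodes:
        for v in tgt_nodes:
            if v in nbrs[u]:
                continue                     # already aligned
            shared = len(nbrs[u] & nbrs[v])  # pivot words aligned to both
            if shared >= min_shared:
                scores[(u, v)] = shared
    return scores
```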

* EMNLP 2021 

Wine is Not v i n. -- On the Compatibility of Tokenizations Across Languages

Sep 13, 2021
Antonis Maronikolakis, Philipp Dufter, Hinrich Schütze

The size of the vocabulary is a central design choice in large pretrained language models, with respect to both performance and memory requirements. Typically, subword tokenization algorithms such as byte pair encoding and WordPiece are used. In this work, we investigate the compatibility of tokenizations for multilingual static and contextualized embedding spaces and propose a measure that reflects the compatibility of tokenizations across languages. Our goal is to prevent incompatible tokenizations, e.g., "wine" (word-level) in English vs. "v i n" (character-level) in French, which make it hard to learn good multilingual semantic representations. We show that our compatibility measure allows the system designer to create vocabularies across languages that are compatible -- a desideratum that so far has been neglected in multilingual models.
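
The abstract does not spell out the measure, so the following is a purely hypothetical stand-in: score the compatibility of two tokenizers by how similar their token counts are on word pairs from a bilingual dictionary (1.0 means the same granularity, values near 0 mean one side fragments into characters).

```python
# Hypothetical compatibility score (the paper's actual measure is not given in the
# abstract): compare subword token counts of two tokenizers on translated word pairs.
def tokenization_compatibility(word_pairs, tok_src, tok_tgt):
    """word_pairs: [(src_word, tgt_word)]; tok_*: callables returning token lists.
    Returns the mean ratio of the shorter to the longer tokenization."""
    ratios = []
    for s, t in word_pairs:
        a, b = len(tok_src(s)), len(tok_tgt(t))
        ratios.append(min(a, b) / max(a, b))
    return sum(ratios) / len(ratios)

# e.g. tokenization_compatibility([("wine", "vin")], en_tok.tokenize, fr_tok.tokenize)
# where en_tok and fr_tok are assumed, already-trained subword tokenizers.
```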

* Accepted at EMNLP 2021 Findings 

ParCourE: A Parallel Corpus Explorer for a Massively Multilingual Corpus

Jul 15, 2021
Ayyoob Imani, Masoud Jalili Sabet, Philipp Dufter, Michael Cysouw, Hinrich Schütze

With more than 7000 languages worldwide, multilingual natural language processing (NLP) is essential from both an academic and a commercial perspective. Researching the typological properties of languages is fundamental for progress in multilingual NLP; examples include assessing language similarity for effective transfer learning, injecting inductive biases into machine learning models, and creating resources such as dictionaries and inflection tables. We provide ParCourE, an online tool for browsing a word-aligned parallel corpus covering 1334 languages, and give evidence that it is useful for typological research. ParCourE can be set up for any parallel corpus and can thus be used for typological research on other corpora, as well as for exploring their quality and properties.

* ACL-IJCNLP 2021 

Static Embeddings as Efficient Knowledge Bases?

Apr 14, 2021
Philipp Dufter, Nora Kassner, Hinrich Schütze

Recent research investigates factual knowledge stored in large pretrained language models (PLMs). Instead of structured knowledge base (KB) queries, masked sentences such as "Paris is the capital of [MASK]" are used as probes. The good performance on this analysis task has been interpreted as evidence that PLMs are becoming potential repositories of factual knowledge. In experiments across ten linguistically diverse languages, we study the knowledge contained in static embeddings. We show that, when the output space is restricted to a candidate set, simple nearest-neighbor matching with static embeddings performs better than PLMs; for example, static embeddings perform 1.6 percentage points better than BERT while using only 0.3% of the energy for training. One important factor in their good comparative performance is that static embeddings are standardly learned for a large vocabulary. In contrast, BERT exploits its more sophisticated, but expensive, ability to compose meaningful representations from a much smaller subword vocabulary.
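
A minimal sketch of such typed nearest-neighbour probing is given below: the query is represented by averaging the static vectors of the subject and relation words, and only the allowed candidate objects are ranked by cosine similarity. The exact query representation used in the paper may differ.

```python
# Minimal sketch of nearest-neighbour probing with static embeddings over a
# restricted candidate set; the paper's exact query representation is assumed here.
import numpy as np

def probe(query_words, candidates, emb):
    """query_words: e.g. ["Paris", "capital"]; candidates: allowed answer strings;
    emb: dict word -> np.ndarray. Returns candidates sorted by similarity."""
    q = np.mean([emb[w] for w in query_words if w in emb], axis=0)
    q /= np.linalg.norm(q)
    scored = []
    for c in candidates:
        if c in emb:
            v = emb[c] / np.linalg.norm(emb[c])
            scored.append((float(q @ v), c))
    return [c for _, c in sorted(scored, reverse=True)]
```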

* NAACL 2021 camera-ready version; first two authors contributed equally 

Position Information in Transformers: An Overview

Feb 22, 2021
Philipp Dufter, Martin Schmitt, Hinrich Schütze

Transformers are arguably the main workhorse in recent Natural Language Processing research. Without explicit position information, a Transformer is invariant with respect to reorderings of the input; however, language is inherently sequential, and word order is essential to the semantics and syntax of an utterance. In this paper, we provide an overview of common methods for incorporating position information into Transformer models. The objectives of this survey are to i) showcase that position information in Transformers is a vibrant and extensive research area; ii) enable the reader to compare existing methods by providing a unified notation and a meaningful clustering; iii) indicate which characteristics of an application should be taken into account when selecting a position encoding; and iv) provide stimuli for future research.
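
As one concrete instance of the methods such an overview covers, the fixed sinusoidal absolute position encodings of the original Transformer (Vaswani et al., 2017) can be computed as follows (d_model is assumed to be even):

```python
# Sinusoidal absolute position encodings, added to token embeddings to break
# the Transformer's invariance to input reorderings.
import numpy as np

def sinusoidal_positions(max_len, d_model):
    """Return a (max_len, d_model) matrix of sinusoidal position encodings."""
    pos = np.arange(max_len)[:, None]                      # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                   # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)      # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                           # even dimensions
    pe[:, 1::2] = np.cos(angles)                           # odd dimensions
    return pe
```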

Multilingual LAMA: Investigating Knowledge in Multilingual Pretrained Language Models

Feb 01, 2021
Nora Kassner, Philipp Dufter, Hinrich Schütze

Recently, it has been found that monolingual English language models can be used as knowledge bases. Instead of structured knowledge base queries, masked sentences such as "Paris is the capital of [MASK]" are used as probes. We translate the established benchmarks TREx and GoogleRE into 53 languages. Working with mBERT, we investigate three questions. (i) Can mBERT be used as a multilingual knowledge base? Most prior work only considers English; extending this research to multiple languages is important for diversity and accessibility. (ii) Is mBERT's performance as a knowledge base language-independent, or does it vary from language to language? (iii) A multilingual model is trained on more text, e.g., mBERT is trained on 104 Wikipedias; can mBERT leverage this for better performance? We find that using mBERT as a knowledge base yields varying performance across languages, and that pooling predictions across languages improves performance. At the same time, mBERT exhibits a language bias; e.g., when queried in Italian, it tends to predict Italy as the country of origin.
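
A minimal illustration of the probing setup with the Hugging Face transformers library is shown below; the paper's actual evaluation uses translated templates and a restricted candidate set rather than free-form top-k predictions.

```python
# Query mBERT with cloze templates and inspect its top predictions.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-multilingual-cased")

for template in ["Paris is the capital of [MASK].",
                 "Parigi è la capitale di [MASK]."]:   # the same fact queried in Italian
    preds = fill(template, top_k=3)
    print(template, [p["token_str"] for p in preds])
```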

* Accepted to EACL 2021 