Ayyoob Imani

Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages

May 26, 2023
Ayyoob Imani, Peiqin Lin, Amir Hossein Kargaran, Silvia Severini, Masoud Jalili Sabet, Nora Kassner, Chunlan Ma, Helmut Schmid, André F. T. Martins, François Yvon, Hinrich Schütze

The NLP community has mainly focused on scaling Large Language Models (LLMs) vertically, i.e., making them better for about 100 languages. We instead scale LLMs horizontally: we create, through continued pretraining, Glot500-m, an LLM that covers 511 predominantly low-resource languages. An important part of this effort is to collect and clean Glot500-c, a corpus that covers these 511 languages and allows us to train Glot500-m. We evaluate Glot500-m on five diverse tasks across these languages. We observe large improvements for both high-resource and low-resource languages compared to an XLM-R baseline. Our analysis shows that no single factor explains the quality of multilingual LLM representations. Rather, a combination of factors determines quality including corpus size, script, "help" from related languages and the total capacity of the model. Our work addresses an important goal of NLP research: we should not limit NLP to a small fraction of the world's languages and instead strive to support as many languages as possible to bring the benefits of NLP technology to all languages and cultures. Code, data and models are available at https://github.com/cisnlp/Glot500.
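The abstract describes continued pretraining of an existing multilingual LLM so that it covers many more languages. A minimal sketch of that general recipe, assuming the Hugging Face transformers API and placeholder tokens; the actual Glot500-m vocabulary construction and training setup are described in the paper and repository, not here:

```python
# Hypothetical sketch: continued pretraining of XLM-R with an extended vocabulary.
# Token names below are placeholders, not the real Glot500 vocabulary.
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

# Add subword tokens for newly covered low-resource languages (toy list).
new_tokens = ["tok_newlang_1", "tok_newlang_2"]
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

# From here one would run standard masked-language-model training
# (e.g. with transformers.Trainer and DataCollatorForLanguageModeling)
# on the multilingual corpus.
```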

* ACL 2023 

RET-LLM: Towards a General Read-Write Memory for Large Language Models

May 23, 2023
Ali Modarressi, Ayyoob Imani, Mohsen Fayyaz, Hinrich Schütze

Large language models (LLMs) have significantly advanced the field of natural language processing (NLP) through their extensive parameters and comprehensive data utilization. However, existing LLMs lack a dedicated memory unit, limiting their ability to explicitly store and retrieve knowledge for various tasks. In this paper, we propose RET-LLM, a novel framework that equips LLMs with a general read-write memory unit, allowing them to extract, store, and recall knowledge from the text as needed for task performance. Inspired by Davidsonian semantics theory, we extract and save knowledge in the form of triplets. The memory unit is designed to be scalable, aggregatable, updatable, and interpretable. Through qualitative evaluations, we demonstrate the superiority of our proposed framework over baseline approaches in question answering tasks. Moreover, our framework exhibits robust performance in handling temporal-based question answering tasks, showcasing its ability to effectively manage time-dependent information.
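As a rough illustration of the triplet-based read-write memory described above, here is a minimal sketch with a hypothetical TripletMemory class; it is not the paper's implementation, which couples the memory with the LLM's extraction and recall steps:

```python
# Hypothetical sketch: a tiny (subject, relation, object) memory with a write/read API.
from collections import defaultdict

class TripletMemory:
    """Stores (subject, relation, object) facts and retrieves them by subject/relation."""

    def __init__(self):
        self.facts = defaultdict(list)   # (subject, relation) -> [objects]

    def write(self, subject, relation, obj):
        if obj not in self.facts[(subject, relation)]:
            self.facts[(subject, relation)].append(obj)

    def read(self, subject, relation):
        return self.facts.get((subject, relation), [])

memory = TripletMemory()
memory.write("Alan Turing", "born_in", "London")
print(memory.read("Alan Turing", "born_in"))   # ['London']
```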

Graph-Based Multilingual Label Propagation for Low-Resource Part-of-Speech Tagging

Oct 18, 2022
Ayyoob Imani, Silvia Severini, Masoud Jalili Sabet, François Yvon, Hinrich Schütze

Part-of-Speech (POS) tagging is an important component of the NLP pipeline, but many low-resource languages lack labeled data for training. An established method for training a POS tagger in such a scenario is to create a labeled training set by transferring from high-resource languages. In this paper, we propose a novel method for transferring labels from multiple high-resource source languages to low-resource target languages. We formalize POS tag projection as graph-based label propagation. Given translations of a sentence in multiple languages, we create a graph with words as nodes and alignment links as edges by aligning words for all language pairs. We then propagate node labels from source to target using a Graph Neural Network augmented with transformer layers. We show that our propagation creates training sets that allow us to train POS taggers for a diverse set of languages. When combined with enhanced contextualized embeddings, our method achieves a new state-of-the-art for unsupervised POS tagging of low-resource languages.
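The paper propagates labels with a GNN augmented with transformer layers; as a simplified stand-in, the sketch below runs plain iterative label propagation over a toy alignment graph (node names and seed labels are made up) to show how tags flow from source to target words:

```python
# Hypothetical sketch: propagating POS labels over a word-alignment graph.
import numpy as np

nodes = ["eng:dog", "deu:Hund", "fra:chien", "eng:runs", "deu:läuft"]
edges = [(0, 1), (1, 2), (0, 2), (3, 4)]          # word-alignment links
tags = ["NOUN", "VERB"]
seed = {0: "NOUN", 3: "VERB"}                     # labels from high-resource sources

# Row-normalized adjacency matrix of the alignment graph.
A = np.zeros((len(nodes), len(nodes)))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
A = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-9)

# Label distributions: seeds are one-hot, unlabeled nodes start uniform.
Y = np.full((len(nodes), len(tags)), 1.0 / len(tags))
for node, tag in seed.items():
    Y[node] = np.eye(len(tags))[tags.index(tag)]

for _ in range(20):                               # propagate, clamping the seeds
    Y = A @ Y
    for node, tag in seed.items():
        Y[node] = np.eye(len(tags))[tags.index(tag)]

print({nodes[i]: tags[int(Y[i].argmax())] for i in range(len(nodes)) if i not in seed})
```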

* EMNLP 2022 

$\Lambda$-DARTS: Mitigating Performance Collapse by Harmonizing Operation Selection among Cells

Oct 14, 2022
Sajad Movahedi, Melika Adabinejad, Ayyoob Imani, Arezou Keshavarz, Mostafa Dehghani, Azadeh Shakery, Babak N. Araabi

Differentiable neural architecture search (DARTS) is a popular method for neural architecture search (NAS), which performs cell-search and utilizes continuous relaxation to improve the search efficiency via gradient-based optimization. The main shortcoming of DARTS is performance collapse, where the discovered architecture suffers from a pattern of declining quality during search. Performance collapse has become an important topic of research, with many methods trying to solve the issue through either regularization or fundamental changes to DARTS. However, the weight-sharing framework used for cell-search in DARTS and the convergence of architecture parameters have not been analyzed yet. In this paper, we provide a thorough and novel theoretical and empirical analysis of DARTS and its point of convergence. We show that DARTS suffers from a specific structural flaw due to its weight-sharing framework that limits the convergence of DARTS to saturation points of the softmax function. This point of convergence gives an unfair advantage to layers closer to the output in choosing the optimal architecture, causing performance collapse. We then propose two new regularization terms that aim to prevent performance collapse by harmonizing operation selection via aligning gradients of layers. Experimental results on six different search spaces and three different datasets show that our method ($\Lambda$-DARTS) does indeed prevent performance collapse, providing justification for our theoretical analysis and the proposed remedy.
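The proposed remedy aligns the gradients that different cells send to the shared architecture parameters. The toy PyTorch sketch below illustrates that idea with a cosine-similarity penalty between per-cell gradient "taps"; it is an assumption-laden simplification, not the paper's exact regularization terms or supernet:

```python
# Hypothetical sketch: penalize disagreement between the gradients that individual
# cells contribute to the shared architecture parameters (toy model, made-up shapes).
import torch
import torch.nn.functional as F

def harmonization_penalty(loss, alpha_taps):
    """Encourage per-cell gradients on the shared alpha to point in similar directions."""
    grads = torch.autograd.grad(loss, alpha_taps, create_graph=True, retain_graph=True)
    flat = [g.reshape(-1) for g in grads]
    penalty, pairs = 0.0, 0
    for i in range(len(flat)):
        for j in range(i + 1, len(flat)):
            penalty = penalty - F.cosine_similarity(flat[i], flat[j], dim=0)
            pairs += 1
    return penalty / max(pairs, 1)

# Toy "supernet": three cells, each mixing its output with softmaxed shared alpha.
alpha = torch.zeros(4, requires_grad=True)            # shared architecture parameters
cells = [torch.nn.Linear(4, 4) for _ in range(3)]
x = torch.randn(8, 4)

taps, h = [], x
for cell in cells:
    tap = alpha + 0.0          # per-cell tap so its gradient can be read separately
    taps.append(tap)
    h = cell(h) * torch.softmax(tap, dim=0)
loss = h.pow(2).mean()

total = loss + 0.1 * harmonization_penalty(loss, taps)   # harmonized objective
total.backward()
print(alpha.grad)
```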

Graph Neural Networks for Multiparallel Word Alignment

Mar 16, 2022
Ayyoob Imani, Lütfi Kerem Şenel, Masoud Jalili Sabet, François Yvon, Hinrich Schütze

After a period of decrease, interest in word alignments is increasing again for their usefulness in domains such as typological research, cross-lingual annotation projection, and machine translation. Generally, alignment algorithms only use bitext and do not make use of the fact that many parallel corpora are multiparallel. Here, we compute high-quality word alignments between multiple language pairs by considering all language pairs together. First, we create a multiparallel word alignment graph, joining all bilingual word alignment pairs in one graph. Next, we use graph neural networks (GNNs) to exploit the graph structure. Our GNN approach (i) utilizes information about the meaning, position, and language of the input words, (ii) incorporates information from multiple parallel sentences, (iii) adds and removes edges from the initial alignments, and (iv) yields a prediction model that can generalize beyond the training sentences. We show that community detection provides valuable information for multiparallel word alignment. Our method outperforms previous work on three word-alignment datasets and on a downstream task.
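The abstract notes that community detection provides valuable information for multiparallel alignment. A minimal sketch of that step, assuming networkx and a toy set of bilingual alignment edges; the paper's full pipeline additionally uses a GNN over word meaning, position, and language features:

```python
# Hypothetical sketch: community detection over a multiparallel alignment graph.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy bilingual alignments for one sentence, given as ((lang, token_index), (lang, token_index)).
bilingual_alignments = [
    (("eng", 0), ("deu", 0)), (("eng", 0), ("fra", 0)),
    (("deu", 0), ("fra", 0)), (("eng", 1), ("fra", 1)),
]

graph = nx.Graph()
graph.add_edges_from(bilingual_alignments)

# Words that end up in the same community are treated as mutually aligned,
# e.g. the eng-deu-fra triangle above becomes one alignment cluster.
for community in greedy_modularity_communities(graph):
    print(sorted(community))
```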

Towards a Broad Coverage Named Entity Resource: A Data-Efficient Approach for Many Diverse Languages

Jan 28, 2022
Silvia Severini, Ayyoob Imani, Philipp Dufter, Hinrich Schütze

Parallel corpora are ideal for extracting a multilingual named entity (MNE) resource, i.e., a dataset of names translated into multiple languages. Prior work on extracting MNE datasets from parallel corpora required resources such as large monolingual corpora or word aligners that are unavailable or perform poorly for underresourced languages. We present CLC-BN, a new method for creating an MNE resource, and apply it to the Parallel Bible Corpus, a corpus of more than 1000 languages. CLC-BN learns a neural transliteration model from parallel-corpus statistics, without requiring any other bilingual resources, word aligners, or seed data. Experimental results show that CLC-BN clearly outperforms prior work. We release an MNE resource for 1340 languages and demonstrate its effectiveness in two downstream tasks: knowledge graph augmentation and bilingual lexicon induction.
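CLC-BN itself learns a neural transliteration model from parallel-corpus statistics; as a much simpler illustration of mining name-pair candidates from such statistics alone, the sketch below ranks capitalized token pairs from verse-aligned toy data by pointwise mutual information (all data are made up):

```python
# Hypothetical sketch: mining name-pair candidates from verse-aligned parallel text.
from collections import Counter
from math import log

eng_verses = ["Paul went to Rome", "Peter spoke to Paul"]
deu_verses = ["Paulus ging nach Rom", "Petrus sprach zu Paulus"]

src_counts, tgt_counts, pair_counts = Counter(), Counter(), Counter()
for eng, deu in zip(eng_verses, deu_verses):
    eng_names = {w for w in eng.split() if w[0].isupper()}
    deu_names = {w for w in deu.split() if w[0].isupper()}
    src_counts.update(eng_names)
    tgt_counts.update(deu_names)
    pair_counts.update((e, d) for e in eng_names for d in deu_names)

n = len(eng_verses)
scored = sorted(
    pair_counts,
    key=lambda p: log(pair_counts[p] * n / (src_counts[p[0]] * tgt_counts[p[1]])),
    reverse=True,
)
print(scored[:3])   # candidate name pairs, highest PMI first
```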

Graph Algorithms for Multiparallel Word Alignment

Sep 13, 2021
Ayyoob Imani, Masoud Jalili Sabet, Lütfi Kerem Şenel, Philipp Dufter, François Yvon, Hinrich Schütze

With the advent of end-to-end deep learning approaches in machine translation, interest in word alignments initially decreased; however, they have again become a focus of research more recently. Alignments are useful for typological research, transferring formatting like markup to translated texts, and can be used in the decoding of machine translation systems. At the same time, massively multilingual processing is becoming an important NLP scenario, and pretrained language and machine translation models that are truly multilingual are being proposed. However, most alignment algorithms rely on bitexts only and do not leverage the fact that many parallel corpora are multiparallel. In this work, we exploit the multiparallelity of corpora by representing an initial set of bilingual alignments as a graph and then predicting additional edges in the graph. We present two graph algorithms for edge prediction: one inspired by recommender systems and one based on network link prediction. Our experimental results show absolute improvements in $F_1$ of up to 28% over the baseline bilingual word aligner on different datasets.
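One of the two proposed directions is network link prediction over the alignment graph. A minimal sketch of that idea using the classic Adamic-Adar score from networkx; the toy graph and candidate pairs are placeholders, and the paper's own predictors differ:

```python
# Hypothetical sketch: scoring missing alignment edges with Adamic-Adar link prediction.
import networkx as nx

graph = nx.Graph()
graph.add_edges_from([
    (("eng", 0), ("deu", 0)), (("deu", 0), ("fra", 0)),
    (("eng", 1), ("deu", 1)), (("deu", 1), ("fra", 1)), (("eng", 1), ("fra", 1)),
])

# Score node pairs that are not yet linked; high scores suggest missing alignments,
# e.g. ("eng", 0)-("fra", 0) via their shared neighbor ("deu", 0).
candidates = [(("eng", 0), ("fra", 0)), (("eng", 0), ("fra", 1))]
for u, v, score in nx.adamic_adar_index(graph, candidates):
    print(u, v, round(score, 3))
```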

* EMNLP 2021 

ParCourE: A Parallel Corpus Explorer for a Massively Multilingual Corpus

Jul 15, 2021
Ayyoob Imani, Masoud Jalili Sabet, Philipp Dufter, Michael Cysouw, Hinrich Schütze

With more than 7000 languages worldwide, multilingual natural language processing (NLP) is essential both from an academic and commercial perspective. Researching typological properties of languages is fundamental for progress in multilingual NLP. Examples include assessing language similarity for effective transfer learning, injecting inductive biases into machine learning models, or creating resources such as dictionaries and inflection tables. We provide ParCourE, an online tool that allows users to browse a word-aligned parallel corpus covering 1334 languages. We give evidence that this is useful for typological research. ParCourE can be set up for any parallel corpus and can thus be used for typological research on other corpora as well as for exploring their quality and properties.

* The Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing 

Deep Neural Networks for Query Expansion using Word Embeddings

Nov 08, 2018
Ayyoob Imani, Amir Vakili, Ali Montazer, Azadeh Shakery

Query expansion is a method for alleviating the vocabulary mismatch problem present in information retrieval tasks. Previous works have shown that terms selected for query expansion by traditional methods such as pseudo-relevance feedback are not always helpful to the retrieval process. In this paper, we show that this is also true for more recently proposed embedding-based query expansion methods. We then introduce an artificial neural network classifier to predict the usefulness of query expansion terms. This classifier uses term word embeddings as inputs. Experiments on four TREC newswire and web collections show that using terms selected by the classifier for expansion significantly improves retrieval performance when compared to competitive baselines. The results are also shown to be more robust than the baselines.
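A minimal sketch of the core idea, assuming scikit-learn and synthetic features: a small feed-forward classifier takes (query, candidate term) embedding features and predicts whether adding the term would help retrieval. The feature design, network, and data here are placeholders, not the paper's setup:

```python
# Hypothetical sketch: classifying expansion-term usefulness from embedding features.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
dim = 50

# Each row: concatenation of a query embedding and a candidate term embedding;
# label 1 means the term improved retrieval when added to the query (synthetic data).
X_train = rng.normal(size=(200, 2 * dim))
y_train = rng.integers(0, 2, size=200)

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
clf.fit(X_train, y_train)

# At query time, keep only candidate terms the classifier deems useful.
candidates = rng.normal(size=(10, 2 * dim))
useful = clf.predict_proba(candidates)[:, 1] > 0.5
print(useful)
```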

* 8 pages, 1 figure 