Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Mikel Artetxe

On the Role of Parallel Data in Cross-lingual Transfer Learning

Dec 20, 2022
Machel Reid, Mikel Artetxe

Figure 1 for On the Role of Parallel Data in Cross-lingual Transfer Learning

Figure 2 for On the Role of Parallel Data in Cross-lingual Transfer Learning

Figure 3 for On the Role of Parallel Data in Cross-lingual Transfer Learning

While prior work has established that the use of parallel data is conducive for cross-lingual learning, it is unclear if the improvements come from the data itself, or if it is the modeling of parallel interactions that matters. Exploring this, we examine the usage of unsupervised machine translation to generate synthetic parallel data, and compare it to supervised machine translation and gold parallel data. We find that even model generated parallel data can be useful for downstream tasks, in both a general setting (continued pretraining) as well as the task-specific setting (translate-train), although our best results are still obtained using real parallel data. Our findings suggest that existing multilingual models do not exploit the full potential of monolingual data, and prompt the community to reconsider the traditional categorization of cross-lingual learning approaches.

* Preprint

Via

Access Paper or Ask Questions

Training Trajectories of Language Models Across Scales

Dec 19, 2022
Mengzhou Xia, Mikel Artetxe, Chunting Zhou, Xi Victoria Lin, Ramakanth Pasunuru, Danqi Chen, Luke Zettlemoyer, Ves Stoyanov

Figure 1 for Training Trajectories of Language Models Across Scales

Figure 2 for Training Trajectories of Language Models Across Scales

Figure 3 for Training Trajectories of Language Models Across Scales

Figure 4 for Training Trajectories of Language Models Across Scales

Scaling up language models has led to unprecedented performance gains, but little is understood about how the training dynamics change as models get larger. How do language models of different sizes learn during pre-training? Why do larger language models demonstrate more desirable behaviors? In this paper, we analyze the intermediate training checkpoints of differently sized OPT models (Zhang et al.,2022)--from 125M to 175B parameters--on next-token prediction, sequence-level generation, and downstream tasks. We find that 1) at a given perplexity and independent of model sizes, a similar subset of training tokens see the most significant reduction in loss, with the rest stagnating or showing double-descent behavior; 2) early in training, all models learn to reduce the perplexity of grammatical sequences that contain hallucinations, with small models halting at this suboptimal distribution and larger ones eventually learning to assign these sequences lower probabilities; 3) perplexity is a strong predictor of in-context learning performance on 74 multiple-choice tasks from BIG-Bench, and this holds independent of the model size. Together, these results show that perplexity is more predictive of model behaviors than model size or training computation.

Via

Access Paper or Ask Questions

Don't Prompt, Search! Mining-based Zero-Shot Learning with Language Models

Oct 26, 2022
Mozes van de Kar, Mengzhou Xia, Danqi Chen, Mikel Artetxe

Figure 1 for Don't Prompt, Search! Mining-based Zero-Shot Learning with Language Models

Figure 2 for Don't Prompt, Search! Mining-based Zero-Shot Learning with Language Models

Figure 3 for Don't Prompt, Search! Mining-based Zero-Shot Learning with Language Models

Figure 4 for Don't Prompt, Search! Mining-based Zero-Shot Learning with Language Models

Masked language models like BERT can perform text classification in a zero-shot fashion by reformulating downstream tasks as text infilling. However, this approach is highly sensitive to the template used to prompt the model, yet practitioners are blind when designing them in strict zero-shot settings. In this paper, we propose an alternative mining-based approach for zero-shot learning. Instead of prompting language models, we use regular expressions to mine labeled examples from unlabeled corpora, which can optionally be filtered through prompting, and used to finetune a pretrained model. Our method is more flexible and interpretable than prompting, and outperforms it on a wide range of tasks when using comparable templates. Our results suggest that the success of prompting can partly be explained by the model being exposed to similar examples during pretraining, which can be directly retrieved through regular expressions.

* EMNLP 2022

Via

Access Paper or Ask Questions

State-of-the-art generalisation research in NLP: a taxonomy and review

Oct 10, 2022
Dieuwke Hupkes, Mario Giulianelli, Verna Dankers, Mikel Artetxe, Yanai Elazar, Tiago Pimentel, Christos Christodoulopoulos, Karim Lasri, Naomi Saphra, Arabella Sinclair, Dennis Ulmer, Florian Schottmann, Khuyagbaatar Batsuren, Kaiser Sun, Koustuv Sinha, Leila Khalatbari, Maria Ryskina, Rita Frieske, Ryan Cotterell, Zhijing Jin

Figure 1 for State-of-the-art generalisation research in NLP: a taxonomy and review

Figure 2 for State-of-the-art generalisation research in NLP: a taxonomy and review

Figure 3 for State-of-the-art generalisation research in NLP: a taxonomy and review

Figure 4 for State-of-the-art generalisation research in NLP: a taxonomy and review

The ability to generalise well is one of the primary desiderata of natural language processing (NLP). Yet, what `good generalisation' entails and how it should be evaluated is not well understood, nor are there any common standards to evaluate it. In this paper, we aim to lay the ground-work to improve both of these issues. We present a taxonomy for characterising and understanding generalisation research in NLP, we use that taxonomy to present a comprehensive map of published generalisation studies, and we make recommendations for which areas might deserve attention in the future. Our taxonomy is based on an extensive literature review of generalisation research, and contains five axes along which studies can differ: their main motivation, the type of generalisation they aim to solve, the type of data shift they consider, the source by which this data shift is obtained, and the locus of the shift within the modelling pipeline. We use our taxonomy to classify over 400 previous papers that test generalisation, for a total of more than 600 individual experiments. Considering the results of this review, we present an in-depth analysis of the current state of generalisation research in NLP, and make recommendations for the future. Along with this paper, we release a webpage where the results of our review can be dynamically explored, and which we intend to up-date as new NLP generalisation studies are published. With this work, we aim to make steps towards making state-of-the-art generalisation testing the new status quo in NLP.

* 35 pages of content + 53 pages of references

Via

Access Paper or Ask Questions

Prompting ELECTRA: Few-Shot Learning with Discriminative Pre-Trained Models

Jun 08, 2022
Mengzhou Xia, Mikel Artetxe, Jingfei Du, Danqi Chen, Ves Stoyanov

Figure 1 for Prompting ELECTRA: Few-Shot Learning with Discriminative Pre-Trained Models

Figure 2 for Prompting ELECTRA: Few-Shot Learning with Discriminative Pre-Trained Models

Figure 3 for Prompting ELECTRA: Few-Shot Learning with Discriminative Pre-Trained Models

Figure 4 for Prompting ELECTRA: Few-Shot Learning with Discriminative Pre-Trained Models

Pre-trained masked language models successfully perform few-shot learning by formulating downstream tasks as text infilling. However, as a strong alternative in full-shot settings, discriminative pre-trained models like ELECTRA do not fit into the paradigm. In this work, we adapt prompt-based few-shot learning to ELECTRA and show that it outperforms masked language models in a wide range of tasks. ELECTRA is pre-trained to distinguish if a token is generated or original. We naturally extend that to prompt-based few-shot learning by training to score the originality of the target options without introducing new parameters. Our method can be easily adapted to tasks involving multi-token predictions without extra computation overhead. Analysis shows that ELECTRA learns distributions that align better with downstream tasks.

* The code is available at https://github.com/facebookresearch/ELECTRA-Fewshot-Learning

Via

Access Paper or Ask Questions

Principled Paraphrase Generation with Parallel Corpora

May 24, 2022
Aitor Ormazabal, Mikel Artetxe, Gorka Labaka, Aitor Soroa, Eneko Agirre

Figure 1 for Principled Paraphrase Generation with Parallel Corpora

Figure 2 for Principled Paraphrase Generation with Parallel Corpora

Figure 3 for Principled Paraphrase Generation with Parallel Corpora

Figure 4 for Principled Paraphrase Generation with Parallel Corpora

Round-trip Machine Translation (MT) is a popular choice for paraphrase generation, which leverages readily available parallel corpora for supervision. In this paper, we formalize the implicit similarity function induced by this approach, and show that it is susceptible to non-paraphrase pairs sharing a single ambiguous translation. Based on these insights, we design an alternative similarity metric that mitigates this issue by requiring the entire translation distribution to match, and implement a relaxation of it through the Information Bottleneck method. Our approach incorporates an adversarial term into MT training in order to learn representations that encode as much information about the reference translation as possible, while keeping as little information about the input as possible. Paraphrases can be generated by decoding back to the source from this representation, without having to generate pivot translations. In addition to being more principled and efficient than round-trip MT, our approach offers an adjustable parameter to control the fidelity-diversity trade-off, and obtains better results in our experiments.

* ACL 2022

Via

Access Paper or Ask Questions

PoeLM: A Meter- and Rhyme-Controllable Language Model for Unsupervised Poetry Generation

May 24, 2022
Aitor Ormazabal, Mikel Artetxe, Manex Agirrezabal, Aitor Soroa, Eneko Agirre

Figure 1 for PoeLM: A Meter- and Rhyme-Controllable Language Model for Unsupervised Poetry Generation

Figure 2 for PoeLM: A Meter- and Rhyme-Controllable Language Model for Unsupervised Poetry Generation

Figure 3 for PoeLM: A Meter- and Rhyme-Controllable Language Model for Unsupervised Poetry Generation

Figure 4 for PoeLM: A Meter- and Rhyme-Controllable Language Model for Unsupervised Poetry Generation

Formal verse poetry imposes strict constraints on the meter and rhyme scheme of poems. Most prior work on generating this type of poetry uses existing poems for supervision, which are difficult to obtain for most languages and poetic forms. In this work, we propose an unsupervised approach to generate poems following any given meter and rhyme scheme, without requiring any poetic text for training. Our method works by splitting a regular, non-poetic corpus into phrases, prepending control codes that describe the length and end rhyme of each phrase, and training a transformer language model in the augmented corpus. During inference, we build control codes for the desired meter and rhyme scheme, and condition our language model on them to generate formal verse poetry. Experiments in Spanish and Basque show that our approach is able to generate valid poems, which are often comparable in quality to those written by humans.

Via

Access Paper or Ask Questions

On the Role of Bidirectionality in Language Model Pre-Training

May 24, 2022
Mikel Artetxe, Jingfei Du, Naman Goyal, Luke Zettlemoyer, Ves Stoyanov

Figure 1 for On the Role of Bidirectionality in Language Model Pre-Training

Figure 2 for On the Role of Bidirectionality in Language Model Pre-Training

Figure 3 for On the Role of Bidirectionality in Language Model Pre-Training

Figure 4 for On the Role of Bidirectionality in Language Model Pre-Training

Prior work on language model pre-training has explored different architectures and learning objectives, but differences in data, hyperparameters and evaluation make a principled comparison difficult. In this work, we focus on bidirectionality as a key factor that differentiates existing approaches, and present a comprehensive study of its role in next token prediction, text infilling, zero-shot priming and fine-tuning. We propose a new framework that generalizes prior approaches, including fully unidirectional models like GPT, fully bidirectional models like BERT, and hybrid models like CM3 and prefix LM. Our framework distinguishes between two notions of bidirectionality (bidirectional context and bidirectional attention) and allows us to control each of them separately. We find that the optimal configuration is largely application-dependent (e.g., bidirectional attention is beneficial for fine-tuning and infilling, but harmful for next token prediction and zero-shot priming). We train models with up to 6.7B parameters, and find differences to remain consistent at scale. While prior work on scaling has focused on left-to-right autoregressive models, our results suggest that this approach comes with some trade-offs, and it might be worthwhile to develop very large bidirectional models.

Via

Access Paper or Ask Questions

Multilingual Machine Translation with Hyper-Adapters

May 22, 2022
Christos Baziotis, Mikel Artetxe, James Cross, Shruti Bhosale

Figure 1 for Multilingual Machine Translation with Hyper-Adapters

Figure 2 for Multilingual Machine Translation with Hyper-Adapters

Figure 3 for Multilingual Machine Translation with Hyper-Adapters

Figure 4 for Multilingual Machine Translation with Hyper-Adapters

Multilingual machine translation suffers from negative interference across languages. A common solution is to relax parameter sharing with language-specific modules like adapters. However, adapters of related languages are unable to transfer information, and their total number of parameters becomes prohibitively expensive as the number of languages grows. In this work, we overcome these drawbacks using hyper-adapters -- hyper-networks that generate adapters from language and layer embeddings. While past work had poor results when scaling hyper-networks, we propose a rescaling fix that significantly improves convergence and enables training larger hyper-networks. We find that hyper-adapters are more parameter efficient than regular adapters, reaching the same performance with up to 12 times less parameters. When using the same number of parameters and FLOPS, our approach consistently outperforms regular adapters. Also, hyper-adapters converge faster than alternative approaches and scale better than regular dense networks. Our analysis shows that hyper-adapters learn to encode language relatedness, enabling positive transfer across languages.

Via

Access Paper or Ask Questions