Éric de la Clergerie


Headless Language Models: Learning without Predicting with Contrastive Weight Tying

Sep 15, 2023
Nathan Godey, Éric de la Clergerie, Benoît Sagot

Self-supervised pre-training of language models usually consists of predicting probability distributions over extensive token vocabularies. In this study, we propose an innovative method that shifts away from probability prediction and instead focuses on reconstructing input embeddings in a contrastive fashion via Contrastive Weight Tying (CWT). We apply this approach to pretrain Headless Language Models in both monolingual and multilingual contexts. Our method offers practical advantages, substantially reducing training computational requirements by up to 20 times, while simultaneously enhancing downstream performance and data efficiency. We observe a significant +1.6 GLUE score increase and a notable +2.7 LAMBADA accuracy improvement compared to classical LMs within similar compute budgets.
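
To make the idea concrete, here is a minimal sketch of what a contrastive objective tied to the input embedding matrix could look like. This is an illustration only, not the authors' implementation: the function name, shapes, in-batch negative sampling, and temperature are all assumptions.

```python
# Hypothetical sketch of a CWT-style objective: instead of projecting hidden
# states onto the vocabulary, each output representation is trained to retrieve
# the *input embedding* of its target token among the other target embeddings
# in the batch (in-batch negatives). Repeated tokens in a batch become false
# negatives; handling that is out of scope for this toy version.
import torch
import torch.nn.functional as F

def cwt_loss(hidden_states, target_token_ids, embedding_weight, temperature=0.1):
    # hidden_states:    (batch, seq, dim)  outputs of the headless transformer
    # target_token_ids: (batch, seq)       token each position should predict
    # embedding_weight: (vocab, dim)       the (tied) input embedding matrix
    h = hidden_states.reshape(-1, hidden_states.size(-1))                   # (N, dim)
    targets = embedding_weight[target_token_ids].reshape(-1, h.size(-1))    # (N, dim)

    h = F.normalize(h, dim=-1)
    targets = F.normalize(targets, dim=-1)

    # Similarity of every output vector to every target embedding in the batch.
    logits = h @ targets.t() / temperature                                   # (N, N)

    # The positive for row i is the target embedding at the same position i.
    labels = torch.arange(h.size(0), device=h.device)
    return F.cross_entropy(logits, labels)
```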


Is Anisotropy Inherent to Transformers?

Jun 13, 2023
Nathan Godey, Éric de la Clergerie, Benoît Sagot

The representation degeneration problem is a phenomenon that is widely observed among self-supervised learning methods based on Transformers. In NLP, it takes the form of anisotropy, a singular property of hidden representations which makes them unexpectedly close to each other in terms of angular distance (cosine similarity). Some recent works tend to show that anisotropy is a consequence of optimizing the cross-entropy loss on long-tailed distributions of tokens. We show in this paper that anisotropy can also be observed empirically in language models with specific objectives that should not suffer directly from the same consequences. We also show that the anisotropy problem extends to Transformers trained on other modalities. Our observations tend to demonstrate that anisotropy might actually be inherent to Transformer-based models.

* ACL-SRW 2023 (Poster) 
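
For readers who want to check this on their own models, anisotropy is commonly estimated as the expected cosine similarity between hidden representations of unrelated tokens. The sketch below assumes you have already collected a matrix of hidden states from some Transformer layer; how they are collected is left to the reader and is not taken from the paper.

```python
# Minimal sketch: estimate anisotropy as the mean pairwise cosine similarity
# between hidden representations (excluding self-similarities). A value close
# to 0 indicates a roughly isotropic space; values near 1 indicate a narrow cone.
import torch
import torch.nn.functional as F

def anisotropy(hidden_states: torch.Tensor) -> float:
    # hidden_states: (n_tokens, dim), e.g. last-layer vectors over a corpus sample
    h = F.normalize(hidden_states, dim=-1)
    sims = h @ h.t()                                   # pairwise cosine similarities
    n = h.size(0)
    off_diagonal = sims.sum() - sims.diagonal().sum()  # drop the self-similarities
    return (off_diagonal / (n * (n - 1))).item()
```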

MANTa: Efficient Gradient-Based Tokenization for Robust End-to-End Language Modeling

Dec 14, 2022
Nathan Godey, Roman Castagné, Éric de la Clergerie, Benoît Sagot

Static subword tokenization algorithms have been an essential component of recent works on language modeling. However, their static nature results in significant flaws that degrade the models' downstream performance and robustness. In this work, we propose MANTa, a Module for Adaptive Neural TokenizAtion. MANTa is a differentiable tokenizer trained end-to-end with the language model. The resulting system offers a trade-off between the expressiveness of byte-level models and the speed of models trained using subword tokenization. In addition, our tokenizer is highly explainable since it produces an explicit segmentation of sequences into blocks. We evaluate our pre-trained model on several English datasets from different domains as well as on synthetic noise. We find that MANTa improves robustness to character perturbations and out-of-domain data. We then show that MANTa performs comparably to other models on the general-domain GLUE benchmark. Finally, we show that it is considerably faster than strictly byte-level models.

* EMNLP 2022 Findings (https://aclanthology.org/2022.findings-emnlp.207/) 
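
As a very loose illustration of what a differentiable, gradient-based tokenizer can look like, the toy module below pools byte embeddings into soft blocks using predicted boundary probabilities. It is an assumption-laden sketch of the general idea, not the MANTa architecture; layer choices, the kernel, and the fixed block budget are all made up for the example.

```python
# Toy sketch of differentiable byte-to-block pooling: a small head predicts a
# "block boundary" probability for every byte, the cumulative sum of those
# probabilities gives each byte an expected block index, and a soft triangular
# kernel turns that into a byte-to-block assignment matrix used to pool byte
# embeddings into block embeddings.
import torch
import torch.nn as nn

class SoftBlockPooler(nn.Module):
    def __init__(self, dim: int, max_blocks: int = 64):
        super().__init__()
        self.boundary_head = nn.Linear(dim, 1)   # per-byte boundary probability
        self.max_blocks = max_blocks

    def forward(self, byte_embeddings: torch.Tensor) -> torch.Tensor:
        # byte_embeddings: (batch, n_bytes, dim)
        p_boundary = torch.sigmoid(self.boundary_head(byte_embeddings)).squeeze(-1)
        expected_block = torch.cumsum(p_boundary, dim=1)            # (batch, n_bytes)

        block_ids = torch.arange(self.max_blocks, device=byte_embeddings.device)
        # Soft assignment of each byte to each block (triangular kernel).
        dist = (expected_block.unsqueeze(-1) - block_ids).abs()     # (batch, n_bytes, max_blocks)
        assign = torch.clamp(1.0 - dist, min=0.0)
        assign = assign / assign.sum(dim=1, keepdim=True).clamp(min=1e-6)

        # Pool byte embeddings into block embeddings.
        return assign.transpose(1, 2) @ byte_embeddings             # (batch, max_blocks, dim)
```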

Clustering-based Automatic Construction of Legal Entity Knowledge Base from Contracts

Dec 07, 2020
Fuqi Song, Éric de la Clergerie

In contract analysis and contract automation, a knowledge base (KB) of legal entities is fundamental for performing tasks such as contract verification, contract generation and contract analytics. However, such a KB does not always exist, nor can it be produced in a short time. In this paper, we propose a clustering-based approach to automatically generate a reliable knowledge base of legal entities from given contracts without any supplemental references. The proposed method is robust to different types of errors introduced by pre-processing steps such as Optical Character Recognition (OCR) and Named Entity Recognition (NER), as well as editing errors such as typos. We evaluate our method on a dataset that consists of 800 real contracts with various qualities from 15 clients. Compared to the collected ground-truth data, our method is able to recall 84% of the knowledge.

* 4 pages, 3 figures 
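
A minimal sketch of the general recipe described above (cluster noisy entity mentions by string similarity, then keep one canonical form per cluster) could look like the following. The similarity measure, threshold, and library choices here are assumptions for illustration, not the paper's exact pipeline.

```python
# Sketch: cluster noisy entity mentions (OCR/NER/typo variants) by normalized
# string similarity, then keep the most frequent surface form of each cluster
# as the canonical knowledge-base entry.
from collections import Counter
from difflib import SequenceMatcher

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def build_entity_kb(mentions, distance_threshold=0.25):
    unique = sorted(set(mentions))
    n = len(unique)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            sim = SequenceMatcher(None, unique[i].lower(), unique[j].lower()).ratio()
            dist[i, j] = dist[j, i] = 1.0 - sim

    # Average-linkage hierarchical clustering, cut at the distance threshold.
    labels = fcluster(linkage(squareform(dist), method="average"),
                      t=distance_threshold, criterion="distance")

    counts = Counter(mentions)
    kb = {}
    for cluster_id in set(labels):
        members = [unique[i] for i in range(n) if labels[i] == cluster_id]
        canonical = max(members, key=lambda m: counts[m])   # most frequent variant
        kb[canonical] = members
    return kb

print(build_entity_kb(["ACME Corp.", "ACME Corp", "ACME Corp.", "AMCE Corp", "Globex SA"]))
```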

Multilingual Unsupervised Sentence Simplification

May 01, 2020
Louis Martin, Angela Fan, Éric de la Clergerie, Antoine Bordes, Benoît Sagot

Progress in Sentence Simplification has been hindered by the lack of supervised data, particularly in languages other than English. Previous work has aligned sentences from original and simplified corpora such as English Wikipedia and Simple English Wikipedia, but this limits corpus size, domain, and language. In this work, we propose using unsupervised mining techniques to automatically create training corpora for simplification in multiple languages from raw Common Crawl web data. When coupled with a controllable generation mechanism that can flexibly adjust attributes such as length and lexical complexity, these mined paraphrase corpora can be used to train simplification systems in any language. We further incorporate multilingual unsupervised pretraining methods to create even stronger models and show that by training on mined data rather than supervised corpora, we outperform the previous best results. We evaluate our approach on English, French, and Spanish simplification benchmarks and reach state-of-the-art performance with a totally unsupervised approach. We will release our models and code to mine the data in any language included in Common Crawl.
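
The mining step can be approximated, for illustration, as nearest-neighbour search over sentence embeddings: keep pairs that are similar enough to be paraphrases but not near-duplicates. The sketch below assumes precomputed, L2-normalized embeddings from some multilingual sentence encoder; the thresholds and the brute-force search are placeholders, not the paper's setup.

```python
# Toy paraphrase mining: brute-force cosine similarity over precomputed
# sentence embeddings, keeping pairs inside a "similar but not identical" band.
import numpy as np

def mine_paraphrase_pairs(embeddings, sentences, low=0.75, high=0.95):
    # embeddings: (n, dim) L2-normalized sentence vectors; sentences: list of n strings
    sims = embeddings @ embeddings.T
    pairs = []
    n = len(sentences)
    for i in range(n):
        for j in range(i + 1, n):
            if low <= sims[i, j] <= high:   # similar, but not a copy
                pairs.append((sentences[i], sentences[j], float(sims[i, j])))
    return pairs
```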


Controllable Sentence Simplification

Oct 16, 2019
Louis Martin, Benoît Sagot, Éric de la Clergerie, Antoine Bordes

Text simplification aims at making a text easier to read and understand by simplifying grammar and structure while keeping the underlying information identical. It is often considered an all-purpose generic task where the same simplification is suitable for all; however, multiple audiences can benefit from simplified text in different ways. We adapt a discrete parametrization mechanism that provides explicit control on simplification systems based on Sequence-to-Sequence models. As a result, users can condition the simplifications returned by a model on parameters such as length, amount of paraphrasing, lexical complexity, and syntactic complexity. We also show that carefully chosen values of these parameters allow out-of-the-box Sequence-to-Sequence models to outperform their standard counterparts on simplification benchmarks. Our model, which we call ACCESS (as shorthand for AudienCe-CEntric Sentence Simplification), increases the state of the art to 41.87 SARI on the WikiLarge test set, a +1.42 gain over previously reported scores.

* Code and models: https://github.com/facebookresearch/access 
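
For intuition only, the discrete parametrization described above typically boils down to prepending control tokens encoding target attributes to the source sentence, so a standard Sequence-to-Sequence model learns to condition on them. The token names and ratios below are invented for the example; see the linked facebookresearch/access repository for the actual control tokens.

```python
# Illustrative sketch: encode desired attributes as special tokens prepended
# to the input, e.g. a target length ratio and a lexical complexity ratio.
def add_control_tokens(source: str, length_ratio: float, lexical_ratio: float) -> str:
    controls = [
        f"<LENGTH_{length_ratio:.2f}>",    # desired output/input character ratio
        f"<LEXICAL_{lexical_ratio:.2f}>",  # desired lexical complexity ratio
    ]
    return " ".join(controls + [source])

print(add_control_tokens("The quick brown fox jumps over the lazy dog.", 0.8, 0.75))
```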