Marius Mosbach

Fusing Sentence Embeddings Into LSTM-based Autoregressive Language Models

Aug 05, 2022

Vilém Zouhar, Marius Mosbach, Dietrich Klakow

Abstract: Although masked language models are highly performant and widely adopted by NLP practitioners, they cannot easily be used for autoregressive language modelling (next-word prediction and sequence probability estimation). We present an LSTM-based autoregressive language model which uses prefix embeddings (from a pretrained masked language model) via fusion (e.g. concatenation) to obtain a richer context representation for language modelling. We find that fusion reliably lowers perplexity (16.74 → 15.80), and that this improvement is preserved after transfer to a dataset from a different domain than the training data. We also evaluate the best-performing fusion model by correlating its next-word surprisal estimates with human reading times. Contrary to our expectation, and despite the overall improvement in perplexity, the correlation remains the same as for the baseline model. Lastly, while we focus on language models pre-trained on text as the sources for the fusion, our approach could be extended to fuse any information represented as a fixed-size vector into an autoregressive language model, e.g. sentence-external information retrieved from a knowledge base or representations from multi-modal encoders.

* Submitted to PBML. Code & experiment repository: https://github.com/zouharvi/sentence-embd-fusion
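
To make the quantities in this abstract concrete, here is a minimal Python sketch of fusion by concatenation, perplexity, and surprisal. It is an illustration under assumptions, not the authors' implementation: the function names and the toy vectors are hypothetical, and the actual model lives in the linked repository.

```python
import math

def fuse_concat(hidden_state, prefix_embedding):
    """Fusion by concatenation: append a fixed-size prefix embedding
    (e.g. from a masked LM) to the LSTM hidden state."""
    return list(hidden_state) + list(prefix_embedding)

def perplexity(token_probs):
    """Perplexity = exp of the mean negative log-probability that the
    model assigned to each observed next token."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)

def surprisal_bits(prob):
    """Surprisal of a token in bits; this is the per-word quantity that
    is typically correlated with human reading times."""
    return -math.log2(prob)

# Toy values (hypothetical): a 2-dim hidden state and a 3-dim embedding.
fused = fuse_concat([0.1, 0.2], [1.0, 2.0, 3.0])
print(len(fused))                # dimensionality seen by the output layer
print(perplexity([0.25] * 4))    # uniform over 4 choices
print(surprisal_bits(0.5))
```

A lower perplexity means the model assigns, on average, a higher probability to the observed next tokens, which is why the abstract reports it as the main intrinsic metric.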

Measuring Causal Effects of Data Statistics on Language Model's `Factual' Predictions

Jul 28, 2022

Yanai Elazar, Nora Kassner, Shauli Ravfogel, Amir Feder, Abhilasha Ravichander, Marius Mosbach, Yonatan Belinkov, Hinrich Schütze, Yoav Goldberg

Abstract: Large amounts of training data are one of the major reasons for the high performance of state-of-the-art NLP models. But what exactly in the training data causes a model to make a certain prediction? We seek to answer this question by providing a language for describing how training data influences predictions, through a causal framework. Importantly, our framework bypasses the need to retrain expensive models and allows us to estimate causal effects based on observational data alone. Addressing the problem of extracting factual knowledge from pretrained language models (PLMs), we focus on simple data statistics such as co-occurrence counts and show that these statistics do influence the predictions of PLMs, suggesting that such models rely on shallow heuristics. Our causal framework and our results demonstrate the importance of studying datasets and the benefits of causality for understanding NLP models.
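
As an illustration of the "simple data statistics" mentioned above, the following Python sketch counts how often pairs of entities co-occur in the same sentence. It only shows the statistic itself; the causal estimation described in the abstract is not reproduced here, and the function name, whitespace tokenization and toy corpus are assumptions.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sentences, entities):
    """Count, for every pair of entities, the number of sentences in
    which both appear (naive whitespace tokenization, an assumption)."""
    counts = Counter()
    for sentence in sentences:
        tokens = set(sentence.lower().split())
        present = sorted(e for e in entities if e.lower() in tokens)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts

# Hypothetical toy corpus.
corpus = [
    "Paris is the capital of France",
    "France and Paris",
    "Berlin is in Germany",
]
counts = cooccurrence_counts(corpus, {"Paris", "France", "Berlin"})
print(counts[("France", "Paris")])
print(counts[("Berlin", "France")])
```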

StereoKG: Data-Driven Knowledge Graph Construction for Cultural Knowledge and Stereotypes

May 27, 2022

Awantee Deshpande, Dana Ruiter, Marius Mosbach, Dietrich Klakow

Abstract: Analyzing ethnic or religious bias is important for improving fairness, accountability, and transparency of natural language processing models. However, many techniques rely on human-compiled lists of bias terms, which are expensive to create and are limited in coverage. In this study, we present a fully data-driven pipeline for generating a knowledge graph (KG) of cultural knowledge and stereotypes. Our resulting KG covers 5 religious groups and 5 nationalities and can easily be extended to include more entities. Our human evaluation shows that the majority (59.2%) of non-singleton entries are coherent and complete stereotypes. We further show that performing intermediate masked language model training on the verbalized KG leads to a higher level of cultural awareness in the model and has the potential to increase classification performance on knowledge-crucial samples on a related task, i.e., hate speech detection.

* 12 pages, 2 figures, accepted as a long paper at WOAH at NAACL 2022

MCSE: Multimodal Contrastive Learning of Sentence Embeddings

Apr 22, 2022

Miaoran Zhang, Marius Mosbach, David Ifeoluwa Adelani, Michael A. Hedderich, Dietrich Klakow

Abstract: Learning semantically meaningful sentence embeddings is an open problem in natural language processing. In this work, we propose a sentence embedding learning approach that exploits both visual and textual information via a multimodal contrastive objective. Through experiments on a variety of semantic textual similarity tasks, we demonstrate that our approach consistently improves the performance across various datasets and pre-trained encoders. In particular, combining a small amount of multimodal data with a large text-only corpus, we improve the state-of-the-art average Spearman's correlation by 1.7%. By analyzing the properties of the textual embedding space, we show that our model excels in aligning semantically similar sentences, providing an explanation for its improved performance.

* Accepted by NAACL 2022 main conference (short paper), 11 pages
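
The abstract does not spell out the contrastive objective, so the sketch below assumes a common InfoNCE-style formulation in which each sentence embedding should be most similar to the embedding of its paired image and less similar to the other images in the batch. The function names, the temperature value and the toy vectors are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(sentence_embs, image_embs, temperature=0.05):
    """InfoNCE-style loss (assumed form): for sentence i, image i is the
    positive and all other images in the batch act as negatives."""
    losses = []
    for i, sent in enumerate(sentence_embs):
        logits = [cosine(sent, img) / temperature for img in image_embs]
        top = max(logits)  # subtract the max for numerical stability
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy batch: aligned pairs should give a much lower loss than swapped pairs.
sentences = [[1.0, 0.0], [0.0, 1.0]]
aligned_images = [[1.0, 0.0], [0.0, 1.0]]
swapped_images = [[0.0, 1.0], [1.0, 0.0]]
print(info_nce(sentences, aligned_images))
print(info_nce(sentences, swapped_images))
```

Minimizing such a loss pulls paired sentence and image embeddings together while pushing apart unpaired ones, which is the general mechanism behind contrastive alignment.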

Knowledge Base Index Compression via Dimensionality and Precision Reduction

Apr 18, 2022

Vilém Zouhar, Marius Mosbach, Miaoran Zhang, Dietrich Klakow

Abstract: Recently, neural-network-based approaches to knowledge-intensive NLP tasks, such as question answering, have started to rely heavily on the combination of neural retrievers and readers. Retrieval is typically performed over a large textual knowledge base (KB), which requires significant memory and compute resources, especially when scaled up. On HotpotQA, we systematically investigate reducing the size of the KB index by means of dimensionality reduction (sparse random projections, PCA, autoencoders) and numerical precision reduction. Our results show that PCA is an easy solution that requires very little data and is only slightly worse than autoencoders, which are less stable. All methods are sensitive to pre- and post-processing, and data should always be centered and normalized both before and after dimension reduction. Finally, we show that it is possible to combine PCA with using 1 bit per dimension. Overall, we achieve (1) 100× compression with 75% and (2) 24× compression with 92% of the original retrieval performance.

* To be presented at Spa-NLP workshop at ACL 2022
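
The pre- and post-processing steps and the precision reduction named in this abstract can be sketched in plain Python as follows. This is an assumption-laden illustration: the PCA projection itself is omitted (a library implementation would normally be used between the two processing passes), and the sign-based 1-bit quantizer, the helper names and the example dimensions are hypothetical rather than taken from the paper.

```python
import math

def center(vectors):
    """Subtract the per-dimension mean from every vector."""
    dim = len(vectors[0])
    mean = [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]
    return [[v[j] - mean[j] for j in range(dim)] for v in vectors]

def l2_normalize(vectors):
    """Scale every vector to unit L2 norm (zero vectors are left as-is)."""
    out = []
    for v in vectors:
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        out.append([x / norm for x in v])
    return out

def one_bit(vectors):
    """1 bit per dimension: keep only the sign of each component
    (assumed quantizer)."""
    return [[1 if x > 0 else 0 for x in v] for v in vectors]

def compression_ratio(orig_dim, orig_bits, new_dim, new_bits):
    """Ratio of index sizes before and after compression."""
    return (orig_dim * orig_bits) / (new_dim * new_bits)

index = [[1.0, 4.0], [3.0, 0.0]]
processed = l2_normalize(center(index))
print(one_bit(processed))
# Hypothetical configuration: 768 float32 dims reduced to 128 8-bit dims.
print(compression_ratio(768, 32, 128, 8))
```

The ratio function makes the trade-off explicit: fewer dimensions and fewer bits per dimension multiply, so moderate reductions in each can combine into a large overall saving.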

Multilingual Language Model Adaptive Fine-Tuning: A Study on African Languages

Apr 13, 2022

Jesujoba O. Alabi, David Ifeoluwa Adelani, Marius Mosbach, Dietrich Klakow

Abstract: Multilingual pre-trained language models (PLMs) have demonstrated impressive performance on several downstream tasks for both high-resourced and low-resourced languages. However, there is still a large performance drop for languages unseen during pre-training, especially African languages. One of the most effective approaches to adapt to a new language is language adaptive fine-tuning (LAFT) -- fine-tuning a multilingual PLM on monolingual texts of a language using the same pre-training objective. However, African languages with large monolingual texts are few, and adapting to each of them individually takes large disk space and limits the cross-lingual transfer abilities of the resulting models because they have been specialized for a single language. In this paper, we perform multilingual adaptive fine-tuning (MAFT) on the 17 most-resourced African languages and three other high-resource languages widely spoken on the African continent -- English, French, and Arabic -- to encourage cross-lingual transfer learning. Additionally, to further specialize the multilingual PLM, we removed vocabulary tokens from the embedding layer that correspond to non-African writing scripts before MAFT, thus reducing the model size by around 50%. Our evaluation on two multilingual PLMs (AfriBERTa and XLM-R) and three NLP tasks (NER, news topic classification, and sentiment classification) shows that our approach is competitive with applying LAFT on individual languages while requiring significantly less disk space. Finally, we show that our adapted PLM also improves the zero-shot cross-lingual transfer abilities of parameter-efficient fine-tuning methods.

* Accepted to AfricaNLP 2022 (non-archival)
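
The vocabulary reduction step described above can be illustrated with a small Python sketch that keeps only tokens written in a chosen set of scripts and reassigns contiguous ids. The abstract does not say which scripts were kept or how they were detected, so the script list and the Unicode-name heuristic below are assumptions, and the function names are hypothetical.

```python
import unicodedata

# Assumption: scripts to keep, matched against Unicode character names.
KEPT_SCRIPTS = ("LATIN", "ARABIC", "ETHIOPIC")

def token_in_kept_scripts(token):
    """True if every alphabetic character in the token belongs to one of
    the kept scripts; digits, punctuation and markers are ignored."""
    for ch in token:
        if not ch.isalpha():
            continue
        if not unicodedata.name(ch, "").startswith(KEPT_SCRIPTS):
            return False
    return True

def prune_vocab(vocab):
    """Drop tokens from other scripts and map the rest to new ids.
    The embedding rows of the kept tokens would be copied accordingly."""
    kept = [tok for tok in vocab if token_in_kept_scripts(tok)]
    return {tok: new_id for new_id, tok in enumerate(kept)}

vocab = ["hello", "中文", "ሰላም", "привет", "123"]
print(prune_vocab(vocab))
```

Because the embedding matrix has one row per vocabulary entry, removing tokens in this way shrinks the model directly, which is the effect the abstract reports.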

Artefact Retrieval: Overview of NLP Models with Knowledge Base Access

Jan 24, 2022

Vilém Zouhar, Marius Mosbach, Debanjali Biswas, Dietrich Klakow

Abstract: Many NLP models gain performance by having access to a knowledge base. A lot of research has been devoted to devising and improving the way the knowledge base is accessed and incorporated into the model, resulting in a number of mechanisms and pipelines. Despite the diversity of proposed mechanisms, there are patterns in the designs of such systems. In this paper, we systematically describe the typology of artefacts (items retrieved from a knowledge base), retrieval mechanisms and the way these artefacts are fused into the model. This further allows us to uncover combinations of design decisions that had not yet been tried. Most of the focus is given to language models, though we also show how question answering, fact-checking and knowledgeable dialogue models fit into this system as well. Having an abstract model which can describe the architecture of specific models also helps with transferring these architectures between multiple NLP tasks.

* 11 pages of main content, 7 pages of appendix; presented at AKBC CSRR 2021

Do Acoustic Word Embeddings Capture Phonological Similarity? An Empirical Study

Jun 16, 2021

Badr M. Abdullah, Marius Mosbach, Iuliia Zaitova, Bernd Möbius, Dietrich Klakow

Abstract: Several variants of deep neural networks have been successfully employed for building parametric models that project variable-duration spoken word segments onto fixed-size vector representations, or acoustic word embeddings (AWEs). However, it remains unclear to what degree we can rely on the distance in the emerging AWE space as an estimate of word-form similarity. In this paper, we ask: does the distance in the acoustic embedding space correlate with phonological dissimilarity? To answer this question, we empirically investigate the performance of supervised approaches for AWEs with different neural architectures and learning objectives. We train AWE models in controlled settings for two languages (German and Czech) and evaluate the embeddings on two tasks: word discrimination and phonological similarity. Our experiments show that (1) the distance in the embedding space in the best cases only moderately correlates with phonological distance, and (2) improving the performance on the word discrimination task does not necessarily yield models that better reflect word phonological similarity. Our findings highlight the necessity to rethink the current intrinsic evaluations for AWEs.

* Accepted in Interspeech 2021
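
The question this abstract asks can be made concrete with a small Python sketch that measures how well one list of distances tracks another using Spearman's rank correlation. The abstract does not state which phonological distance is used, so edit distance over phoneme strings is an assumption here, and all function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (e.g. phoneme strings)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical word pairs: embedding distances vs. phonological distances.
embedding_distances = [0.10, 0.45, 0.80]
phonological_distances = [
    edit_distance("kat", "kat"),
    edit_distance("kat", "kap"),
    edit_distance("kat", "dog"),
]
print(spearman(embedding_distances, phonological_distances))
```

A value close to 1 would mean that word pairs far apart in the embedding space are also phonologically dissimilar; the abstract reports only a moderate correlation in the best cases.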

A Closer Look at Linguistic Knowledge in Masked Language Models: The Case of Relative Clauses in American English

Nov 02, 2020

Marius Mosbach, Stefania Degaetano-Ortlieb, Marie-Pauline Krielke, Badr M. Abdullah, Dietrich Klakow

Abstract: Transformer-based language models achieve high performance on various tasks, but we still lack understanding of the kind of linguistic knowledge they learn and rely on. We evaluate three models (BERT, RoBERTa, and ALBERT), testing their grammatical and semantic knowledge by sentence-level probing, diagnostic cases, and masked prediction tasks. We focus on relative clauses (in American English) as a complex phenomenon needing contextual information and antecedent identification to be resolved. Based on a naturalistic dataset, probing shows that all three models indeed capture linguistic knowledge about grammaticality, achieving high performance. Evaluation on diagnostic cases and masked prediction tasks considering fine-grained linguistic knowledge, however, shows pronounced model-specific weaknesses, especially on semantic knowledge, strongly impacting models' performance. Our results highlight the importance of (a) model comparison in evaluation tasks and (b) building up claims of model performance and the linguistic knowledge they capture beyond purely probing-based evaluations.

* Accepted to COLING 2020

Fusion Models for Improved Visual Captioning

Oct 28, 2020

Marimuthu Kalimuthu, Aditya Mogadala, Marius Mosbach, Dietrich Klakow

Abstract: Visual captioning aims to generate textual descriptions given images. Traditionally, captioning models are trained on human-annotated datasets such as Flickr30k and MS-COCO, which are limited in size and diversity. This limitation hinders the generalization capabilities of these models and makes them prone to mistakes. Language models can, however, be trained on vast amounts of freely available unlabelled data and have recently emerged as successful language encoders and coherent text generators. Meanwhile, several unimodal and multimodal fusion techniques have been proven to work well for natural language generation and automatic speech recognition. Building on these recent developments, and with the aim of improving the quality of generated captions, the contribution of this paper is two-fold. First, we propose a generic multimodal model fusion framework for caption generation as well as emendation, where we utilize different fusion strategies to integrate a pretrained Auxiliary Language Model (AuxLM) within the traditional encoder-decoder visual captioning frameworks. Second, we employ the same fusion strategies to integrate a pretrained Masked Language Model (MLM), namely BERT, with a visual captioning model, viz. Show, Attend, and Tell, for emending both syntactic and semantic errors in captions. Our caption emendation experiments on three benchmark image captioning datasets, viz. Flickr8k, Flickr30k, and MS-COCO, show improvements over the baseline, indicating the usefulness of our proposed multimodal fusion strategies. Further, we perform a preliminary qualitative analysis on the emended captions and identify error categories based on the type of corrections.

* Under review at "Multi-Modal Deep Learning: Challenges and Applications", ICPR-2020
