Dieuwke Hupkes

Interpretability of Language Models via Task Spaces

Jun 10, 2024

Lucas Weber, Jaap Jumelet, Elia Bruni, Dieuwke Hupkes

Abstract: The usual way to interpret language models (LMs) is to test their performance on different benchmarks and subsequently infer their internal processes. In this paper, we present an alternative approach, concentrating on the quality of LM processing, with a focus on their language abilities. To this end, we construct 'linguistic task spaces' -- representations of an LM's language conceptualisation -- that shed light on the connections LMs draw between language phenomena. Task spaces are based on the interactions of the learning signals from different linguistic phenomena, which we assess via a method we call 'similarity probing'. To disentangle the learning signals of linguistic phenomena, we further introduce a method called 'fine-tuning via gradient differentials' (FTGD). We apply our methods to language models of three different scales and find that larger models generalise better to overarching general concepts for linguistic tasks, making better use of their shared structure. Further, the distributedness of linguistic processing increases with pre-training through increased parameter sharing between related linguistic tasks. The overall generalisation patterns are mostly stable throughout training and not marked by incisive stages, potentially explaining the lack of successful curriculum strategies for LMs.

* To be published at ACL 2024 (main)

From Form to Meaning: Probing the Semantic Depths of Language Models Using Multisense Consistency

Apr 18, 2024

Xenia Ohmer, Elia Bruni, Dieuwke Hupkes

Abstract: The staggering pace with which the capabilities of large language models (LLMs) are increasing, as measured by a range of commonly used natural language understanding (NLU) benchmarks, raises many questions regarding what "understanding" means for a language model and how it compares to human understanding. This is especially true since many LLMs are exclusively trained on text, casting doubt on whether their stellar benchmark performances are reflective of a true understanding of the problems represented by these benchmarks, or whether LLMs simply excel at uttering textual forms that correlate with what someone who understands the problem would say. In this philosophically inspired work, we aim to create some separation between form and meaning, with a series of tests that leverage the idea that world understanding should be consistent across presentational modes - inspired by Fregean senses - of the same meaning. Specifically, we focus on consistency across languages as well as paraphrases. Taking GPT-3.5 as our object of study, we evaluate multisense consistency across five different languages and various tasks. We start the evaluation in a controlled setting, asking the model for simple facts, and then proceed with an evaluation on four popular NLU benchmarks. We find that the model's multisense consistency is lacking and run several follow-up analyses to verify that this lack of consistency is due to a sense-dependent task understanding. We conclude that, in this aspect, the understanding of LLMs is still quite far from being consistent and human-like, and deliberate on how this impacts their utility in the context of learning about human language and understanding.

The ICL Consistency Test

Dec 08, 2023

Lucas Weber, Elia Bruni, Dieuwke Hupkes

Abstract: Just like the previous generation of task-tuned models, large language models (LLMs) that are adapted to tasks via prompt-based methods like in-context learning (ICL) perform well in some setups but not in others. This lack of consistency in prompt-based learning hints at a lack of robust generalisation. We here introduce the ICL consistency test -- a contribution to the GenBench collaborative benchmark task (CBT) -- which evaluates how consistently a model makes predictions across many different setups while using the same data. The test is based on different established natural language inference tasks. We provide preprocessed data constituting 96 different 'setups' and a metric that estimates model consistency across these setups. The metric is provided on a fine-grained level to understand what properties of a setup render predictions unstable and on an aggregated level to compare overall model consistency. We conduct an empirical analysis of eight state-of-the-art models, and our consistency metric reveals how all tested LLMs lack robust generalisation.

* Accepted as non-archival submission to the GenBench Workshop 2023. arXiv admin note: substantial text overlap with arXiv:2310.13486
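
The abstract above does not spell out how consistency is computed. As a minimal sketch, assuming a chance-corrected agreement statistic such as Cohen's kappa is a reasonable stand-in, the agreement between one model's predictions under two setups could be measured as follows. The function name, the label strings and the example predictions are all hypothetical and are not taken from the test itself.

```python
from collections import Counter


def cohens_kappa(preds_a, preds_b):
    """Cohen's kappa between two equally long lists of predicted labels."""
    if len(preds_a) != len(preds_b) or not preds_a:
        raise ValueError("need two non-empty prediction lists of equal length")
    n = len(preds_a)
    # Observed agreement: share of items on which both setups predict the same label.
    observed = sum(a == b for a, b in zip(preds_a, preds_b)) / n
    # Chance agreement, computed from each setup's own label distribution.
    counts_a, counts_b = Counter(preds_a), Counter(preds_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both setups always predict the same single label
    return (observed - expected) / (1.0 - expected)


# Hypothetical predictions for the same six NLI items under two prompt setups.
setup_1 = ["entail", "neutral", "contra", "entail", "neutral", "contra"]
setup_2 = ["entail", "neutral", "contra", "neutral", "neutral", "entail"]
kappa = cohens_kappa(setup_1, setup_2)
```

A value of 1 means the two setups always agree, and a value near 0 means they agree no more often than their label distributions would produce by chance. Averaging such pairwise values over many setup pairs would give an aggregated score, while grouping pairs by the property that differs between them would give a fine-grained view.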

WorldSense: A Synthetic Benchmark for Grounded Reasoning in Large Language Models

Nov 27, 2023

Youssef Benchekroun, Megi Dervishi, Mark Ibrahim, Jean-Baptiste Gaya, Xavier Martinet, Grégoire Mialon, Thomas Scialom, Emmanuel Dupoux, Dieuwke Hupkes, Pascal Vincent

Abstract: We propose WorldSense, a benchmark designed to assess the extent to which LLMs are consistently able to sustain tacit world models, by testing how they draw simple inferences from descriptions of simple arrangements of entities. WorldSense is a synthetic benchmark with three problem types, each with its own trivial control, which explicitly avoids bias by decorrelating the abstract structure of problems from the vocabulary and expressions, and by decorrelating all problem subparts with the correct response. We run our benchmark on three state-of-the-art chat-LLMs (GPT3.5, GPT4 and Llama2-chat) and show that these models make errors even with as few as three objects. Furthermore, they have quite heavy response biases, preferring certain responses irrespective of the question. Errors persist even with chain-of-thought prompting and in-context learning. Lastly, we show that while finetuning on similar problems does result in substantial improvements -- within- and out-of-distribution -- the finetuned models do not generalise beyond a constrained problem space.

Memorisation Cartography: Mapping out the Memorisation-Generalisation Continuum in Neural Machine Translation

Nov 09, 2023

Verna Dankers, Ivan Titov, Dieuwke Hupkes

Abstract: When training a neural network, it will quickly memorise some source-target mappings from your dataset but never learn some others. Yet, memorisation is not easily expressed as a binary feature that is good or bad: individual datapoints lie on a memorisation-generalisation continuum. What determines a datapoint's position on that spectrum, and how does that spectrum influence neural models' performance? We address these two questions for neural machine translation (NMT) models. We use the counterfactual memorisation metric to (1) build a resource that places 5M NMT datapoints on a memorisation-generalisation map, (2) illustrate how the datapoints' surface-level characteristics and a model's per-datum training signals are predictive of memorisation in NMT, and (3) describe the influence that subsets of that map have on NMT systems' performance.

* Published in EMNLP 2023; 21 pages total (9 in the main paper, 3 pages with limitations, acknowledgments and references, 9 pages with appendices)
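
Counterfactual memorisation, as the term is generally used, compares how well models score on a datapoint when it was in their training subset with how well they score when it was held out. The sketch below illustrates that difference for a single datapoint. The function name and the per-model scores are hypothetical, and the scoring function used in the paper is not stated in the abstract.

```python
from statistics import mean


def counterfactual_memorisation(scores_in, scores_out):
    """Mean score of models trained on the datapoint minus mean score of models that never saw it."""
    if not scores_in or not scores_out:
        raise ValueError("need at least one model in each group")
    return mean(scores_in) - mean(scores_out)


# Hypothetical per-model scores on one datapoint (for example, target-sequence likelihood).
scores_when_trained_on = [0.9, 0.8, 1.0]
scores_when_held_out = [0.2, 0.3, 0.1]
cm = counterfactual_memorisation(scores_when_trained_on, scores_when_held_out)
```

A large positive value indicates a datapoint that models only get right after seeing it, which places it towards the memorisation end of the continuum. A value near zero indicates a datapoint that models handle equally well (or equally badly) either way.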

The Validity of Evaluation Results: Assessing Concurrence Across Compositionality Benchmarks

Oct 26, 2023

Kaiser Sun, Adina Williams, Dieuwke Hupkes

Abstract: NLP models have progressed drastically in recent years, according to numerous datasets proposed to evaluate performance. Questions remain, however, about how particular dataset design choices may impact the conclusions we draw about model capabilities. In this work, we investigate this question in the domain of compositional generalization. We examine the performance of six modeling approaches across 4 datasets, split according to 8 compositional splitting strategies, ranking models by 18 compositional generalization splits in total. Our results show that: i) the datasets, although all designed to evaluate compositional generalization, rank modeling approaches differently; ii) datasets generated by humans align better with each other than they do with synthetic datasets, or than synthetic datasets among themselves; iii) generally, whether datasets are sampled from the same source is more predictive of the resulting model ranking than whether they maintain the same interpretation of compositionality; and iv) which lexical items are used in the data can strongly impact conclusions. Overall, our results demonstrate that much work remains to be done when it comes to assessing whether popular evaluation datasets measure what they intend to measure, and suggest that elucidating more rigorous standards for establishing the validity of evaluation sets could benefit the field.

* CoNLL2023
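
Whether two benchmarks rank the same set of models alike can be quantified with a rank correlation. The sketch below uses Kendall's tau (the tau-a variant) as an assumed example of such a statistic; the abstract does not say which measure the authors use. The function name and all scores are hypothetical.

```python
from itertools import combinations


def kendall_tau(scores_x, scores_y):
    """Kendall's tau-a between two score lists over the same models (tied pairs count as neither)."""
    n = len(scores_x)
    if n != len(scores_y) or n < 2:
        raise ValueError("need two score lists of equal length with at least two models")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # Positive product: both benchmarks order models i and j the same way.
        product = (scores_x[i] - scores_x[j]) * (scores_y[i] - scores_y[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical accuracies of four modeling approaches on two compositional splits.
split_a = [71.0, 64.5, 58.2, 80.1]
split_b = [40.3, 45.9, 30.0, 52.4]
tau = kendall_tau(split_a, split_b)
```

A value of 1 means both splits rank the approaches identically, and a value of -1 means the rankings are reversed. Here the two splits disagree only on the order of the first two approaches.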

Mind the instructions: a holistic evaluation of consistency and interactions in prompt-based learning

Oct 20, 2023

Lucas Weber, Elia Bruni, Dieuwke Hupkes

Abstract: Finding the best way of adapting pre-trained language models to a task is a big challenge in current NLP. Just like the previous generation of task-tuned models (TT), models that are adapted to tasks via in-context learning (ICL) are robust in some setups but not in others. Here, we present a detailed analysis of which design choices cause instabilities and inconsistencies in LLM predictions. First, we show how spurious correlations between input distributions and labels -- a known issue in TT models -- form only a minor problem for prompted models. Then, we engage in a systematic, holistic evaluation of different factors that have been found to influence predictions in a prompting setup. We test all possible combinations of a range of factors on both vanilla and instruction-tuned (IT) LLMs of different scales and statistically analyse the results to show which factors are the most influential, interactive or stable. Our results show which factors can be used without precautions and which should be avoided or handled with care in most settings.

Curriculum Learning with Adam: The Devil Is in the Wrong Details

Aug 23, 2023

Lucas Weber, Jaap Jumelet, Paul Michel, Elia Bruni, Dieuwke Hupkes

Abstract: Curriculum learning (CL) posits that machine learning models -- similar to humans -- may learn more efficiently from data that match their current learning progress. However, CL methods are still poorly understood and, in particular for natural language processing (NLP), have achieved only limited success. In this paper, we explore why. Starting from an attempt to replicate and extend a number of recent curriculum methods, we find that their results are surprisingly brittle when applied to NLP. A deep dive into the (in)effectiveness of the curricula in some scenarios shows us why: when curricula are employed in combination with the popular Adam optimisation algorithm, they oftentimes learn to adapt to suboptimally chosen optimisation parameters for this algorithm. We present a number of different case studies with different common hand-crafted and automated CL approaches to illustrate this phenomenon, and we find that none of them outperforms optimisation with only Adam with well-chosen hyperparameters. As such, our results contribute to understanding why CL methods work, but at the same time urge caution when claiming positive results.

Separating form and meaning: Using self-consistency to quantify task understanding across multiple senses

May 23, 2023

Xenia Ohmer, Elia Bruni, Dieuwke Hupkes

Abstract: At the staggering pace with which the capabilities of large language models (LLMs) are increasing, creating future-proof evaluation sets to assess their understanding becomes more and more challenging. In this paper, we propose a novel paradigm for evaluating LLMs which leverages the idea that correct world understanding should be consistent across different (Fregean) senses of the same meaning. Accordingly, we measure understanding not in terms of correctness but by evaluating consistency across multiple senses that are generated by the model itself. We showcase our approach by instantiating a test where the different senses are different languages, hence using multilingual self-consistency as a litmus test for the model's understanding and simultaneously addressing the important topic of multilingualism. Taking one of the latest versions of ChatGPT as our object of study, we evaluate multilingual consistency for two different tasks across three different languages. We show that its multilingual consistency is still lacking, and that its task and world understanding are thus not language-independent. As our approach does not require any static evaluation corpora in languages other than English, it can easily and cheaply be extended to different languages and tasks and could become an integral part of future benchmarking efforts.

The Curious Case of Absolute Position Embeddings

Oct 23, 2022

Koustuv Sinha, Amirhossein Kazemnejad, Siva Reddy, Joelle Pineau, Dieuwke Hupkes, Adina Williams

Abstract: Transformer language models encode the notion of word order using positional information. Most commonly, this positional information is represented by absolute position embeddings (APEs), which are learned from the pretraining data. However, in natural language, it is not absolute position that matters, but relative position, and the extent to which APEs can capture this type of information has not been investigated. In this work, we observe that models trained with APE over-rely on positional information to the point that they break down when subjected to sentences with shifted position information. Specifically, when models are subjected to sentences starting from a non-zero position (excluding the effect of priming), they exhibit noticeably degraded performance on zero to full-shot tasks, across a range of model families and model sizes. Our findings raise questions about the efficacy of APEs to model the relativity of position information, and invite further introspection on the sentence and word order processing strategies employed by these models.

* Accepted at EMNLP 2022 Findings; 5 pages and 15 pages Appendix
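
To make "sentences starting from a non-zero position" concrete, the sketch below builds the position indices that would be looked up in an absolute position embedding table, first unshifted and then shifted by an offset. The function name, the `max_positions` default and the offset value are hypothetical choices for illustration and are not taken from the paper.

```python
def position_ids(seq_len, offset=0, max_positions=512):
    """Absolute position indices for a sequence whose first token sits at `offset`."""
    if seq_len < 0 or offset < 0:
        raise ValueError("seq_len and offset must be non-negative")
    if offset + seq_len > max_positions:
        # A learned embedding table has no entry beyond its last trained position.
        raise ValueError("shifted sequence exceeds the embedding table")
    return list(range(offset, offset + seq_len))


# The same four-token sentence, unshifted and shifted by 100 positions.
unshifted = position_ids(4)
shifted = position_ids(4, offset=100)
```

The relative distances between tokens are identical in both cases; only the absolute indices differ. A model that relied purely on relative order would be expected to behave the same on both inputs.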
