Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

James A. Michaelov

Not quite Sherlock Holmes: Language model predictions do not reliably differentiate impossible from improbable events

Jun 07, 2025

James A. Michaelov, Reeka Estacio, Zhien Zhang, Benjamin K. Bergen

Abstract:Can language models reliably predict that possible events are more likely than merely improbable ones? By teasing apart possibility, typicality, and contextual relatedness, we show that despite the results of previous work, language models' ability to do this is far from robust. In fact, under certain conditions, all models tested - including Llama 3, Gemma 2, and Mistral NeMo - perform at worse-than-chance level, assigning higher probabilities to impossible sentences such as 'the car was given a parking ticket by the brake' than to merely unlikely sentences such as 'the car was given a parking ticket by the explorer'.

* Accepted to Findings of ACL 2025

Via

Access Paper or Ask Questions

On the Acquisition of Shared Grammatical Representations in Bilingual Language Models

Mar 05, 2025

Catherine Arnett, Tyler A. Chang, James A. Michaelov, Benjamin K. Bergen

Abstract:While crosslingual transfer is crucial to contemporary language models' multilingual capabilities, how it occurs is not well understood. In this paper, we ask what happens to a monolingual language model when it begins to be trained on a second language. Specifically, we train small bilingual models for which we control the amount of data for each language and the order of language exposure. To find evidence of shared multilingual representations, we turn to structural priming, a method used to study grammatical representations in humans. We first replicate previous crosslingual structural priming results and find that after controlling for training data quantity and language exposure, there are asymmetrical effects across language pairs and directions. We argue that this asymmetry may shape hypotheses about human structural priming effects. We also find that structural priming effects are less robust for less similar language pairs, highlighting potential limitations of crosslingual transfer learning and shared representations for typologically diverse languages.

Via

Access Paper or Ask Questions

Revenge of the Fallen? Recurrent Models Match Transformers at Predicting Human Language Comprehension Metrics

Apr 30, 2024

James A. Michaelov, Catherine Arnett, Benjamin K. Bergen

Abstract:Transformers have supplanted Recurrent Neural Networks as the dominant architecture for both natural language processing tasks and, despite criticisms of cognitive implausibility, for modelling the effect of predictability on online human language comprehension. However, two recently developed recurrent neural network architectures, RWKV and Mamba, appear to perform natural language tasks comparably to or better than transformers of equivalent scale. In this paper, we show that contemporary recurrent models are now also able to match - and in some cases, exceed - performance of comparably sized transformers at modeling online human language comprehension. This suggests that transformer language models are not uniquely suited to this task, and opens up new directions for debates about the extent to which architectural features of language models make them better or worse models of human language comprehension.

Via

Access Paper or Ask Questions

Structural Priming Demonstrates Abstract Grammatical Representations in Multilingual Language Models

Nov 15, 2023

James A. Michaelov, Catherine Arnett, Tyler A. Chang, Benjamin K. Bergen

Abstract:Abstract grammatical knowledge - of parts of speech and grammatical patterns - is key to the capacity for linguistic generalization in humans. But how abstract is grammatical knowledge in large language models? In the human literature, compelling evidence for grammatical abstraction comes from structural priming. A sentence that shares the same grammatical structure as a preceding sentence is processed and produced more readily. Because confounds exist when using stimuli in a single language, evidence of abstraction is even more compelling from crosslingual structural priming, where use of a syntactic structure in one language primes an analogous structure in another language. We measure crosslingual structural priming in large language models, comparing model behavior to human experimental results from eight crosslingual experiments covering six languages, and four monolingual structural priming experiments in three non-English languages. We find evidence for abstract monolingual and crosslingual grammatical representations in the models that function similarly to those found in humans. These results demonstrate that grammatical representations in multilingual language models are not only similar across languages, but they can causally influence text produced in different languages.

* Accepted at EMNLP 2023

Via

Access Paper or Ask Questions

Crosslingual Structural Priming and the Pre-Training Dynamics of Bilingual Language Models

Oct 11, 2023

Catherine Arnett, Tyler A. Chang, James A. Michaelov, Benjamin K. Bergen

Abstract:Do multilingual language models share abstract grammatical representations across languages, and if so, when do these develop? Following Sinclair et al. (2022), we use structural priming to test for abstract grammatical representations with causal effects on model outputs. We extend the approach to a Dutch-English bilingual setting, and we evaluate a Dutch-English language model during pre-training. We find that crosslingual structural priming effects emerge early after exposure to the second language, with less than 1M tokens of data in that language. We discuss implications for data contamination, low-resource transfer, and how abstract grammatical representations emerge in multilingual models.

* Extended abstract accepted to the 3rd Multilingual Representation Learning workshop at EMNLP 2023

Via

Access Paper or Ask Questions

Emergent inabilities? Inverse scaling over the course of pretraining

May 24, 2023

James A. Michaelov, Benjamin K. Bergen

Abstract:Does inverse scaling only occur as a function of model parameter size, or can it also occur over the course of training? We carry out an exploratory study investigating whether, over the course of training on the language modeling task, the performance of language models at specific tasks can decrease while general performance remains high. We find that for two tasks from the Inverse Scaling Challenge - quote-repetition and redefine-math - this is indeed the case. Specifically, we find that for Pythia (Biderman et al., 2023) models with a higher number of parameters, performance decreases over the course of training at these two tasks, despite these models showing standard (positive) scaling overall. This highlights the importance of testing model performance at all relevant benchmarks any time they are trained on additional data, even if their overall performance improves.

Via

Access Paper or Ask Questions

Can Peanuts Fall in Love with Distributional Semantics?

Jan 20, 2023

James A. Michaelov, Seana Coulson, Benjamin K. Bergen

Abstract:The context in which a sentence appears can drastically alter our expectations about upcoming words - for example, following a short story involving an anthropomorphic peanut, experimental participants are more likely to expect the sentence 'the peanut was in love' than 'the peanut was salted', as indexed by N400 amplitude (Nieuwland & van Berkum, 2006). This rapid and dynamic updating of comprehenders' expectations about the kind of events that a peanut may take part in based on context has been explained using the construct of Situation Models - updated mental representations of key elements of an event under discussion, in this case, the peanut protagonist. However, recent work showing that N400 amplitude can be predicted based on distributional information alone raises the question whether situation models are in fact necessary for the kinds of contextual effects observed in previous work. To investigate this question, we attempt to model the results of Nieuwland and van Berkum (2006) using six computational language models and three sets of word vectors, none of which have explicit situation models or semantic grounding. We find that the effect found by Nieuwland and van Berkum (2006) can be fully modeled by two language models and two sets of word vectors, with others showing a reduced effect. Thus, at least some processing effects normally explained through situation models may not in fact require explicit situation models.

Via

Access Paper or Ask Questions

'Rarely' a problem? Language models exhibit inverse scaling in their predictions following 'few'-type quantifiers

Dec 16, 2022

James A. Michaelov, Benjamin K. Bergen

Figure 1 for 'Rarely' a problem? Language models exhibit inverse scaling in their predictions following 'few'-type quantifiers

Figure 2 for 'Rarely' a problem? Language models exhibit inverse scaling in their predictions following 'few'-type quantifiers

Figure 3 for 'Rarely' a problem? Language models exhibit inverse scaling in their predictions following 'few'-type quantifiers

Figure 4 for 'Rarely' a problem? Language models exhibit inverse scaling in their predictions following 'few'-type quantifiers

Abstract:Language Models appear to perform poorly on quantification. We ask how badly. 'Few'-type quantifiers, as in 'few children like vegetables' might pose a particular challenge for Language Models, since the sentence components without the quantifier are likely to co-occur, and because 'few'-type quantifiers are rare. We present 960 sentences stimuli from two human neurolinguistic experiments to 22 autoregressive transformer models of differing sizes. Not only do the models perform poorly on 'few'-type quantifiers, but overall the larger the model, the worse its performance. We interpret this inverse scaling as suggesting that larger models increasingly reflect online rather than offline human processing, and argue that decreasing performance of larger models may challenge uses of Language Models as the basis for Natural Language Systems.

Via

Access Paper or Ask Questions

Collateral facilitation in humans and language models

Nov 09, 2022

James A. Michaelov, Benjamin K. Bergen

Abstract:Are the predictions of humans and language models affected by similar things? Research suggests that while comprehending language, humans make predictions about upcoming words, with more predictable words being processed more easily. However, evidence also shows that humans display a similar processing advantage for highly anomalous words when these words are semantically related to the preceding context or to the most probable continuation. Using stimuli from 3 psycholinguistic experiments, we find that this is also almost always also the case for 8 contemporary transformer language models (BERT, ALBERT, RoBERTa, XLM-R, GPT-2, GPT-Neo, GPT-J, and XGLM). We then discuss the implications of this phenomenon for our understanding of both human language comprehension and the predictions made by language models.

* Accepted at CoNLL 2022

Via

Access Paper or Ask Questions

Do language models make human-like predictions about the coreferents of Italian anaphoric zero pronouns?

Aug 30, 2022

James A. Michaelov, Benjamin K. Bergen

Figure 1 for Do language models make human-like predictions about the coreferents of Italian anaphoric zero pronouns?

Figure 2 for Do language models make human-like predictions about the coreferents of Italian anaphoric zero pronouns?

Figure 3 for Do language models make human-like predictions about the coreferents of Italian anaphoric zero pronouns?

Figure 4 for Do language models make human-like predictions about the coreferents of Italian anaphoric zero pronouns?

Abstract:Some languages allow arguments to be omitted in certain contexts. Yet human language comprehenders reliably infer the intended referents of these zero pronouns, in part because they construct expectations about which referents are more likely. We ask whether Neural Language Models also extract the same expectations. We test whether 12 contemporary language models display expectations that reflect human behavior when exposed to sentences with zero pronouns from five behavioral experiments conducted in Italian by Carminati (2005). We find that three models - XGLM 2.9B, 4.5B, and 7.5B - capture the human behavior from all the experiments, with others successfully modeling some of the results. This result suggests that human expectations about coreference can be derived from exposure to language, and also indicates features of language models that allow them to better reflect human behavior.

* Accepted at COLING 2022

Via

Access Paper or Ask Questions