Paula Buttery

LLMs Contain Multitudes: How Deployment Context Reshapes Model-Level Preferences and Values

Jun 11, 2026

Filip Trhlik, Aoife O'Flynn, Angela Yu, Arduin Findeis, Paula Buttery

Abstract:Large language models (LLMs) are increasingly characterised in recent evaluation work as having stable, model-level preference and value systems. However, accompanying robustness checks are limited to incidental prompt perturbations such as syntax variation and option reordering. This leaves open whether the measured properties survive when the surrounding task context changes, as it does in most real deployments. We test this directly across two established pairwise paradigms: ranking country preferences and eliciting utility judgements. In both, we make the deployment context -- the high-level task the model is performing while making concrete value-dependent choices -- our controlled variable, varied across framings such as writing a Reddit post or a news article. Across five LLMs and over 1.2M pairwise decisions, deployment context produces variation far larger than prompt paraphrasing and temperature controls. In country preference rankings over 15 countries, context induces widespread, statistically significant rank shifts; the aggregate Global North favouritism reported in prior work is itself context-dependent, with each model's bias shifting systematically across contexts. In utility elicitation over 50 outcomes, broad cross-category ordering is preserved, but fine-grained rankings within domains vary substantially, and cardinal exchange rates between outcomes (e.g. how many lives in one region equal one in another) shift by a factor of 2.47 at the median. Reported model-level preferences and utilities are therefore better understood as context-conditioned measurements than fixed model-level properties: safety guarantees obtained under one framing provide limited assurance in another.

* 68 pages, 54 figures, 54 tables

Bias Dynamics in BabyLMs: Towards a Compute-Efficient Sandbox for Democratising Pre-Training Debiasing

Jan 15, 2026

Filip Trhlik, Andrew Caines, Paula Buttery

Abstract:Pre-trained language models (LMs) have, over the last few years, grown substantially in both societal adoption and training costs. This rapid growth in size has constrained progress in understanding and mitigating their biases. Since re-training LMs is prohibitively expensive, most debiasing work has focused on post-hoc or masking-based strategies, which often fail to address the underlying causes of bias. In this work, we seek to democratise pre-model debiasing research by using low-cost proxy models. Specifically, we investigate BabyLMs, compact BERT-like models trained on small and mutable corpora that can approximate bias acquisition and learning dynamics of larger models. We show that BabyLMs display closely aligned patterns of intrinsic bias formation and performance development compared to standard BERT models, despite their drastically reduced size. Furthermore, correlations between BabyLMs and BERT hold across multiple intra-model and post-model debiasing methods. Leveraging these similarities, we conduct pre-model debiasing experiments with BabyLMs, replicating prior findings and presenting new insights regarding the influence of gender imbalance and toxicity on bias formation. Our results demonstrate that BabyLMs can serve as an effective sandbox for large-scale LMs, reducing pre-training costs from over 500 GPU-hours to under 30 GPU-hours. This provides a way to democratise pre-model debiasing research and enables faster, more accessible exploration of methods for building fairer LMs.

* 21 pages, 18 figures

Teacher Demonstrations in a BabyLM's Zone of Proximal Development for Contingent Multi-Turn Interaction

Oct 23, 2025

Suchir Salhan, Hongyi Gu, Donya Rooein, Diana Galvan-Sosa, Gabrielle Gaudeau, Andrew Caines, Zheng Yuan, Paula Buttery

Abstract:Multi-turn dialogues between a child and a caregiver are characterized by a property called contingency - that is, prompt, direct, and meaningful exchanges between interlocutors. We introduce ContingentChat, a teacher-student framework that benchmarks and improves multi-turn contingency in a BabyLM trained on 100M words. Using a novel alignment dataset for post-training, BabyLM generates responses that are more grammatical and cohesive. Experiments with adaptive teacher decoding strategies show limited additional gains. ContingentChat demonstrates the benefits of targeted post-training for dialogue quality and indicates that contingency remains a challenging goal for BabyLMs.

* Outstanding Paper Award, EMNLP 2025 BabyLM Workshop - Oral presentation, Suzhou, China

Investigating ReLoRA: Effects on the Learning Dynamics of Small Language Models

Sep 16, 2025

Yuval Weiss, David Demitri Africa, Paula Buttery, Richard Diehl Martinez

Abstract:Parameter-efficient methods such as LoRA have revolutionised the fine-tuning of LLMs. Still, their extension to pretraining via ReLoRA is less well understood, especially for small language models (SLMs), which offer lower computational and environmental costs. This work is the first systematic study of ReLoRA in SLMs (11M-66M parameters), evaluating both performance and learning dynamics. Through ablation experiments, we find that ReLoRA generally performs worse than standard training on loss, Paloma perplexity and BLiMP, with the gap widening for the larger models. Further analysis of the learning dynamics of the models indicates that ReLoRA reinforces the rank deficiencies found in smaller models. These results indicate that low-rank update strategies may not transfer easily to SLM pretraining, highlighting the need for more research in the low-compute regime.

* 12 pages, 6 tables, 8 figures

Rubrik's Cube: Testing a New Rubric for Evaluating Explanations on the CUBE dataset

Mar 31, 2025

Diana Galvan-Sosa, Gabrielle Gaudeau, Pride Kavumba, Yunmeng Li, Hongyi Gu, Zheng Yuan, Keisuke Sakaguchi, Paula Buttery

Abstract:The performance and usability of Large Language Models (LLMs) are driving their use in explanation generation tasks. However, despite their widespread adoption, LLM explanations have been found to be unreliable, making it difficult for users to distinguish good from bad explanations. To address this issue, we present Rubrik's CUBE, an education-inspired rubric and a dataset of 26k explanations, written and later quality-annotated using the rubric by both humans and six open- and closed-source LLMs. The CUBE dataset focuses on two reasoning and two language tasks, providing the necessary diversity for us to effectively test our proposed rubric. Using Rubrik, we find that explanations are influenced by both task and perceived difficulty. Low quality stems primarily from a lack of conciseness in LLM-generated explanations, rather than from cohesion and word choice. The full dataset, rubric, and code will be made available upon acceptance.

* 9 main pages (21 appendix pages), 7 figures, submitted to ACL 2025

Less is More: Pre-Training Cross-Lingual Small-Scale Language Models with Cognitively-Plausible Curriculum Learning Strategies

Oct 30, 2024

Suchir Salhan, Richard Diehl Martinez, Zébulon Goriely, Paula Buttery

Abstract:Curriculum Learning has been a popular strategy to improve the cognitive plausibility of Small-Scale Language Models (SSLMs) in the BabyLM Challenge. However, it has not led to considerable improvements over non-curriculum models. We assess whether linguistic acquisition theories can be used to specify more fine-grained curriculum learning strategies. To do so, we create age-ordered corpora of Child-Directed Speech for four typologically distant language families and use them to implement SSLMs and acquisition-inspired curricula cross-lingually. We compare the success of three objective curricula (Growing, Inwards and MMM) that precisely replicate the predictions of acquisition theories on a standard SSLM architecture. We find that fine-grained acquisition-inspired curricula can outperform non-curriculum baselines, and that the performance benefits of curriculum strategies in SSLMs can be derived by specifying fine-grained, language-specific curricula that precisely replicate language acquisition theories.

* BabyLM Shared Task 2024 (Accepted, Poster), co-located with EMNLP 2024

From Babble to Words: Pre-Training Language Models on Continuous Streams of Phonemes

Oct 30, 2024

Zébulon Goriely, Richard Diehl Martinez, Andrew Caines, Lisa Beinborn, Paula Buttery

Abstract:Language models are typically trained on large corpora of text in their default orthographic form. However, this is not the only option; representing data as streams of phonemes can offer unique advantages, from deeper insights into phonological language acquisition to improved performance on sound-based tasks. The challenge lies in evaluating the impact of phoneme-based training, as most benchmarks are also orthographic. To address this, we develop a pipeline to convert text datasets into a continuous stream of phonemes. We apply this pipeline to the 100-million-word pre-training dataset from the BabyLM challenge, as well as to standard language and grammatical benchmarks, enabling us to pre-train and evaluate a model using phonemic input representations. Our results show that while phoneme-based training slightly reduces performance on traditional language understanding tasks, it offers valuable analytical and practical benefits.

Tending Towards Stability: Convergence Challenges in Small Language Models

Oct 15, 2024

Richard Diehl Martinez, Pietro Lesci, Paula Buttery

Abstract:Increasing the number of parameters in language models is a common strategy to enhance their performance. However, smaller language models remain valuable due to their lower operational costs. Despite their advantages, smaller models frequently underperform compared to their larger counterparts, even when provided with equivalent data and computational resources. Specifically, their performance tends to degrade in the late pretraining phase. This is anecdotally attributed to their reduced representational capacity. Yet, the exact causes of this performance degradation remain unclear. We use the Pythia model suite to analyse the training dynamics that underlie this phenomenon. Across different model sizes, we investigate the convergence of the Attention and MLP activations to their final state and examine how the effective rank of their parameters influences this process. We find that nearly all layers in larger models stabilise early in training - within the first 20% - whereas layers in smaller models exhibit slower and less stable convergence, especially when their parameters have lower effective rank. By linking the convergence of layers' activations to their parameters' effective rank, our analyses can guide future work to address inefficiencies in the learning dynamics of small models.

Mitigating Frequency Bias and Anisotropy in Language Model Pre-Training with Syntactic Smoothing

Oct 15, 2024

Richard Diehl Martinez, Zébulon Goriely, Andrew Caines, Paula Buttery, Lisa Beinborn

Abstract:Language models strongly rely on frequency information because they maximize the likelihood of tokens during pre-training. As a consequence, language models tend to not generalize well to tokens that are seldom seen during training. Moreover, maximum likelihood training has been discovered to give rise to anisotropy: representations of tokens in a model tend to cluster tightly in a high-dimensional cone, rather than spreading out over their representational capacity. Our work introduces a method for quantifying the frequency bias of a language model by assessing sentence-level perplexity with respect to token-level frequency. We then present a method for reducing the frequency bias of a language model by inducing a syntactic prior over token representations during pre-training. Our Syntactic Smoothing method adjusts the maximum likelihood objective function to distribute the learning signal to syntactically similar tokens. This approach results in better performance on infrequent English tokens and a decrease in anisotropy. We empirically show that the degree of anisotropy in a model correlates with its frequency bias.

Prompting open-source and commercial language models for grammatical error correction of English learner text

Jan 15, 2024

Christopher Davis, Andrew Caines, Øistein Andersen, Shiva Taslimipoor, Helen Yannakoudakis, Zheng Yuan, Christopher Bryant, Marek Rei, Paula Buttery

Abstract:Thanks to recent advances in generative AI, we are able to prompt large language models (LLMs) to produce texts which are fluent and grammatical. In addition, it has been shown that we can elicit attempts at grammatical error correction (GEC) from LLMs when prompted with ungrammatical input sentences. We evaluate how well LLMs can perform at GEC by measuring their performance on established benchmark datasets. We go beyond previous studies, which only examined GPT* models on a selection of English GEC datasets, by evaluating seven open-source and three commercial LLMs on four established GEC benchmarks. We investigate model performance and report results against individual error types. Our results indicate that LLMs do not always outperform supervised English GEC models except in specific contexts -- namely commercial LLMs on benchmarks annotated with fluency corrections as opposed to minimal edits. We find that several open-source models outperform commercial ones on minimal edit benchmarks, and that in some settings zero-shot prompting is just as competitive as few-shot prompting.

* 8 pages with appendices
