Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Richard Diehl Martinez

SumTablets: A Transliteration Dataset of Sumerian Tablets

Feb 25, 2026

Cole Simmons, Richard Diehl Martinez, Dan Jurafsky

Abstract:Sumerian transliteration is a conventional system for representing a scholar's interpretation of a tablet in the Latin script. Thanks to visionary digital Assyriology projects such as ETCSL, CDLI, and Oracc, a large number of Sumerian transliterations have been published online, and these data are well-structured for a variety of search and analysis tasks. However, the absence of a comprehensive, accessible dataset pairing transliterations with a digital representation of the tablet's cuneiform glyphs has prevented the application of modern Natural Language Processing (NLP) methods to the task of Sumerian transliteration. To address this gap, we present SumTablets, a dataset pairing Unicode representations of 91,606 Sumerian cuneiform tablets (totaling 6,970,407 glyphs) with the associated transliterations published by Oracc. We construct SumTablets by first preprocessing and standardizing the Oracc transliterations before mapping each reading back to the Unicode representation of the source glyph. Further, we retain parallel structural information (e.g., surfaces, newlines, broken segments) through the use of special tokens. We release SumTablets as a Hugging Face Dataset (CC BY 4.0) and open source data preparation code via GitHub. Additionally, we leverage SumTablets to implement and evaluate two transliteration baselines: (1) weighted sampling from a glyph's possible readings, and (2) fine-tuning an autoregressive language model. Our fine-tuned language model achieves an average transliteration character-level F-score (chrF) of 97.55, demonstrating the immediate potential of transformer-based transliteration models in allowing experts to rapidly verify generated transliterations rather than manually transliterating tablets one-by-one.

* Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024), pages 192-202, Hybrid in Bangkok, Thailand and online. Association for Computational Linguistics
* 11 pages with 3 figures

Via

Access Paper or Ask Questions

Investigating ReLoRA: Effects on the Learning Dynamics of Small Language Models

Sep 16, 2025

Yuval Weiss, David Demitri Africa, Paula Buttery, Richard Diehl Martinez

Abstract:Parameter-efficient methods such as LoRA have revolutionised the fine-tuning of LLMs. Still, their extension to pretraining via ReLoRA is less well understood, especially for small language models (SLMs), which offer lower computational and environmental costs. This work is the first systematic study of ReLoRA in SLMs (11M-66M parameters), evaluating both performance and learning dynamics. Through ablation experiments, we find that ReLoRA generally performs worse than standard training on loss, Paloma perplexity and BLiMP, with the gap widening for the larger models. Further analysis of the learning dynamics of the models indicates that ReLoRA reinforces the rank deficiencies found in smaller models. These results indicate that low-rank update strategies may not transfer easily to SLM pretraining, highlighting the need for more research in the low-compute regime.

* 12 Pages, 6 Tables, 8 Figures

Via

Access Paper or Ask Questions

From Babble to Words: Pre-Training Language Models on Continuous Streams of Phonemes

Oct 30, 2024

Zébulon Goriely, Richard Diehl Martinez, Andrew Caines, Lisa Beinborn, Paula Buttery

Figure 1 for From Babble to Words: Pre-Training Language Models on Continuous Streams of Phonemes

Figure 2 for From Babble to Words: Pre-Training Language Models on Continuous Streams of Phonemes

Figure 3 for From Babble to Words: Pre-Training Language Models on Continuous Streams of Phonemes

Figure 4 for From Babble to Words: Pre-Training Language Models on Continuous Streams of Phonemes

Abstract:Language models are typically trained on large corpora of text in their default orthographic form. However, this is not the only option; representing data as streams of phonemes can offer unique advantages, from deeper insights into phonological language acquisition to improved performance on sound-based tasks. The challenge lies in evaluating the impact of phoneme-based training, as most benchmarks are also orthographic. To address this, we develop a pipeline to convert text datasets into a continuous stream of phonemes. We apply this pipeline to the 100-million-word pre-training dataset from the BabyLM challenge, as well as to standard language and grammatical benchmarks, enabling us to pre-train and evaluate a model using phonemic input representations. Our results show that while phoneme-based training slightly reduces performance on traditional language understanding tasks, it offers valuable analytical and practical benefits.

Via

Access Paper or Ask Questions

Less is More: Pre-Training Cross-Lingual Small-Scale Language Models with Cognitively-Plausible Curriculum Learning Strategies

Oct 30, 2024

Suchir Salhan, Richard Diehl Martinez, Zébulon Goriely, Paula Buttery

Abstract:Curriculum Learning has been a popular strategy to improve the cognitive plausibility of Small-Scale Language Models (SSLMs) in the BabyLM Challenge. However, it has not led to considerable improvements over non-curriculum models. We assess whether theoretical linguistic acquisition theories can be used to specify more fine-grained curriculum learning strategies, creating age-ordered corpora of Child-Directed Speech for four typologically distant language families to implement SSLMs and acquisition-inspired curricula cross-lingually. Comparing the success of three objective curricula (Growing, Inwards and MMM) that precisely replicate the predictions of acquisition theories on a standard SSLM architecture, we find fine-grained acquisition-inspired curricula can outperform non-curriculum baselines and performance benefits of curricula strategies in SSLMs can be derived by specifying fine-grained language-specific curricula that precisely replicate language acquisition theories.

* BabyLM Shared Task 2024 (Accepted, Poster), co-located in EMNLP 2024

Via

Access Paper or Ask Questions

Tending Towards Stability: Convergence Challenges in Small Language Models

Oct 15, 2024

Richard Diehl Martinez, Pietro Lesci, Paula Buttery

Abstract:Increasing the number of parameters in language models is a common strategy to enhance their performance. However, smaller language models remain valuable due to their lower operational costs. Despite their advantages, smaller models frequently underperform compared to their larger counterparts, even when provided with equivalent data and computational resources. Specifically, their performance tends to degrade in the late pretraining phase. This is anecdotally attributed to their reduced representational capacity. Yet, the exact causes of this performance degradation remain unclear. We use the Pythia model suite to analyse the training dynamics that underlie this phenomenon. Across different model sizes, we investigate the convergence of the Attention and MLP activations to their final state and examine how the effective rank of their parameters influences this process. We find that nearly all layers in larger models stabilise early in training - within the first 20% - whereas layers in smaller models exhibit slower and less stable convergence, especially when their parameters have lower effective rank. By linking the convergence of layers' activations to their parameters' effective rank, our analyses can guide future work to address inefficiencies in the learning dynamics of small models.

Via

Access Paper or Ask Questions

Mitigating Frequency Bias and Anisotropy in Language Model Pre-Training with Syntactic Smoothing

Oct 15, 2024

Richard Diehl Martinez, Zebulon Goriely, Andrew Caines, Paula Buttery, Lisa Beinborn

Figure 1 for Mitigating Frequency Bias and Anisotropy in Language Model Pre-Training with Syntactic Smoothing

Figure 2 for Mitigating Frequency Bias and Anisotropy in Language Model Pre-Training with Syntactic Smoothing

Figure 3 for Mitigating Frequency Bias and Anisotropy in Language Model Pre-Training with Syntactic Smoothing

Figure 4 for Mitigating Frequency Bias and Anisotropy in Language Model Pre-Training with Syntactic Smoothing

Abstract:Language models strongly rely on frequency information because they maximize the likelihood of tokens during pre-training. As a consequence, language models tend to not generalize well to tokens that are seldom seen during training. Moreover, maximum likelihood training has been discovered to give rise to anisotropy: representations of tokens in a model tend to cluster tightly in a high-dimensional cone, rather than spreading out over their representational capacity. Our work introduces a method for quantifying the frequency bias of a language model by assessing sentence-level perplexity with respect to token-level frequency. We then present a method for reducing the frequency bias of a language model by inducing a syntactic prior over token representations during pre-training. Our Syntactic Smoothing method adjusts the maximum likelihood objective function to distribute the learning signal to syntactically similar tokens. This approach results in better performance on infrequent English tokens and a decrease in anisotropy. We empirically show that the degree of anisotropy in a model correlates with its frequency bias.

Via

Access Paper or Ask Questions

CLIMB: Curriculum Learning for Infant-inspired Model Building

Nov 15, 2023

Richard Diehl Martinez, Zebulon Goriely, Hope McGovern, Christopher Davis, Andrew Caines, Paula Buttery, Lisa Beinborn

Abstract:We describe our team's contribution to the STRICT-SMALL track of the BabyLM Challenge. The challenge requires training a language model from scratch using only a relatively small training dataset of ten million words. We experiment with three variants of cognitively-motivated curriculum learning and analyze their effect on the performance of the model on linguistic evaluation tasks. In the vocabulary curriculum, we analyze methods for constraining the vocabulary in the early stages of training to simulate cognitively more plausible learning curves. In the data curriculum experiments, we vary the order of the training instances based on i) infant-inspired expectations and ii) the learning behavior of the model. In the objective curriculum, we explore different variations of combining the conventional masked language modeling task with a more coarse-grained word class prediction task to reinforce linguistic generalization capabilities. Our results did not yield consistent improvements over our own non-curriculum learning baseline across a range of linguistic benchmarks; however, we do find marginal gains on select tasks. Our analysis highlights key takeaways for specific combinations of tasks and settings which benefit from our proposed curricula. We moreover determine that careful selection of model architecture, and training hyper-parameters yield substantial improvements over the default baselines provided by the BabyLM challenge.

Via

Access Paper or Ask Questions

Attention-based Contextual Language Model Adaptation for Speech Recognition

Jun 02, 2021

Richard Diehl Martinez, Scott Novotney, Ivan Bulyko, Ariya Rastrow, Andreas Stolcke, Ankur Gandhe

Figure 1 for Attention-based Contextual Language Model Adaptation for Speech Recognition

Figure 2 for Attention-based Contextual Language Model Adaptation for Speech Recognition

Figure 3 for Attention-based Contextual Language Model Adaptation for Speech Recognition

Figure 4 for Attention-based Contextual Language Model Adaptation for Speech Recognition

Abstract:Language modeling (LM) for automatic speech recognition (ASR) does not usually incorporate utterance level contextual information. For some domains like voice assistants, however, additional context, such as the time at which an utterance was spoken, provides a rich input signal. We introduce an attention mechanism for training neural speech recognition language models on both text and non-linguistic contextual data. When applied to a large de-identified dataset of utterances collected by a popular voice assistant platform, our method reduces perplexity by 7.0% relative over a standard LM that does not incorporate contextual information. When evaluated on utterances extracted from the long tail of the dataset, our method improves perplexity by 9.0% relative over a standard LM and by over 2.8% relative when compared to a state-of-the-art model for contextual LM.

Via

Access Paper or Ask Questions

Automatically Neutralizing Subjective Bias in Text

Dec 12, 2019

Reid Pryzant, Richard Diehl Martinez, Nathan Dass, Sadao Kurohashi, Dan Jurafsky, Diyi Yang

Figure 1 for Automatically Neutralizing Subjective Bias in Text

Figure 2 for Automatically Neutralizing Subjective Bias in Text

Figure 3 for Automatically Neutralizing Subjective Bias in Text

Figure 4 for Automatically Neutralizing Subjective Bias in Text

Abstract:Texts like news, encyclopedias, and some social media strive for objectivity. Yet bias in the form of inappropriate subjectivity - introducing attitudes via framing, presupposing truth, and casting doubt - remains ubiquitous. This kind of bias erodes our collective trust and fuels social conflict. To address this issue, we introduce a novel testbed for natural language generation: automatically bringing inappropriately subjective text into a neutral point of view ("neutralizing" biased text). We also offer the first parallel corpus of biased language. The corpus contains 180,000 sentence pairs and originates from Wikipedia edits that removed various framings, presuppositions, and attitudes from biased sentences. Last, we propose two strong encoder-decoder baselines for the task. A straightforward yet opaque CONCURRENT system uses a BERT encoder to identify subjective words as part of the generation process. An interpretable and controllable MODULAR algorithm separates these steps, using (1) a BERT-based classifier to identify problematic words and (2) a novel join embedding through which the classifier can edit the hidden states of the encoder. Large-scale human evaluation across four domains (encyclopedias, news headlines, books, and political speeches) suggests that these algorithms are a first step towards the automatic identification and reduction of bias.

* To appear at AAAI 2020

Via

Access Paper or Ask Questions

Using General Adversarial Networks for Marketing: A Case Study of Airbnb

Jun 29, 2018

Richard Diehl Martinez, John Kaleialoha Kamalu

Figure 1 for Using General Adversarial Networks for Marketing: A Case Study of Airbnb

Figure 2 for Using General Adversarial Networks for Marketing: A Case Study of Airbnb

Figure 3 for Using General Adversarial Networks for Marketing: A Case Study of Airbnb

Figure 4 for Using General Adversarial Networks for Marketing: A Case Study of Airbnb

Abstract:In this paper, we examine the use case of general adversarial networks (GANs) in the field of marketing. In particular, we analyze how GAN models can replicate text patterns from successful product listings on Airbnb, a peer-to-peer online market for short-term apartment rentals. To do so, we define the Diehl-Martinez-Kamalu (DMK) loss function as a new class of functions that forces the model's generated output to include a set of user-defined keywords. This allows the general adversarial network to recommend a way of rewording the phrasing of a listing description to increase the likelihood that it is booked. Although we tailor our analysis to Airbnb data, we believe this framework establishes a more general model for how generative algorithms can be used to produce text samples for the purposes of marketing.

Via

Access Paper or Ask Questions