Catherine Arnett

Weight Tying Biases Token Embeddings Towards the Output Space

Mar 27, 2026

How Open Must Language Models be to Enable Reliable Scientific Inference?

Mar 27, 2026

CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data

Jan 25, 2026

Evaluating Morphological Alignment of Tokenizers in 70 Languages

Jul 08, 2025

BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization

May 30, 2025

On the Acquisition of Shared Grammatical Representations in Bilingual Language Models

Mar 05, 2025

Why do language models perform worse for morphologically complex languages?

Nov 21, 2024

Toxicity of the Commons: Curating Open-Source Pre-Training Data

Oct 29, 2024

BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training

Sep 06, 2024

Goldfish: Monolingual Language Models for 350 Languages

Aug 19, 2024