Xiaoxi Luo

Yuanpei College, Peking University

Pretraining Language Models on Historical Text

Jun 02, 2026

Xiaoxi Luo, Zachary Shinnick, Niclas Griesshaber, Yixuan Wang, Junchi Yu, Freda Shi, Philip Torr, Yao Lu

Abstract: We introduce TypewriterLM, a 7.24B History language model (LM) trained exclusively on English text predating 1913. Developing History LMs requires addressing challenges in data quality and availability, preventing temporal leakage, designing temporally consistent post-training pipelines, and constructing reliable evaluations. To address these issues, we construct TypewriterCorpus, a 54B-token historical corpus collected from diverse archival and linguistically annotated sources with extensive data cleaning and leakage mitigation procedures. Furthermore, we introduce lexically grounded instruction tuning, a post-training framework that constrains responses to remain directly grounded in historical source documents. Using this framework, we construct two historical instruction tuning datasets: History-LIMA and History-SelfInstruct. To evaluate capability and temporal consistency, we introduce History-Event, a benchmark suite for evaluating competence, temporal grounding, and data leakage. We release TypewriterLM and all associated resources to support future research on historical language models.

Phonetic Reconstruction of the Consonant System of Middle Chinese via Mixed Integer Optimization

Feb 07, 2025

Weiwei Sun, Xiaoxi Luo

Abstract: This paper is concerned with phonetic reconstruction of the consonant system of Middle Chinese. We propose to cast the problem as a Mixed Integer Programming problem, which is able to automatically explore homophonic information from ancient rhyme dictionaries and phonetic information from modern Chinese dialects, the descendants of Middle Chinese. Numerical evaluation on a wide range of synthetic and real data demonstrates the effectiveness and robustness of the new method. We apply the method to information from Guangyun and 20 modern Chinese dialects to obtain a new phonetic reconstruction result. A linguistically motivated discussion of this result is also provided.

* accepted by TACL
