Colin Raffel

Shammie

How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data

Apr 15, 2026
Via arXiv

Model Merging via Data-Free Covariance Estimation

Apr 01, 2026
Via arXiv

The Appeal and Reality of Recycling LoRAs with Adaptive Merging

Feb 12, 2026
Via arXiv

FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale

Jan 29, 2026
Via arXiv

Efficiently Estimating Data Efficiency for Language Model Fine-tuning

Dec 31, 2025
Via arXiv

TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior

Dec 23, 2025
Via arXiv

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Jun 26, 2025
Via arXiv

The Butterfly Effect: Neural Network Training Trajectories Are Highly Sensitive to Initial Conditions

Jun 16, 2025
Via arXiv

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Jun 05, 2025
Via arXiv

Enhancing Training Data Attribution with Representational Optimization

May 24, 2025
Via arXiv