
Colin Raffel


FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Jun 26, 2025

The Butterfly Effect: Neural Network Training Trajectories Are Highly Sensitive to Initial Conditions

Jun 16, 2025

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Jun 05, 2025

Enhancing Training Data Attribution with Representational Optimization

May 24, 2025

Position: The Most Expensive Part of an LLM should be its Training Data

Apr 16, 2025

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Feb 04, 2025

AttriBoT: A Bag of Tricks for Efficiently Approximating Leave-One-Out Context Attribution

Nov 22, 2024

Realistic Evaluation of Model Merging for Compositional Generalization

Sep 26, 2024

A Survey on Model MoErging: Recycling and Routing Among Specialized Experts for Collaborative Learning

Aug 13, 2024

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Jun 25, 2024