Picture for Guilherme Penedo

Guilherme Penedo

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Add code
Jun 26, 2025
Viaarxiv icon

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Add code
Jun 05, 2025
Viaarxiv icon

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Add code
Feb 04, 2025
Viaarxiv icon

Towards Best Practices for Open Datasets for LLM Training

Add code
Jan 14, 2025
Viaarxiv icon

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Add code
Jun 25, 2024
Viaarxiv icon

The Falcon Series of Open Language Models

Add code
Nov 29, 2023
Figure 1 for The Falcon Series of Open Language Models
Figure 2 for The Falcon Series of Open Language Models
Figure 3 for The Falcon Series of Open Language Models
Figure 4 for The Falcon Series of Open Language Models
Viaarxiv icon

The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

Add code
Jun 01, 2023
Figure 1 for The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
Figure 2 for The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
Figure 3 for The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
Figure 4 for The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
Viaarxiv icon