Guilherme Penedo

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Jun 25, 2024
Via arXiv

The Falcon Series of Open Language Models

Nov 29, 2023
Via arXiv

The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

Jun 01, 2023
Via arXiv