Elie Bakouch

How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data

Apr 15, 2026
Via arXiv

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Jun 05, 2025
Via arXiv

SmolVLM: Redefining small and efficient multimodal models

Apr 07, 2025
Via arXiv

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Feb 04, 2025
Via arXiv

Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations

May 29, 2024
Via arXiv