Picture for Alex Fang

Alex Fang

The Finetuner's Fallacy: When to Pretrain with Your Finetuning Data

Add code
Mar 17, 2026
Viaarxiv icon

ÜberWeb: Insights from Multilingual Curation for a 20-Trillion-Token Dataset

Add code
Feb 16, 2026
Viaarxiv icon

DatBench: Discriminative, Faithful, and Efficient VLM Evaluations

Add code
Jan 05, 2026
Viaarxiv icon

Luxical: High-Speed Lexical-Dense Text Embeddings

Add code
Dec 11, 2025
Viaarxiv icon

Reusing Pre-Training Data at Test Time is a Compute Multiplier

Add code
Nov 06, 2025
Figure 1 for Reusing Pre-Training Data at Test Time is a Compute Multiplier
Figure 2 for Reusing Pre-Training Data at Test Time is a Compute Multiplier
Figure 3 for Reusing Pre-Training Data at Test Time is a Compute Multiplier
Figure 4 for Reusing Pre-Training Data at Test Time is a Compute Multiplier
Viaarxiv icon

Language Models Improve When Pretraining Data Matches Target Tasks

Add code
Jul 16, 2025
Viaarxiv icon

Datasets, Documents, and Repetitions: The Practicalities of Unequal Data Quality

Add code
Mar 10, 2025
Viaarxiv icon

DataComp-LM: In search of the next generation of training sets for language models

Add code
Jun 18, 2024
Figure 1 for DataComp-LM: In search of the next generation of training sets for language models
Figure 2 for DataComp-LM: In search of the next generation of training sets for language models
Figure 3 for DataComp-LM: In search of the next generation of training sets for language models
Figure 4 for DataComp-LM: In search of the next generation of training sets for language models
Viaarxiv icon

CLIPLoss and Norm-Based Data Selection Methods for Multimodal Contrastive Learning

Add code
May 29, 2024
Viaarxiv icon

URDFormer: A Pipeline for Constructing Articulated Simulation Environments from Real-World Images

Add code
May 19, 2024
Figure 1 for URDFormer: A Pipeline for Constructing Articulated Simulation Environments from Real-World Images
Figure 2 for URDFormer: A Pipeline for Constructing Articulated Simulation Environments from Real-World Images
Figure 3 for URDFormer: A Pipeline for Constructing Articulated Simulation Environments from Real-World Images
Figure 4 for URDFormer: A Pipeline for Constructing Articulated Simulation Environments from Real-World Images
Viaarxiv icon