Feiyang Kang

Demystifying Synthetic Data in LLM Pre-training: A Systematic Study of Scaling Laws, Benefits, and Pitfalls

Oct 02, 2025

Quagmires in SFT-RL Post-Training: When High SFT Scores Mislead and What to Use Instead

Oct 02, 2025

AutoScale: Automatic Prediction of Compute-optimal Data Composition for Training LLMs

Jul 29, 2024

Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs

May 05, 2024

FASTTRACK: Fast and Accurate Fact Tracing for LLMs

Apr 22, 2024

The Mirrored Influence Hypothesis: Efficient Data Influence Estimation by Harnessing Forward Passes

Feb 14, 2024

Data Acquisition: A New Frontier in Data-centric AI

Nov 22, 2023

Performance Scaling via Optimal Transport: Enabling Data Selection from Partially Revealed Sources

Jul 05, 2023

LAVA: Data Valuation without Pre-Specified Learning Algorithms

Apr 28, 2023