Dawid J. Kopiczko

Data Repetition Beats Data Scaling in Long-CoT Supervised Fine-Tuning

Feb 11, 2026

Dawid J. Kopiczko, Sagar Vaze, Tijmen Blankevoort, Yuki M. Asano

Abstract: Supervised fine-tuning (SFT) on chain-of-thought data is an essential post-training step for reasoning language models. Standard machine learning intuition suggests that training with more unique training samples yields better generalization. Counterintuitively, we show that SFT benefits from repetition: under a fixed update budget, training for more epochs on smaller datasets outperforms single-epoch training on larger datasets. On AIME'24/25 and GPQA benchmarks, Olmo3-7B trained for 128 epochs on 400 samples outperforms the equivalent 1 epoch on 51,200 samples by 12-26 percentage points, with no additional catastrophic forgetting. We find that training token accuracy reliably signals when repetition has saturated; improvements from additional epochs plateau at full memorization, a pattern consistent across all settings. These findings provide a practical approach for reasoning SFT, where scaling epochs with token accuracy as a stopping criterion can replace expensive undirected data scaling. We pose the repetition advantage, where full memorization coincides with improved generalization, as a new open problem for the community in understanding the training dynamics of large language models.
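To make the fixed update budget concrete, here is a minimal Python sketch of the arithmetic behind the comparison, plus one way training token accuracy might be used as a stopping signal. The function names and the 0.99 threshold are assumptions for illustration only; the abstract does not specify a threshold or an implementation.

```python
from typing import Sequence


def epochs_for_budget(update_budget: int, dataset_size: int) -> int:
    """Epochs needed to spend `update_budget` sample updates on `dataset_size` samples."""
    if update_budget <= 0 or dataset_size <= 0 or update_budget % dataset_size != 0:
        raise ValueError("budget must be a positive multiple of the dataset size")
    return update_budget // dataset_size


def token_accuracy(predicted: Sequence[int], target: Sequence[int]) -> float:
    """Fraction of positions where the predicted token id equals the target id."""
    if len(predicted) != len(target) or not target:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(p == t for p, t in zip(predicted, target))
    return matches / len(target)


def repetition_saturated(train_token_acc: float, threshold: float = 0.99) -> bool:
    """Hypothetical stopping rule: stop once training token accuracy nears 1.0.

    The threshold value is an assumption, not taken from the paper.
    """
    return train_token_acc >= threshold


# Both settings from the abstract spend the same number of sample updates.
budget = 51200
small_data_epochs = epochs_for_budget(budget, 400)    # many epochs, few samples
large_data_epochs = epochs_for_budget(budget, 51200)  # one epoch, many samples
```

Under this reading, the two runs differ only in how the 51,200 sample updates are split between unique samples and repeated passes.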


Bitune: Bidirectional Instruction-Tuning

May 23, 2024

Dawid J. Kopiczko, Tijmen Blankevoort, Yuki M. Asano

Abstract: We introduce Bitune, a method that improves instruction-tuning of pretrained decoder-only large language models, leading to consistent gains on downstream tasks. Bitune applies both causal and bidirectional attention to the prompt, to obtain a better representation of the query or instruction. We realize this by introducing two sets of parameters, for which we apply parameter-efficient finetuning techniques. These causal and bidirectional features are then combined into a weighted average with trainable coefficients, which is subsequently used to generate new tokens. We demonstrate significant improvements in zero-shot performance on commonsense reasoning, arithmetic, and language understanding tasks, while extensive ablation studies validate the role of each component and demonstrate the method's agnosticism to different PEFT techniques.
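The weighted average of causal and bidirectional features can be illustrated with a small Python sketch. The abstract does not give the exact parameterization, so this assumes one raw coefficient per feature dimension, squashed to (0, 1) with a sigmoid; `mix_features` and `raw_coeffs` are hypothetical names, and in a real model the coefficients would be trained by gradient descent rather than set by hand.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Map a raw (unbounded) coefficient to a weight in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def mix_features(
    causal: Sequence[float],
    bidirectional: Sequence[float],
    raw_coeffs: Sequence[float],
) -> List[float]:
    """Per-dimension weighted average of causal and bidirectional features.

    Assumed form: mixed = (1 - w) * causal + w * bidirectional,
    with w = sigmoid(raw_coeff). This is an illustrative choice,
    not necessarily the parameterization used in Bitune.
    """
    if not (len(causal) == len(bidirectional) == len(raw_coeffs)):
        raise ValueError("all inputs must have the same length")
    mixed = []
    for c, b, raw in zip(causal, bidirectional, raw_coeffs):
        w = sigmoid(raw)
        mixed.append((1.0 - w) * c + w * b)
    return mixed


# With raw coefficients of 0, each weight is 0.5, so the result is the plain mean.
even_mix = mix_features([1.0, 2.0], [3.0, 6.0], [0.0, 0.0])
```

A very negative raw coefficient would leave the causal feature nearly unchanged, which is one way such a scheme could start close to the pretrained model's behavior; whether Bitune initializes this way is not stated in the abstract.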
