Title: Improving Continual Pre-training Through Seamless Data Packing

May 29, 2025

Ruicheng Yin, Xuan Gao, Changze Lv, Xiaohua Wang, Xiaoqing Zheng, Xuanjing Huang

Abstract: Continual pre-training has demonstrated significant potential in enhancing model performance, particularly in domain-specific scenarios. The most common approach for packing data before continual pre-training involves concatenating input texts and splitting them into fixed-length sequences. While straightforward and efficient, this method often leads to excessive truncation and context discontinuity, which can hinder model performance. To address these issues, we explore the potential of data engineering to enhance continual pre-training, particularly its impact on model performance and efficiency. We propose Seamless Packing (SP), a novel data packing strategy aimed at preserving contextual information more effectively and enhancing model performance. In the first stage, our approach employs a sliding window technique that synchronizes overlapping tokens across consecutive sequences, ensuring better continuity and contextual coherence. In the second stage, we adopt a First-Fit-Decreasing algorithm to pack shorter texts into bins slightly larger than the target sequence length, thereby minimizing padding and truncation. Empirical evaluations across various model architectures and corpus domains demonstrate the effectiveness of our method, which outperforms the baseline method in 99% of all settings. Code is available at https://github.com/Infernus-WIND/Seamless-Packing.
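To make the second stage concrete, here is a minimal sketch of generic First-Fit-Decreasing bin packing applied to text lengths. It is not taken from the authors' repository: the function name `first_fit_decreasing`, the `slack` value, and the toy lengths are assumptions chosen for illustration, and the abstract does not say how much larger than the target sequence length the bins are.

```python
from typing import List


def first_fit_decreasing(lengths: List[int], capacity: int) -> List[List[int]]:
    """Pack text indices into bins holding at most `capacity` tokens each."""
    if any(n > capacity for n in lengths):
        raise ValueError("every text must fit in a single bin")

    # Visit texts from longest to shortest (the "decreasing" part).
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)

    bins: List[List[int]] = []  # text indices assigned to each bin
    free: List[int] = []        # remaining token capacity of each bin

    for i in order:
        # Place the text in the first bin with enough room (the "first-fit" part).
        for b, room in enumerate(free):
            if lengths[i] <= room:
                bins[b].append(i)
                free[b] -= lengths[i]
                break
        else:
            # No existing bin fits, so open a new one.
            bins.append([i])
            free.append(capacity - lengths[i])
    return bins


seq_len = 8                     # target sequence length (toy value)
slack = 2                       # hypothetical extra room per bin, not from the paper
lengths = [5, 7, 3, 2, 4, 6]    # token counts of six short texts (toy data)

packed = first_fit_decreasing(lengths, seq_len + slack)
print(packed)  # [[1, 2], [5, 4], [0, 3]]
```

In this toy run the 27 tokens fit into three bins of capacity 10. Each bin would then have to be trimmed or padded to the target length; how Seamless Packing handles that step is not described in the abstract.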

* Accepted to ACL 2025 Findings
