Zicheng Xu

Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

Apr 09, 2026

Feng Luo, Yu-Neng Chuang, Guanchu Wang, Zicheng Xu, Xiaotian Han, Tianyi Zhang, Vladimir Braverman

Abstract: On-policy distillation (OPD) trains student models under their own induced distribution while leveraging supervision from stronger teachers. We identify a failure mode of OPD: as training progresses, on-policy rollouts can undergo abrupt length inflation, causing truncated trajectories to dominate the training data. This truncation collapse coincides with abrupt repetition saturation and induces biased gradient signals, leading to severe training instability and sharp degradation in validation performance. We attribute this problem to the interaction between student-induced data collection and the distillation objective, which implicitly favors long and repetitive rollouts. To address this issue, we propose StableOPD, a stabilized OPD framework that combines a reference-based divergence constraint with rollout mixture distillation. These together mitigate repetition-induced length inflation and further stabilize OPD training. Across multiple math reasoning datasets, our approach prevents truncation collapse, stabilizes training dynamics, and improves performance by 7.2% on average.

Physics of Language Models: Part 2.2, How to Learn From Mistakes on Grade-School Math Problems

Aug 29, 2024

Tian Ye, Zicheng Xu, Yuanzhi Li, Zeyuan Allen-Zhu

Abstract: Language models have demonstrated remarkable performance in solving reasoning tasks; however, even the strongest models still occasionally make reasoning mistakes. Recently, there has been active research aimed at improving reasoning accuracy, particularly by using pretrained language models to "self-correct" their mistakes via multi-round prompting. In this paper, we follow this line of work but focus on understanding the usefulness of incorporating "error-correction" data directly into the pretraining stage. This data consists of erroneous solution steps immediately followed by their corrections. Using a synthetic math dataset, we show promising results: this type of pretrain data can help language models achieve higher reasoning accuracy directly (i.e., through simple auto-regression, without multi-round prompting) compared to pretraining on the same amount of error-free data. We also delve into many details, such as (1) how this approach differs from beam search, (2) how such data can be prepared, (3) whether masking is needed on the erroneous tokens, (4) the amount of error required, (5) whether such data can be deferred to the fine-tuning stage, and many others.

* arXiv admin note: text overlap with arXiv:2407.20311

Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process

Jul 29, 2024

Tian Ye, Zicheng Xu, Yuanzhi Li, Zeyuan Allen-Zhu

Abstract: Recent advances in language models have demonstrated their capability to solve mathematical reasoning problems, achieving near-perfect accuracy on grade-school level math benchmarks like GSM8K. In this paper, we formally study how language models solve these problems. We design a series of controlled experiments to address several fundamental questions: (1) Can language models truly develop reasoning skills, or do they simply memorize templates? (2) What is the model's hidden (mental) reasoning process? (3) Do models solve math questions using skills similar to or different from humans? (4) Do models trained on GSM8K-like datasets develop reasoning skills beyond those necessary for solving GSM8K problems? (5) What mental process causes models to make reasoning mistakes? (6) How large or deep must a model be to effectively solve GSM8K-level math questions? Our study uncovers many hidden mechanisms by which language models solve mathematical questions, providing insights that extend beyond current understandings of LLMs.

* video appeared in ICML 2024 tutorial
