Title: Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning

Jul 25, 2024

Tianduo Wang, Shichen Li, Wei Lu

Abstract: Effective training of language models (LMs) for mathematical reasoning tasks demands high-quality supervised fine-tuning data. Besides obtaining annotations from human experts, a common alternative is sampling from larger and more powerful LMs. However, this knowledge distillation approach can be costly and unstable, particularly when relying on closed-source, proprietary LMs like GPT-4, whose behaviors are often unpredictable. In this work, we demonstrate that the reasoning abilities of small-scale LMs can be enhanced through self-training, a process in which models learn from their own outputs. We also show that conventional self-training can be further augmented by a preference learning algorithm called Direct Preference Optimization (DPO). By integrating DPO into self-training, we leverage preference data to guide LMs towards more accurate and diverse chain-of-thought reasoning. We evaluate our method across various mathematical reasoning tasks using different base models. Our experiments show that this approach not only improves LMs' reasoning performance but also offers a more cost-effective and scalable solution than relying on large proprietary LMs.

* ACL 2024. Code and data are available at https://github.com/TianduoWang/DPO-ST
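
To make the abstract's central idea concrete, the following is a minimal sketch of how self-generated outputs might be turned into preference data and scored with the standard DPO objective. It is not the authors' implementation (see the repository above for that). The helper names `build_preference_pairs` and `dpo_loss`, the rule that a rationale is "chosen" when its final answer matches the gold answer, and the default `beta=0.1` are all assumptions made for illustration.

```python
import math


def build_preference_pairs(samples, gold_answer):
    """Pair self-generated rationales by answer correctness.

    `samples` is a list of (rationale, final_answer) tuples drawn from the
    model itself. Assumption for this sketch: a rationale is preferred when
    its final answer equals `gold_answer`, and rejected otherwise.
    """
    correct = [r for r, a in samples if a == gold_answer]
    wrong = [r for r, a in samples if a != gold_answer]
    # Every (chosen, rejected) combination becomes one preference pair.
    return [(w, l) for w in correct for l in wrong]


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, from sequence log-probabilities.

    loss = -log sigmoid(beta * [(log pi(y_w) - log pi_ref(y_w))
                                - (log pi(y_l) - log pi_ref(y_l))])
    """
    z = beta * ((policy_chosen_logp - ref_chosen_logp)
                - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(z)) == softplus(-z), computed in a numerically stable way.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Hypothetical samples for a toy problem whose gold answer is "4".
samples = [("2 + 2 = 4", "4"), ("2 + 2 = 5", "5"), ("2 * 2 = 4", "4")]
pairs = build_preference_pairs(samples, gold_answer="4")
```

When the policy and the reference model assign identical log-probabilities, the margin is zero and the loss equals log 2; the loss falls as the policy raises the chosen rationale's likelihood relative to the rejected one.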
