Songjun Tu

SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning

Jun 24, 2025

AlphaDecay: Module-wise Weight Decay for Heavy-Tailed Balancing in LLMs

Jun 17, 2025

Learning When to Think: Shaping Adaptive Reasoning in R1-Style Models via Multi-Stage RL

May 16, 2025

Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation

Mar 17, 2025

Online Preference-based Reinforcement Learning with Self-augmented Feedback from Large Language Model

Dec 22, 2024

In-Dataset Trajectory Return Regularization for Offline Preference-based Reinforcement Learning

Dec 12, 2024