Songjun Tu

π-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data

Apr 15, 2026

Saliency-Guided Representation with Consistency Policy Learning for Visual Unsupervised Reinforcement Learning

Apr 07, 2026

Dynamic Dual-Granularity Skill Bank for Agentic RL

Mar 30, 2026

SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning

Jun 24, 2025

AlphaDecay: Module-wise Weight Decay for Heavy-Tailed Balancing in LLMs

Jun 17, 2025

Learning When to Think: Shaping Adaptive Reasoning in R1-Style Models via Multi-Stage RL

May 16, 2025

Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation

Mar 17, 2025

Online Preference-based Reinforcement Learning with Self-augmented Feedback from Large Language Model

Dec 22, 2024

In-Dataset Trajectory Return Regularization for Offline Preference-based Reinforcement Learning

Dec 12, 2024