Sainbayar Sukhbaatar

CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks

Jul 31, 2025
Via arXiv

Bridging Offline and Online Reinforcement Learning for LLMs

Jun 26, 2025
Via arXiv

Multi-Token Attention

Apr 01, 2025
Via arXiv

SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks

Mar 19, 2025
Via arXiv

Diverse Preference Optimization

Jan 31, 2025
Via arXiv

R.I.P.: Better Models by Survival of the Fittest Prompts

Jan 30, 2025
Via arXiv

Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback

Jan 18, 2025
Via arXiv

Training Large Language Models to Reason in a Continuous Latent Space

Dec 09, 2024
Via arXiv

Adaptive Decoding via Latent Preference Optimization

Nov 14, 2024
Via arXiv

Self-Consistency Preference Optimization

Nov 06, 2024
Via arXiv