Title: Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling

Nov 13, 2025

Jiahao Wang, Weiye Xu, Aijun Yang, Wengang Zhou, Lewei Lu, Houqiang Li, Xiaohua Wang, Jinguo Zhu

Abstract: Outcome-reward reinforcement learning (RL) is a common and increasingly significant way to refine the step-by-step reasoning of multimodal large language models (MLLMs). In the multiple-choice setting, a dominant format for multimodal reasoning benchmarks, the paradigm faces a significant yet often overlooked obstacle: unfaithful trajectories that guess the correct option after a faulty chain of thought receive the same reward as genuine reasoning. We propose Self-Consistency Sampling (SCS) to correct this issue. For each question, SCS (i) introduces small visual perturbations and (ii) performs repeated truncation and resampling of an initial trajectory; agreement among the resulting trajectories yields a differentiable consistency score that down-weights unreliable traces during policy updates. Based on Qwen2.5-VL-7B-Instruct, plugging SCS into RLOO, GRPO, and REINFORCE++ series improves accuracy by up to 7.7 percentage points on six multimodal benchmarks with negligible extra computation. SCS also yields notable gains on both Qwen2.5-VL-3B-Instruct and InternVL3-8B, offering a simple, general remedy for outcome-reward RL in MLLMs.
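
To illustrate the idea of down-weighting by agreement, here is a minimal sketch. It is not the paper's formulation: the abstract gives no formula and describes the score as differentiable, whereas this sketch substitutes a plain agreement fraction. The function names `consistency_score` and `weighted_reward`, the treatment of an empty resample set, and the multiplicative combination with the outcome reward are all assumptions made for illustration.

```python
from typing import Sequence


def consistency_score(initial_answer: str, resampled_answers: Sequence[str]) -> float:
    """Fraction of resampled trajectories whose final answer matches the initial one.

    Hypothetical stand-in for the consistency score described in the abstract.
    """
    if not resampled_answers:
        # Assumption: with no resamples there is no evidence, so do not down-weight.
        return 1.0
    agree = sum(1 for answer in resampled_answers if answer == initial_answer)
    return agree / len(resampled_answers)


def weighted_reward(
    outcome_reward: float,
    initial_answer: str,
    resampled_answers: Sequence[str],
) -> float:
    """Scale the outcome reward by the agreement fraction (assumed combination rule)."""
    return outcome_reward * consistency_score(initial_answer, resampled_answers)


# A trajectory whose answer "B" survives most resamples keeps most of its reward;
# one that rarely reproduces its own answer (a likely lucky guess) keeps little.
stable = weighted_reward(1.0, "B", ["B", "B", "A", "B"])
lucky = weighted_reward(1.0, "B", ["A", "C", "B", "D"])
```

In this sketch, the resampled answers would come from continuations of truncated prefixes of the initial trajectory, generated on slightly perturbed images, as the abstract describes. How those continuations are produced is left out here.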

* Accepted to NeurIPS 2025 (The Thirty-Ninth Annual Conference on Neural Information Processing Systems)
