Fei Bai

From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR

Aug 11, 2025

Jia Deng, Jie Chen, Zhipeng Chen, Daixuan Cheng, Fei Bai, Beichen Zhang, Yinqian Min, Yanzipeng Gao, Wayne Xin Zhao, Ji-Rong Wen

Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs). Unlike traditional RL approaches, RLVR leverages rule-based feedback to guide LLMs in generating and refining complex reasoning chains, a process critically dependent on effective exploration strategies. While prior work has demonstrated RLVR's empirical success, the fundamental mechanisms governing LLMs' exploration behaviors remain underexplored. This technical report presents a systematic investigation of exploration capacities in RLVR, covering three main aspects: (1) exploration space shaping, where we develop quantitative metrics to characterize LLMs' capability boundaries; (2) entropy-performance exchange, analyzed across training stages, individual instances, and token-level patterns; and (3) RL performance optimization, examining methods to effectively translate exploration gains into measurable improvements. By unifying previously identified insights with new empirical evidence, this work aims to provide a foundational framework for advancing RLVR systems.

* 27 pages, 25 figures. arXiv admin note: text overlap with arXiv:2508.02260
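
The abstract above refers to entropy at the token level without defining the metric. As a minimal sketch, assuming the usual Shannon entropy of a policy's next-token distribution (the function names here are hypothetical and not taken from the paper), the quantity could be computed from raw logits like this:

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over one token's logits."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_sequence_entropy(step_logits):
    """Average token entropy over the generation steps of one response."""
    return sum(token_entropy(logits) for logits in step_logits) / len(step_logits)
```

A uniform distribution over the vocabulary gives the maximum value (the log of the vocabulary size), while a sharply peaked distribution gives a value near zero. This is the sense in which low entropy is usually read as reduced exploration.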

Towards Effective Code-Integrated Reasoning

May 30, 2025

Fei Bai, Yingqian Min, Beichen Zhang, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, Zheng Liu, Zhongyuan Wang, Ji-Rong Wen

Abstract: In this paper, we investigate code-integrated reasoning, where models generate code when necessary and integrate feedback by executing it through a code interpreter. To acquire this capability, models must learn when and how to use external code tools effectively, which is supported by tool-augmented reinforcement learning (RL) through interactive learning. Despite its benefits, tool-augmented RL can still suffer from potential instability in the learning dynamics. In light of this challenge, we present a systematic approach to improving the training effectiveness and stability of tool-augmented RL for code-integrated reasoning. Specifically, we develop enhanced training strategies that balance exploration and stability, progressively building tool-use capabilities while improving reasoning performance. Through extensive experiments on five mainstream mathematical reasoning benchmarks, our model demonstrates significant performance improvements over multiple competitive baselines. Furthermore, we conduct an in-depth analysis of the mechanism and effect of code-integrated reasoning, revealing several key insights, such as the extension of the model's capability boundaries and the simultaneous improvement of reasoning efficiency through code integration. All data and code for reproducing this work are available at: https://github.com/RUCAIBox/CIR.

* Technical Report on Slow Thinking with LLMs: Code-Integrated Reasoning
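
The generate, execute, and integrate loop described in the abstract above can be sketched roughly as follows. This is an illustration only, not the authors' implementation: the `generate` callable stands in for a language model, and the `<code>...</code>` delimiters and `[interpreter output]` marker are hypothetical conventions chosen for this example.

```python
import contextlib
import io
import re

CODE_RE = re.compile(r"<code>(.*?)</code>", re.DOTALL)  # hypothetical delimiter

def run_snippet(code):
    """Execute a snippet and return its stdout, or the error text on failure."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})  # no sandboxing here; illustration only
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()

def code_integrated_reasoning(generate, question, max_turns=4):
    """Alternate between model steps and interpreter feedback until no code is emitted."""
    context = question
    step = ""
    for _ in range(max_turns):
        step = generate(context)          # model decides whether to write code
        context += "\n" + step
        match = CODE_RE.search(step)
        if match is None:                 # no code requested: treat as the final answer
            return step
        feedback = run_snippet(match.group(1))
        context += "\n[interpreter output]\n" + feedback
    return step                           # turn budget exhausted
```

Any callable that maps the running context to the next reasoning step can be passed as `generate`. A real system would replace `exec` with an isolated interpreter.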

SimpleDeepSearcher: Deep Information Seeking via Web-Powered Reasoning Trajectory Synthesis

May 22, 2025

Shuang Sun, Huatong Song, Yuhao Wang, Ruiyang Ren, Jinhao Jiang, Junjie Zhang, Fei Bai, Jia Deng, Wayne Xin Zhao, Zheng Liu (+3 more)

Abstract: Retrieval-augmented generation (RAG) systems have advanced large language models (LLMs) in complex deep search scenarios requiring multi-step reasoning and iterative information retrieval. However, existing approaches face critical limitations: they either lack high-quality training trajectories or suffer from distributional mismatches in simulated environments and prohibitive computational costs for real-world deployment. This paper introduces SimpleDeepSearcher, a lightweight yet effective framework that bridges this gap through strategic data engineering rather than complex training paradigms. Our approach synthesizes high-quality training data by simulating realistic user interactions in live web search environments, coupled with a multi-criteria curation strategy that optimizes the diversity and quality of both the input and output sides. Experiments on five benchmarks across diverse domains demonstrate that SFT on only 871 curated samples yields significant improvements over RL-based baselines. Our work establishes SFT as a viable pathway by systematically addressing the data-scarcity bottleneck, offering practical insights for efficient deep search systems. Our code is available at https://github.com/RUCAIBox/SimpleDeepSearcher.

Height-Dependent LoS Probability Model for A2G MmWave Communications under Built-up Scenarios

Sep 06, 2021

Minghui Pang, Qiuming Zhu, Fei Bai, Zhuo Li, Hanpeng Li, Kai Mao, Yue Tian

Abstract: Based on the three-dimensional propagation characteristics of built-up scenarios, this paper proposes a height-dependent line-of-sight (LoS) probability model for air-to-ground (A2G) millimeter wave (mmWave) communications. With comprehensive consideration of scenario factors, i.e., building height distribution, building width, building space, and the heights of transceivers, this paper upgrades the prediction method of the International Telecommunication Union-Radio (ITU-R) standard to cover both low-altitude and high-altitude cases. To speed up the LoS probability prediction, an approximate parametric model is also developed based on the theoretical expression. The simulation results based on the ray-tracing (RT) method show that the proposed model is consistent with existing models at low altitudes and performs better at high altitudes. The new model can be used for A2G channel modeling and performance analysis such as cell coverage, outage probability, and bit error rate of A2G communication systems.

* 6 pages, 7 figures, conference
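
To make the role of the scenario factors in the abstract above concrete, here is a toy two-dimensional Monte Carlo estimate of LoS probability. It is a sketch under strong assumptions and is not the paper's model: buildings are treated as thin, evenly spaced obstacles along the ground path, their heights are drawn from a Rayleigh distribution with an assumed scale `gamma`, and building width and Fresnel clearance are ignored. All function and parameter names are hypothetical.

```python
import math
import random

def has_los(h_tx, h_rx, distance, buildings):
    """True if the straight ray clears every building.

    buildings: iterable of (x, height), with x the horizontal distance
    from the receiver in metres.
    """
    for x, height in buildings:
        ray_height = h_rx + (h_tx - h_rx) * x / distance
        if height >= ray_height:
            return False
    return True

def estimate_los_probability(h_tx, h_rx, distance, n_buildings, gamma,
                             trials=5000, seed=0):
    """Fraction of random building layouts in which the ray is unobstructed."""
    rng = random.Random(seed)
    spacing = distance / (n_buildings + 1)  # assumed even spacing
    clear = 0
    for _ in range(trials):
        # Rayleigh-distributed heights via inverse-transform sampling (assumption).
        heights = [gamma * math.sqrt(-2.0 * math.log(1.0 - rng.random()))
                   for _ in range(n_buildings)]
        layout = [((i + 1) * spacing, h) for i, h in enumerate(heights)]
        clear += has_los(h_tx, h_rx, distance, layout)
    return clear / trials
```

With the same seed, raising the transmitter lifts the ray above every obstacle position, so the estimate can only stay equal or increase. This matches the qualitative height dependence the abstract describes.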

Geometry-Based Stochastic Line-of-Sight Probability Model for A2G Channels under Urban Scenarios

Sep 06, 2021

Qiuming Zhu, Fei Bai, Minghui Pang, Jie Li, Weizhi Zhong, Xiaomin Chen, Kai Mao

Abstract: A line-of-sight (LoS) path is essential for the reliability of air-to-ground (A2G) communications, but its existence is difficult to predict due to random obstacles on the ground. Based on statistical geographic information and the Fresnel clearance zone, a general stochastic LoS probability model for three-dimensional (3D) A2G channels under urban scenarios is developed. By considering factors such as building height distribution, building width, building space, carrier frequency, and transceiver heights, the proposed model is suitable for different frequencies and altitudes. Moreover, to obtain a closed-form expression and reduce the computational complexity, an approximate parametric model is also built, with a machine-learning (ML) method used to estimate the model parameters. The simulation results show that the proposed model is consistent with existing models at low altitudes. As the altitude increases, it achieves better agreement with the ray-tracing Monte-Carlo simulation data. The analytical results of the proposed model are helpful for channel modeling and performance analysis such as cell coverage, outage probability, and bit error rate in A2G communications.

* 10 pages and 12 figures
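
The abstract above ties LoS to Fresnel clearance and carrier frequency. As background, the standard first Fresnel zone radius at a point along the link can be computed as below. This is the textbook formula, not the paper's model, and the function name is hypothetical.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # metres per second

def first_fresnel_radius(frequency_hz, d1_m, d2_m):
    """Radius (m) of the first Fresnel zone at a point d1 from one end and d2 from the other."""
    wavelength = SPEED_OF_LIGHT / frequency_hz
    return math.sqrt(wavelength * d1_m * d2_m / (d1_m + d2_m))
```

At the midpoint of a 1 km link the radius is roughly 1.6 m at 28 GHz, and it shrinks as the carrier frequency rises. This is one reason the clearance needed for an unobstructed path depends on frequency.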
