Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yunfan Zhou

SPADER: Step-wise Peer Advantage with Diversity-Aware Exploration Rewards for Multi-Answer Question Answering

May 30, 2026

Qiming Shi, Zhaolu Kang, Yunfan Zhou, Di Weng, Yingcai Wu

Abstract:Large language models are increasingly deployed as tool-augmented agents to acquire information beyond parametric knowledge. While recent work has improved long-horizon tool-use reasoning, most approaches focus on tasks with a single correct answer. In contrast, many real-world queries require discovering a comprehensive set of valid answers, a setting known as Multi-Answer QA. This setting raises two challenges: fine-grained credit assignment over long search trajectories and reward alignment for sustained exploration beyond easy high-frequency entities. We propose SPADER, a reinforcement learning framework for long-horizon tool use in Multi-Answer QA. SPADER includes Step-wise Peer Advantage (SPA), a critic-free step-level credit assignment mechanism that aligns parallel trajectories by decision step and estimates advantages from peer returns. It also includes a diversity-aware exploration reward that promotes long-tail entity discovery by upweighting rare findings and downweighting redundant ones. Experiments on QAMPARI, Mintaka, WebQSP, and QUEST show that SPADER generally improves recall and overall F1 over prompting-based agents, outcome-supervised RL methods, and recent step-level supervision approaches. Our code and model weights are available at https://github.com/KhanCold/spader.

Via

Access Paper or Ask Questions

Offline Reinforcement Learning with Adaptive Behavior Regularization

Nov 15, 2022

Yunfan Zhou, Xijun Li, Qingyu Qu

Figure 1 for Offline Reinforcement Learning with Adaptive Behavior Regularization

Figure 2 for Offline Reinforcement Learning with Adaptive Behavior Regularization

Figure 3 for Offline Reinforcement Learning with Adaptive Behavior Regularization

Figure 4 for Offline Reinforcement Learning with Adaptive Behavior Regularization

Abstract:Offline reinforcement learning (RL) defines a sample-efficient learning paradigm, where a policy is learned from static and previously collected datasets without additional interaction with the environment. The major obstacle to offline RL is the estimation error arising from evaluating the value of out-of-distribution actions. To tackle this problem, most existing offline RL methods attempt to acquire a policy both ``close" to the behaviors contained in the dataset and sufficiently improved over them, which requires a trade-off between two possibly conflicting targets. In this paper, we propose a novel approach, which we refer to as adaptive behavior regularization (ABR), to balance this critical trade-off. By simply utilizing a sample-based regularization, ABR enables the policy to adaptively adjust its optimization objective between cloning and improving over the policy used to generate the dataset. In the evaluation on D4RL datasets, a widely adopted benchmark for offline reinforcement learning, ABR can achieve improved or competitive performance compared to existing state-of-the-art algorithms.

* Xijun Li is the corresponding author

Via

Access Paper or Ask Questions

Yordle: An Efficient Imitation Learning for Branch and Bound

Feb 02, 2022

Qingyu Qu, Xijun Li, Yunfan Zhou

Figure 1 for Yordle: An Efficient Imitation Learning for Branch and Bound

Figure 2 for Yordle: An Efficient Imitation Learning for Branch and Bound

Figure 3 for Yordle: An Efficient Imitation Learning for Branch and Bound

Figure 4 for Yordle: An Efficient Imitation Learning for Branch and Bound

Abstract:Combinatorial optimization problems have aroused extensive research interests due to its huge application potential. In practice, there are highly redundant patterns and characteristics during solving the combinatorial optimization problem, which can be captured by machine learning models. Thus, the 2021 NeurIPS Machine Learning for Combinatorial Optimization (ML4CO) competition is proposed with the goal of improving state-of-the-art combinatorial optimization solvers by replacing key heuristic components with machine learning techniques. This work presents our solution and insights gained by team qqy in the dual task of the competition. Our solution is a highly efficient imitation learning framework for performance improvement of Branch and Bound (B&B), named Yordle. It employs a hybrid sampling method and an efficient data selection method, which not only accelerates the model training but also improves the decision quality during branching variable selection. In our experiments, Yordle greatly outperforms the baseline algorithm adopted by the competition while requiring significantly less time and amounts of data to train the decision model. Specifically, we use only 1/4 of the amount of data compared to that required for the baseline algorithm, to achieve around 50% higher score than baseline algorithm. The proposed framework Yordle won the championship of the student leaderboard.

* arXiv admin note: text overlap with arXiv:2201.06213

Via

Access Paper or Ask Questions

An Improved Reinforcement Learning Algorithm for Learning to Branch

Jan 17, 2022

Qingyu Qu, Xijun Li, Yunfan Zhou, Jia Zeng, Mingxuan Yuan, Jie Wang, Jinhu Lv, Kexin Liu, Kun Mao

Abstract:Most combinatorial optimization problems can be formulated as mixed integer linear programming (MILP), in which branch-and-bound (B\&B) is a general and widely used method. Recently, learning to branch has become a hot research topic in the intersection of machine learning and combinatorial optimization. In this paper, we propose a novel reinforcement learning-based B\&B algorithm. Similar to offline reinforcement learning, we initially train on the demonstration data to accelerate learning massively. With the improvement of the training effect, the agent starts to interact with the environment with its learned policy gradually. It is critical to improve the performance of the algorithm by determining the mixing ratio between demonstration and self-generated data. Thus, we propose a prioritized storage mechanism to control this ratio automatically. In order to improve the robustness of the training process, a superior network is additionally introduced based on Double DQN, which always serves as a Q-network with competitive performance. We evaluate the performance of the proposed algorithm over three public research benchmarks and compare it against strong baselines, including three classical heuristics and one state-of-the-art imitation learning-based branching algorithm. The results show that the proposed algorithm achieves the best performance among compared algorithms and possesses the potential to improve B\&B algorithm performance continuously.

Via

Access Paper or Ask Questions