Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Muheng Li

What should post-training optimize? A test-time scaling law perspective

May 11, 2026

Muheng Li, Jian Qian, Wenlong Mou

Abstract:Large language models are increasingly deployed with test-time strategies: sample $N$ responses, score them with a reward model or verifier, and return the best. This deployment rule exposes a mismatch in post-training: standard objectives optimize the mean reward of a single response, whereas best-of-$N$ performance is governed by the upper tail of the reward distribution. Recent test-time-aware objectives partly address this mismatch, but typically assume that training can use the same per-prompt rollout budget as deployment, which is impractical when post-training must cover many prompts while deployment can allocate much larger per-prompt test-time compute. We study this budget-mismatch regime, where only $m\ll N$ per-prompt rollouts are available during training but the target objective is best-of-$N$ deployment. Under structural assumptions on the reward tails, we show that the policy gradient of the best-of-$N$ objective can be approximated from a much smaller rollout group by extrapolating upper-tail statistics. This yields a family of Tail-Extrapolated estimators for best-of-$N$-oriented post-training: a simple direct estimator, Tail-Extrapolated Advantage (TEA), and a fixed-order debiased Prefix-TEA estimator based on moment cancellation. Experiments on instruction-following tasks show that TEA and Prefix-TEA improve best-of-$N$ performance across different language models, reward models and datasets under various training and test-time budget settings.

Via

Access Paper or Ask Questions

DoseRAD2026 Challenge dataset: AI accelerated photon and proton dose calculation for radiotherapy

Apr 14, 2026

Fan Xiao, Nikolaos Delopoulos, Niklas Wahl, Lennart Volz, Lina Bucher, Matteo Maspero, Miguel Palacios, Muheng Li, Samir Schulz, Viktor Rogowski(+6 more)

Abstract:Purpose: Accurate dose calculation is essential in radiotherapy for precise tumor irradiation while sparing healthy tissue. With the growing adoption of MRI-guided and real-time adaptive radiotherapy, fast and accurate dose calculation on CT and MRI is increasingly needed. The DoseRAD2026 dataset and challenge provide a public benchmark of paired CT and MRI data with beam-level photon and proton Monte Carlo dose distributions for developing and evaluating advanced dose calculation methods. Acquisition and validation methods: The dataset comprises paired CT and MRI from 115 patients (75 training, 40 testing) treated on an MRI-linac for thoracic or abdominal lesions, derived from the SynthRAD2025 dataset. Pre-processing included deformable image registration, air-cavity correction, and resampling. Ground-truth photon (6 MV) and proton dose distributions were computed using open-source Monte Carlo algorithms, yielding 40,500 photon beams and 81,000 proton beamlets. Data format and usage notes: Data are organized into photon and proton subsets with paired CT-MRI images, beam-level dose distributions, and JSON beam configuration files. Files are provided in compressed MetaImage (.mha) format. The dataset is released under CC BY-NC 4.0, with training data available from April 2026 and the test set withheld until March 2030. Potential applications: The dataset supports benchmarking of fast dose calculation methods, including beam-level dose estimation for photon and proton therapy, MRI-based dose calculation in MRI-guided workflows, and real-time adaptive radiotherapy.

Via

Access Paper or Ask Questions

Predicting and improving test-time scaling laws via reward tail-guided search

Feb 01, 2026

Muheng Li, Jian Qian, Wenlong Mou

Abstract:Test-time scaling has emerged as a critical avenue for enhancing the reasoning capabilities of Large Language Models (LLMs). Though the straight-forward ''best-of-$N$'' (BoN) strategy has already demonstrated significant improvements in performance, it lacks principled guidance on the choice of $N$, budget allocation, and multi-stage decision-making, thereby leaving substantial room for optimization. While many works have explored such optimization, rigorous theoretical guarantees remain limited. In this work, we propose new methodologies to predict and improve scaling properties via tail-guided search. By estimating the tail distribution of rewards, our method predicts the scaling law of LLMs without the need for exhaustive evaluations. Leveraging this prediction tool, we introduce Scaling-Law Guided (SLG) Search, a new test-time algorithm that dynamically allocates compute to identify and exploit intermediate states with the highest predicted potential. We theoretically prove that SLG achieves vanishing regret compared to perfect-information oracles, and achieves expected rewards that would otherwise require a polynomially larger compute budget required when using BoN. Empirically, we validate our framework across different LLMs and reward models, confirming that tail-guided allocation consistently achieves higher reward yields than Best-of-$N$ under identical compute budgets. Our code is available at https://github.com/PotatoJnny/Scaling-Law-Guided-search.

* 33 pages, 5 figures

Via

Access Paper or Ask Questions

Reheated Gradient-based Discrete Sampling for Combinatorial Optimization

Mar 06, 2025

Muheng Li, Ruqi Zhang

Figure 1 for Reheated Gradient-based Discrete Sampling for Combinatorial Optimization

Figure 2 for Reheated Gradient-based Discrete Sampling for Combinatorial Optimization

Figure 3 for Reheated Gradient-based Discrete Sampling for Combinatorial Optimization

Figure 4 for Reheated Gradient-based Discrete Sampling for Combinatorial Optimization

Abstract:Recently, gradient-based discrete sampling has emerged as a highly efficient, general-purpose solver for various combinatorial optimization (CO) problems, achieving performance comparable to or surpassing the popular data-driven approaches. However, we identify a critical issue in these methods, which we term ''wandering in contours''. This behavior refers to sampling new different solutions that share very similar objective values for a long time, leading to computational inefficiency and suboptimal exploration of potential solutions. In this paper, we introduce a novel reheating mechanism inspired by the concept of critical temperature and specific heat in physics, aimed at overcoming this limitation. Empirically, our method demonstrates superiority over existing sampling-based and data-driven algorithms across a diverse array of CO problems.

Via

Access Paper or Ask Questions

Solving Urban Network Security Games: Learning Platform, Benchmark, and Challenge for AI Research

Jan 29, 2025

Shuxin Zhuang, Shuxin Li, Tianji Yang, Muheng Li, Xianjie Shi, Bo An, Youzhi Zhang

Figure 1 for Solving Urban Network Security Games: Learning Platform, Benchmark, and Challenge for AI Research

Figure 2 for Solving Urban Network Security Games: Learning Platform, Benchmark, and Challenge for AI Research

Figure 3 for Solving Urban Network Security Games: Learning Platform, Benchmark, and Challenge for AI Research

Figure 4 for Solving Urban Network Security Games: Learning Platform, Benchmark, and Challenge for AI Research

Abstract:After the great achievement of solving two-player zero-sum games, more and more AI researchers focus on solving multiplayer games. To facilitate the development of designing efficient learning algorithms for solving multiplayer games, we propose a multiplayer game platform for solving Urban Network Security Games (\textbf{UNSG}) that model real-world scenarios. That is, preventing criminal activity is a highly significant responsibility assigned to police officers in cities, and police officers have to allocate their limited security resources to interdict the escaping criminal when a crime takes place in a city. This interaction between multiple police officers and the escaping criminal can be modeled as a UNSG. The variants of UNSGs can model different real-world settings, e.g., whether real-time information is available or not, and whether police officers can communicate or not. The main challenges of solving this game include the large size of the game and the co-existence of cooperation and competition. While previous efforts have been made to tackle UNSGs, they have been hampered by performance and scalability issues. Therefore, we propose an open-source UNSG platform (\textbf{GraphChase}) for designing efficient learning algorithms for solving UNSGs. Specifically, GraphChase offers a unified and flexible game environment for modeling various variants of UNSGs, supporting the development, testing, and benchmarking of algorithms. We believe that GraphChase not only facilitates the development of efficient algorithms for solving real-world problems but also paves the way for significant advancements in algorithmic development for solving general multiplayer games.

Via

Access Paper or Ask Questions

Continuous sPatial-Temporal Deformable Image Registration (CPT-DIR) for motion modelling in radiotherapy: beyond classic voxel-based methods

May 01, 2024

Xia Li, Muheng Li, Antony Lomax, Joachim Buhmann, Ye Zhang

Figure 1 for Continuous sPatial-Temporal Deformable Image Registration (CPT-DIR) for motion modelling in radiotherapy: beyond classic voxel-based methods

Figure 2 for Continuous sPatial-Temporal Deformable Image Registration (CPT-DIR) for motion modelling in radiotherapy: beyond classic voxel-based methods

Figure 3 for Continuous sPatial-Temporal Deformable Image Registration (CPT-DIR) for motion modelling in radiotherapy: beyond classic voxel-based methods

Figure 4 for Continuous sPatial-Temporal Deformable Image Registration (CPT-DIR) for motion modelling in radiotherapy: beyond classic voxel-based methods

Abstract:Background and purpose: Deformable image registration (DIR) is a crucial tool in radiotherapy for extracting and modelling organ motion. However, when significant changes and sliding boundaries are present, it faces compromised accuracy and uncertainty, determining the subsequential contour propagation and dose accumulation procedures. Materials and methods: We propose an implicit neural representation (INR)-based approach modelling motion continuously in both space and time, named Continues-sPatial-Temporal DIR (CPT-DIR). This method uses a multilayer perception (MLP) network to map 3D coordinate (x,y,z) to its corresponding velocity vector (vx,vy,vz). The displacement vectors (dx,dy,dz) are then calculated by integrating velocity vectors over time. The MLP's parameters can rapidly adapt to new cases without pre-training, enhancing optimisation. The DIR's performance was tested on the DIR-Lab dataset of 10 lung 4DCT cases, using metrics of landmark accuracy (TRE), contour conformity (Dice) and image similarity (MAE). Results: The proposed CPT-DIR can reduce landmark TRE from 2.79mm to 0.99mm, outperforming B-splines' results for all cases. The MAE of the whole-body region improves from 35.46HU to 28.99HU. Furthermore, CPT-DIR surpasses B-splines for accuracy in the sliding boundary region, lowering MAE and increasing Dice coefficients for the ribcage from 65.65HU and 90.41% to 42.04HU and 90.56%, versus 75.40HU and 89.30% without registration. Meanwhile, CPT-DIR offers significant speed advantages, completing in under 15 seconds compared to a few minutes with the conventional B-splines method. Conclusion: Leveraging the continuous representations, the CPT-DIR method significantly enhances registration accuracy, automation and speed, outperforming traditional B-splines in landmark and contour precision, particularly in the challenging areas.

Via

Access Paper or Ask Questions

Diffusion Schrödinger Bridge Models for High-Quality MR-to-CT Synthesis for Head and Neck Proton Treatment Planning

Apr 17, 2024

Muheng Li, Xia Li, Sairos Safai, Damien Weber, Antony Lomax, Ye Zhang

Figure 1 for Diffusion Schrödinger Bridge Models for High-Quality MR-to-CT Synthesis for Head and Neck Proton Treatment Planning

Figure 2 for Diffusion Schrödinger Bridge Models for High-Quality MR-to-CT Synthesis for Head and Neck Proton Treatment Planning

Figure 3 for Diffusion Schrödinger Bridge Models for High-Quality MR-to-CT Synthesis for Head and Neck Proton Treatment Planning

Figure 4 for Diffusion Schrödinger Bridge Models for High-Quality MR-to-CT Synthesis for Head and Neck Proton Treatment Planning

Abstract:In recent advancements in proton therapy, MR-based treatment planning is gaining momentum to minimize additional radiation exposure compared to traditional CT-based methods. This transition highlights the critical need for accurate MR-to-CT image synthesis, which is essential for precise proton dose calculations. Our research introduces the Diffusion Schr\"odinger Bridge Models (DSBM), an innovative approach for high-quality MR-to-CT synthesis. DSBM learns the nonlinear diffusion processes between MR and CT data distributions. This method improves upon traditional diffusion models by initiating synthesis from the prior distribution rather than the Gaussian distribution, enhancing both generation quality and efficiency. We validated the effectiveness of DSBM on a head and neck cancer dataset, demonstrating its superiority over traditional image synthesis methods through both image-level and dosimetric-level evaluations. The effectiveness of DSBM in MR-based proton treatment planning highlights its potential as a valuable tool in various clinical scenarios.

* International Conference on the use of Computers in Radiation therapy (ICCR)

Via

Access Paper or Ask Questions

Learning Dual-Level Deformable Implicit Representation for Real-World Scale Arbitrary Super-Resolution

Mar 16, 2024

Zhiheng Li, Muheng Li, Jixuan Fan, Lei Chen, Yansong Tang, Jie Zhou, Jiwen Lu

Figure 1 for Learning Dual-Level Deformable Implicit Representation for Real-World Scale Arbitrary Super-Resolution

Figure 2 for Learning Dual-Level Deformable Implicit Representation for Real-World Scale Arbitrary Super-Resolution

Figure 3 for Learning Dual-Level Deformable Implicit Representation for Real-World Scale Arbitrary Super-Resolution

Figure 4 for Learning Dual-Level Deformable Implicit Representation for Real-World Scale Arbitrary Super-Resolution

Abstract:Scale arbitrary super-resolution based on implicit image function gains increasing popularity since it can better represent the visual world in a continuous manner. However, existing scale arbitrary works are trained and evaluated on simulated datasets, where low-resolution images are generated from their ground truths by the simplest bicubic downsampling. These models exhibit limited generalization to real-world scenarios due to the greater complexity of real-world degradations. To address this issue, we build a RealArbiSR dataset, a new real-world super-resolution benchmark with both integer and non-integer scaling factors for the training and evaluation of real-world scale arbitrary super-resolution. Moreover, we propose a Dual-level Deformable Implicit Representation (DDIR) to solve real-world scale arbitrary super-resolution. Specifically, we design the appearance embedding and deformation field to handle both image-level and pixel-level deformations caused by real-world degradations. The appearance embedding models the characteristics of low-resolution inputs to deal with photometric variations at different scales, and the pixel-based deformation field learns RGB differences which result from the deviations between the real-world and simulated degradations at arbitrary coordinates. Extensive experiments show our trained model achieves state-of-the-art performance on the RealArbiSR and RealSR benchmarks for real-world scale arbitrary super-resolution. Our dataset as well as source code will be publicly available.

Via

Access Paper or Ask Questions

Neural Graphics Primitives-based Deformable Image Registration for On-the-fly Motion Extraction

Feb 08, 2024

Xia Li, Fabian Zhang, Muheng Li, Damien Weber, Antony Lomax, Joachim Buhmann, Ye Zhang

Abstract:Intra-fraction motion in radiotherapy is commonly modeled using deformable image registration (DIR). However, existing methods often struggle to balance speed and accuracy, limiting their applicability in clinical scenarios. This study introduces a novel approach that harnesses Neural Graphics Primitives (NGP) to optimize the displacement vector field (DVF). Our method leverages learned primitives, processed as splats, and interpolates within space using a shallow neural network. Uniquely, it enables self-supervised optimization at an ultra-fast speed, negating the need for pre-training on extensive datasets and allowing seamless adaptation to new cases. We validated this approach on the 4D-CT lung dataset DIR-lab, achieving a target registration error (TRE) of 1.15\pm1.15 mm within a remarkable time of 1.77 seconds. Notably, our method also addresses the sliding boundary problem, a common challenge in conventional DIR methods.

Via

Access Paper or Ask Questions

Skip-Plan: Procedure Planning in Instructional Videos via Condensed Action Space Learning

Oct 01, 2023

Zhiheng Li, Wenjia Geng, Muheng Li, Lei Chen, Yansong Tang, Jiwen Lu, Jie Zhou

Figure 1 for Skip-Plan: Procedure Planning in Instructional Videos via Condensed Action Space Learning

Figure 2 for Skip-Plan: Procedure Planning in Instructional Videos via Condensed Action Space Learning

Figure 3 for Skip-Plan: Procedure Planning in Instructional Videos via Condensed Action Space Learning

Figure 4 for Skip-Plan: Procedure Planning in Instructional Videos via Condensed Action Space Learning

Abstract:In this paper, we propose Skip-Plan, a condensed action space learning method for procedure planning in instructional videos. Current procedure planning methods all stick to the state-action pair prediction at every timestep and generate actions adjacently. Although it coincides with human intuition, such a methodology consistently struggles with high-dimensional state supervision and error accumulation on action sequences. In this work, we abstract the procedure planning problem as a mathematical chain model. By skipping uncertain nodes and edges in action chains, we transfer long and complex sequence functions into short but reliable ones in two ways. First, we skip all the intermediate state supervision and only focus on action predictions. Second, we decompose relatively long chains into multiple short sub-chains by skipping unreliable intermediate actions. By this means, our model explores all sorts of reliable sub-relations within an action sequence in the condensed action space. Extensive experiments show Skip-Plan achieves state-of-the-art performance on the CrossTask and COIN benchmarks for procedure planning.

Via

Access Paper or Ask Questions