Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Boyu Tian

PlexRL: Cluster-Level Orchestration of Serviceized LLM Execution for RLVR

May 20, 2026

Yiqi Zhang, Fangzheng Jiao, Tian Tang, Boyu Tian, Hangyu Wang, Qiaoling Chen, Guoteng Wang, Zhen Jiang, Peng Sun, Ping Zhang(+6 more)

Abstract:Reinforcement learning with verifiable rewards (RLVR) has recently unlocked strong reasoning capabilities in large language models (LLMs), triggering rapid exploration of new algorithms and data. However, RLVR training is notoriously inefficient: long-tailed rollouts, tool-induced stalls, and asymmetric resource requirements between rollout and training introduce substantial idle time that cannot be eliminated by job-local optimizations such as synchronous pipelining, asynchronous rollout, or colocated execution. We argue that this inefficiency is structural. While idle gaps are unavoidable within individual RLVR jobs, they are largely anti-correlated across jobs and therefore exploitable at the cluster level. Leveraging this observation, we present PlexRL, a cluster-level runtime for multiplexing unified LLM services across RLVR jobs. By centrally managing model placement, state transitions, and function-level scheduling under strict affinity constraints, PlexRL time-slices LLM execution across jobs to fill otherwise idle periods without expensive model migration. Our implementation and evaluations demonstrate that PlexRL significantly improves effective cluster capacity and reduces user GPU hour cost by maximum 37.58% while preserving algorithmic flexibility and introducing minimal per-job overhead.

Via

Access Paper or Ask Questions

Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning

Feb 06, 2025

Chaofan Lin, Jiaming Tang, Shuo Yang, Hanshuo Wang, Tian Tang, Boyu Tian, Ion Stoica, Song Han, Mingyu Gao

Figure 1 for Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning

Figure 2 for Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning

Figure 3 for Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning

Figure 4 for Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning

Abstract:Leveraging attention sparsity to accelerate long-context large language models (LLMs) has been a hot research topic. However, current algorithms such as sparse attention or key-value (KV) cache compression tend to use a fixed budget, which presents a significant challenge during deployment because it fails to account for the dynamic nature of real-world scenarios, where the optimal balance between accuracy and efficiency can vary greatly. In this paper, we find that borrowing top-$p$ sampling (nucleus sampling) to sparse attention can surprisingly achieve adaptive budgeting. Based on this, we propose Twilight, a framework to bring adaptive sparsity to any existing sparse attention algorithm without sacrificing their accuracy. Empirical results show that Twilight can adaptively prune at most 98% of redundant tokens, leading to $15.4\times$ acceleration in self-attention operations and $3.9\times$ acceleration in end-to-end per token latency in long context LLM decoding.

Via

Access Paper or Ask Questions

Adaptive Unscented Kalman Filter under Minimum Error Entropy with Fiducial Points for Non-Gaussian Systems

Sep 18, 2023

Boyu Tian, Haiquan Zhao

Figure 1 for Adaptive Unscented Kalman Filter under Minimum Error Entropy with Fiducial Points for Non-Gaussian Systems

Figure 2 for Adaptive Unscented Kalman Filter under Minimum Error Entropy with Fiducial Points for Non-Gaussian Systems

Figure 3 for Adaptive Unscented Kalman Filter under Minimum Error Entropy with Fiducial Points for Non-Gaussian Systems

Figure 4 for Adaptive Unscented Kalman Filter under Minimum Error Entropy with Fiducial Points for Non-Gaussian Systems

Abstract:The minimum error entropy (MEE) has been extensively used in unscented Kalman filter (UKF) to handle impulsive noises or abnormal measurement data in non-Gaussian systems. However, the MEE-UKF has poor numerical stability due to the inverse operation of singular matrix. In this paper, a novel UKF based on minimum error entropy with fiducial points (MEEF) is proposed \textcolor{black}{to improve the problem of non-positive definite key matrix. By adding the correntropy to the error entropy, the proposed algorithm further enhances the ability of suppressing impulse noise and outliers. At the same time, considering the uncertainty of noise distribution, the modified Sage-Husa estimator of noise statistics is introduced to adaptively update the noise covariance matrix. In addition, the convergence analysis of the proposed algorithm provides a guidance for the selection of kernel width. The robustness and estimation accuracy of the proposed algorithm are manifested by the state tracking examples under complex non-Gaussian noises.

* Automatica(March 22 2022)
* 29 pages,6 figures

Via

Access Paper or Ask Questions