Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Zhiming Huang

First Worst-Case Regret Bounds for Combinatorial Thompson Sampling in Sleeping Semi-Bandits

May 10, 2026

Zhiming Huang, Bingshan Hu, Jianping Pan

Abstract:We revisit combinatorial Thompson sampling (CTS) for semi-bandits with sleeping arms, where arm availability varies over time and actions must satisfy combinatorial constraints, as in wireless mesh routing with fluctuating link availability. Despite its practical relevance, CTS has been hindered by several long-standing problems: (i) the absence of worst-case regret guarantees in the semi-bandit setting even without sleeping arms, (ii) the lack of theory under adversarially varying availability, and (iii) the consistently weak empirical performance of CTS with Gaussian priors (CTS-G). This paper resolves these long-standing issues by providing the first worst-case regret analysis of CTS-G, proving an upper bound of $\tilde{O}(m\sqrt{NT})$ and a matching lower bound of $\tildeΩ(m\sqrt{NT})$. To bridge the gap between theory and practice, we further propose CL-SG, a simple CTS-G variant that samples a single shared Gaussian seed each round to coordinate exploration across arms. We show that CL-SG achieves an improved regret bound of $\tilde{O}(\sqrt{mNT})$, together with a matching lower bound $Ω(\sqrt{mNT})$. Experiments on real-world datasets demonstrate that CL-SG consistently outperforms strong baselines including CTS-G and CTS-B, and we open-source our implementation for reproducibility.

* Accepted by INFOCOM 26 on Dec 2025

Via

Access Paper or Ask Questions

Instance-Adaptive Online Multicalibration

May 10, 2026

Zhiming Huang, Jamie Morgenstern, Aaron Roth, Claire Jie Zhang

Abstract:We study online multicalibration beyond the worst-case. We give a single, efficient algorithm which dynamically interpolates between benign and worst-case sequences by adaptively refining a dyadic grid of prediction values. Its error is controlled by the number of leaves in the refinement tree. Our analysis recovers the known $\widetilde O(T^{2/3})$ worst-case-optimal rate for online multicalibration, while simultaneously automatically adapting to easier instances: in the marginal stochastic setting it obtains a rate of $\widetilde O(\sqrt T)$, and for piecewise-stationary means with $J$ segments its rate is $\widetilde O(\sqrt{JT})$. More generally, the rate depends on a threshold-complexity measure of the predictable mean process relative to the group family. We show that this dependence is tight up to logarithmic factors.

Via

Access Paper or Ask Questions

Connecting Thompson Sampling and UCB: Towards More Efficient Trade-offs Between Privacy and Regret

May 05, 2025

Bingshan Hu, Zhiming Huang, Tianyue H. Zhang, Mathias Lécuyer, Nidhi Hegde

Figure 1 for Connecting Thompson Sampling and UCB: Towards More Efficient Trade-offs Between Privacy and Regret

Figure 2 for Connecting Thompson Sampling and UCB: Towards More Efficient Trade-offs Between Privacy and Regret

Figure 3 for Connecting Thompson Sampling and UCB: Towards More Efficient Trade-offs Between Privacy and Regret

Figure 4 for Connecting Thompson Sampling and UCB: Towards More Efficient Trade-offs Between Privacy and Regret

Abstract:We address differentially private stochastic bandit problems from the angles of exploring the deep connections among Thompson Sampling with Gaussian priors, Gaussian mechanisms, and Gaussian differential privacy (GDP). We propose DP-TS-UCB, a novel parametrized private bandit algorithm that enables to trade off privacy and regret. DP-TS-UCB satisfies $ \tilde{O} \left(T^{0.25(1-\alpha)}\right)$-GDP and enjoys an $O \left(K\ln^{\alpha+1}(T)/\Delta \right)$ regret bound, where $\alpha \in [0,1]$ controls the trade-off between privacy and regret. Theoretically, our DP-TS-UCB relies on anti-concentration bounds of Gaussian distributions and links exploration mechanisms in Thompson Sampling-based algorithms and Upper Confidence Bound-based algorithms, which may be of independent interest.

* Accepted by ICML 2025

Via

Access Paper or Ask Questions

Efficient and Adaptive Posterior Sampling Algorithms for Bandits

May 02, 2024

Bingshan Hu, Zhiming Huang, Tianyue H. Zhang, Mathias Lécuyer, Nidhi Hegde

Figure 1 for Efficient and Adaptive Posterior Sampling Algorithms for Bandits

Figure 2 for Efficient and Adaptive Posterior Sampling Algorithms for Bandits

Figure 3 for Efficient and Adaptive Posterior Sampling Algorithms for Bandits

Abstract:We study Thompson Sampling-based algorithms for stochastic bandits with bounded rewards. As the existing problem-dependent regret bound for Thompson Sampling with Gaussian priors [Agrawal and Goyal, 2017] is vacuous when $T \le 288 e^{64}$, we derive a more practical bound that tightens the coefficient of the leading term %from $288 e^{64}$ to $1270$. Additionally, motivated by large-scale real-world applications that require scalability, adaptive computational resource allocation, and a balance in utility and computation, we propose two parameterized Thompson Sampling-based algorithms: Thompson Sampling with Model Aggregation (TS-MA-$\alpha$) and Thompson Sampling with Timestamp Duelling (TS-TD-$\alpha$), where $\alpha \in [0,1]$ controls the trade-off between utility and computation. Both algorithms achieve $O \left(K\ln^{\alpha+1}(T)/\Delta \right)$ regret bound, where $K$ is the number of arms, $T$ is the finite learning horizon, and $\Delta$ denotes the single round performance loss when pulling a sub-optimal arm.

Via

Access Paper or Ask Questions

Optimal Algorithms for Private Online Learning in a Stochastic Environment

Feb 16, 2021

Bingshan Hu, Zhiming Huang, Nishant A. Mehta

Figure 1 for Optimal Algorithms for Private Online Learning in a Stochastic Environment

Figure 2 for Optimal Algorithms for Private Online Learning in a Stochastic Environment

Figure 3 for Optimal Algorithms for Private Online Learning in a Stochastic Environment

Figure 4 for Optimal Algorithms for Private Online Learning in a Stochastic Environment

Abstract:We consider two variants of private stochastic online learning. The first variant is differentially private stochastic bandits. Previously, Sajed and Sheffet (2019) devised the DP Successive Elimination (DP-SE) algorithm that achieves the optimal $ O \biggl(\sum\limits_{1\le j \le K: \Delta_j >0} \frac{ \log T}{ \Delta_j} + \frac{ K\log T}{\epsilon} \biggr)$ problem-dependent regret bound, where $K$ is the number of arms, $\Delta_j$ is the mean reward gap of arm $j$, $T$ is the time horizon, and $\epsilon$ is the required privacy parameter. However, like other elimination style algorithms, it is not an anytime algorithm. Until now, it was not known whether UCB-based algorithms could achieve this optimal regret bound. We present an anytime, UCB-based algorithm that achieves optimality. Our experiments show that the UCB-based algorithm is competitive with DP-SE. The second variant is the full information version of private stochastic online learning. Specifically, for the problems of decision-theoretic online learning with stochastic rewards, we present the first algorithm that achieves an $ O \left( \frac{ \log K}{ \Delta_{\min}} + \frac{ \log K}{\epsilon} \right)$ regret bound, where $\Delta_{\min}$ is the minimum mean reward gap. The key idea behind our good theoretical guarantees in both settings is the forgetfulness, i.e., decisions are made based on a certain amount of newly obtained observations instead of all the observations obtained from the very beginning.

Via

Access Paper or Ask Questions

Thompson Sampling for Combinatorial Semi-bandits with Sleeping Arms and Long-Term Fairness Constraints

May 14, 2020

Zhiming Huang, Yifan Xu, Bingshan Hu, Qipeng Wang, Jianping Pan

Figure 1 for Thompson Sampling for Combinatorial Semi-bandits with Sleeping Arms and Long-Term Fairness Constraints

Figure 2 for Thompson Sampling for Combinatorial Semi-bandits with Sleeping Arms and Long-Term Fairness Constraints

Figure 3 for Thompson Sampling for Combinatorial Semi-bandits with Sleeping Arms and Long-Term Fairness Constraints

Figure 4 for Thompson Sampling for Combinatorial Semi-bandits with Sleeping Arms and Long-Term Fairness Constraints

Abstract:We study the combinatorial sleeping multi-armed semi-bandit problem with long-term fairness constraints~(CSMAB-F). To address the problem, we adopt Thompson Sampling~(TS) to maximize the total rewards and use virtual queue techniques to handle the fairness constraints, and design an algorithm called \emph{TS with beta priors and Bernoulli likelihoods for CSMAB-F~(TSCSF-B)}. Further, we prove TSCSF-B can satisfy the fairness constraints, and the time-averaged regret is upper bounded by $\frac{N}{2\eta} + O\left(\frac{\sqrt{mNT\ln T}}{T}\right)$, where $N$ is the total number of arms, $m$ is the maximum number of arms that can be pulled simultaneously in each round~(the cardinality constraint) and $\eta$ is the parameter trading off fairness for rewards. By relaxing the fairness constraints (i.e., let $\eta \rightarrow \infty$), the bound boils down to the first problem-independent bound of TS algorithms for combinatorial sleeping multi-armed semi-bandit problems. Finally, we perform numerical experiments and use a high-rating movie recommendation application to show the effectiveness and efficiency of the proposed algorithm.

Via

Access Paper or Ask Questions

Uncertainty measurement with belief entropy on interference effect in Quantum-Like Bayesian Networks

Sep 08, 2017

Zhiming Huang, Lin Yang, Wen Jiang

Figure 1 for Uncertainty measurement with belief entropy on interference effect in Quantum-Like Bayesian Networks

Figure 2 for Uncertainty measurement with belief entropy on interference effect in Quantum-Like Bayesian Networks

Figure 3 for Uncertainty measurement with belief entropy on interference effect in Quantum-Like Bayesian Networks

Figure 4 for Uncertainty measurement with belief entropy on interference effect in Quantum-Like Bayesian Networks

Abstract:Social dilemmas have been regarded as the essence of evolution game theory, in which the prisoner's dilemma game is the most famous metaphor for the problem of cooperation. Recent findings revealed people's behavior violated the Sure Thing Principle in such games. Classic probability methodologies have difficulty explaining the underlying mechanisms of people's behavior. In this paper, a novel quantum-like Bayesian Network was proposed to accommodate the paradoxical phenomenon. The special network can take interference into consideration, which is likely to be an efficient way to describe the underlying mechanism. With the assistance of belief entropy, named as Deng entropy, the paper proposes Belief Distance to render the model practical. Tested with empirical data, the proposed model is proved to be predictable and effective.

* 25 Pages, 7 figures, Revision Submitted to Applied Mathematics and Computations

Via

Access Paper or Ask Questions