Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Mengxiao Zhang, Haipeng Luo

Contextual multinomial logit (MNL) bandits capture many real-world assortment recommendation problems such as online retailing/advertising. However, prior work has only considered (generalized) linear value functions, which greatly limits its applicability. Motivated by this fact, in this work, we consider contextual MNL bandits with a general value function class that contains the ground truth, borrowing ideas from a recent trend of studies on contextual bandits. Specifically, we consider both the stochastic and the adversarial settings, and propose a suite of algorithms, each with different computation-regret trade-off. When applied to the linear case, our results not only are the first ones with no dependence on a certain problem-dependent constant that can be exponentially large, but also enjoy other advantages such as computational efficiency, dimension-free regret bounds, or the ability to handle completely adversarial contexts and rewards.

Via

Mengxiao Zhang, Yuheng Zhang, Haipeng Luo, Paul Mineiro

Bandits with feedback graphs are powerful online learning models that interpolate between the full information and classic bandit problems, capturing many real-life applications. A recent work by Zhang et al. (2023) studies the contextual version of this problem and proposes an efficient and optimal algorithm via a reduction to online regression. However, their algorithm crucially relies on seeing the feedback graph before making each decision, while in many applications, the feedback graph is uninformed, meaning that it is either only revealed after the learner makes her decision or even never fully revealed at all. This work develops the first contextual algorithm for such uninformed settings, via an efficient reduction to online regression over both the losses and the graphs. Importantly, we show that it is critical to learn the graphs using log loss instead of squared loss to obtain favorable regret guarantees. We also demonstrate the empirical effectiveness of our algorithm on a bidding application using both synthetic and real-world data.

Via

Mengxiao Zhang, Haipeng Luo

We study online learning in contextual pay-per-click auctions where at each of the $T$ rounds, the learner receives some context along with a set of ads and needs to make an estimate on their click-through rate (CTR) in order to run a second-price pay-per-click auction. The learner's goal is to minimize her regret, defined as the gap between her total revenue and that of an oracle strategy that always makes perfect CTR predictions. We first show that $\sqrt{T}$-regret is obtainable via a computationally inefficient algorithm and that it is unavoidable since our algorithm is no easier than the classical multi-armed bandit problem. A by-product of our results is a $\sqrt{T}$-regret bound for the simpler non-contextual setting, improving upon a recent work of [Feng et al., 2023] by removing the inverse CTR dependency that could be arbitrarily large. Then, borrowing ideas from recent advances on efficient contextual bandit algorithms, we develop two practically efficient contextual auction algorithms: the first one uses the exponential weight scheme with optimistic square errors and maintains the same $\sqrt{T}$-regret bound, while the second one reduces the problem to online regression via a simple epsilon-greedy strategy, albeit with a worse regret bound. Finally, we conduct experiments on a synthetic dataset to showcase the effectiveness and superior performance of our algorithms.

Via

Mengxiao Zhang, Fernando Beltran, Jiamou Liu

A data marketplace is an online venue that brings data owners, data brokers, and data consumers together and facilitates commoditisation of data amongst them. Data pricing, as a key function of a data marketplace, demands quantifying the monetary value of data. A considerable number of studies on data pricing can be found in literature. This paper attempts to comprehensively review the state-of-the-art on existing data pricing studies to provide a general understanding of this emerging research area. Our key contribution lies in a new taxonomy of data pricing studies that unifies different attributes determining data prices. The basis of our framework categorises these studies by the kind of market structure, be it sell-side, buy-side, or two-sided. Then in a sell-side market, the studies are further divided by query type, which defines the way a data consumer accesses data, while in a buy-side market, the studies are divided according to privacy notion, which defines the way to quantify privacy of data owners. In a two-sided market, both privacy notion and query type are used as criteria. We systematically examine the studies falling into each category in our taxonomy. Lastly, we discuss gaps within the existing research and define future research directions.

Via

Brendan Lucier, Sarath Pattathil, Aleksandrs Slivkins, Mengxiao Zhang

We study a game between autobidding algorithms that compete in an online advertising platform. Each autobidder is tasked with maximizing its advertiser's total value over multiple rounds of a repeated auction, subject to budget and/or return-on-investment constraints. We propose a gradient-based learning algorithm that is guaranteed to satisfy all constraints and achieves vanishing individual regret. Our algorithm uses only bandit feedback and can be used with the first- or second-price auction, as well as with any "intermediate" auction format. Our main result is that when these autobidders play against each other, the resulting expected liquid welfare over all rounds is at least half of the expected optimal liquid welfare achieved by any allocation. This holds whether or not the bidding dynamics converges to an equilibrium and regardless of the correlation structure between advertiser valuations.

Via

Mengxiao Zhang, Shi Chen, Haipeng Luo, Yingfei Wang

Supply chain management (SCM) has been recognized as an important discipline with applications to many industries, where the two-echelon stochastic inventory model, involving one downstream retailer and one upstream supplier, plays a fundamental role for developing firms' SCM strategies. In this work, we aim at designing online learning algorithms for this problem with an unknown demand distribution, which brings distinct features as compared to classic online optimization problems. Specifically, we consider the two-echelon supply chain model introduced in [Cachon and Zipkin, 1999] under two different settings: the centralized setting, where a planner decides both agents' strategy simultaneously, and the decentralized setting, where two agents decide their strategy independently and selfishly. We design algorithms that achieve favorable guarantees for both regret and convergence to the optimal inventory decision in both settings, and additionally for individual regret in the decentralized setting. Our algorithms are based on Online Gradient Descent and Online Newton Step, together with several new ingredients specifically designed for our problem. We also implement our algorithms and show their empirical effectiveness.

Via

Haipeng Luo, Hanghang Tong, Mengxiao Zhang, Yuheng Zhang

We study high-probability regret bounds for adversarial $K$-armed bandits with time-varying feedback graphs over $T$ rounds. For general strongly observable graphs, we develop an algorithm that achieves the optimal regret $\widetilde{\mathcal{O}}((\sum_{t=1}^T\alpha_t)^{1/2}+\max_{t\in[T]}\alpha_t)$ with high probability, where $\alpha_t$ is the independence number of the feedback graph at round $t$. Compared to the best existing result [Neu, 2015] which only considers graphs with self-loops for all nodes, our result not only holds more generally, but importantly also removes any $\text{poly}(K)$ dependence that can be prohibitively large for applications such as contextual bandits. Furthermore, we also develop the first algorithm that achieves the optimal high-probability regret bound for weakly observable graphs, which even improves the best expected regret bound of [Alon et al., 2015] by removing the $\mathcal{O}(\sqrt{KT})$ term with a refined analysis. Our algorithms are based on the online mirror descent framework, but importantly with an innovative combination of several techniques. Notably, while earlier works use optimistic biased loss estimators for achieving high-probability bounds, we find it important to use a pessimistic one for nodes without self-loop in a strongly observable graph.

Via

Chaofei Hong, Mengwen Yuan, Mengxiao Zhang, Xiao Wang, Chegnjun Zhang, Jiaxin Wang, Gang Pan, Zhaohui Wu, Huajin Tang

Neuromorphic computing is an emerging research field that aims to develop new intelligent systems by integrating theories and technologies from multi-disciplines such as neuroscience and deep learning. Currently, there have been various software frameworks developed for the related fields, but there is a lack of an efficient framework dedicated for spike-based computing models and algorithms. In this work, we present a Python based spiking neural network (SNN) simulation and training framework, aka SPAIC that aims to support brain-inspired model and algorithm researches integrated with features from both deep learning and neuroscience. To integrate different methodologies from the two overwhelming disciplines, and balance between flexibility and efficiency, SPAIC is designed with neuroscience-style frontend and deep learning backend structure. We provide a wide range of examples including neural circuits Simulation, deep SNN learning and neuromorphic applications, demonstrating the concise coding style and wide usability of our framework. The SPAIC is a dedicated spike-based artificial intelligence computing platform, which will significantly facilitate the design, prototype and validation of new models, theories and applications. Being user-friendly, flexible and high-performance, it will help accelerate the rapid growth and wide applicability of neuromorphic computing research.

Via

Haipeng Luo, Mengxiao Zhang, Peng Zhao, Zhi-Hua Zhou

We consider the problem of combining and learning over a set of adversarial bandit algorithms with the goal of adaptively tracking the best one on the fly. The CORRAL algorithm of Agarwal et al. (2017) and its variants (Foster et al., 2020a) achieve this goal with a regret overhead of order $\widetilde{O}(\sqrt{MT})$ where $M$ is the number of base algorithms and $T$ is the time horizon. The polynomial dependence on $M$, however, prevents one from applying these algorithms to many applications where $M$ is poly$(T)$ or even larger. Motivated by this issue, we propose a new recipe to corral a larger band of bandit algorithms whose regret overhead has only \emph{logarithmic} dependence on $M$ as long as some conditions are satisfied. As the main example, we apply our recipe to the problem of adversarial linear bandits over a $d$-dimensional $\ell_p$ unit-ball for $p \in (1,2]$. By corralling a large set of $T$ base algorithms, each starting at a different time step, our final algorithm achieves the first optimal switching regret $\widetilde{O}(\sqrt{d S T})$ when competing against a sequence of comparators with $S$ switches (for some known $S$). We further extend our results to linear bandits over a smooth and strongly convex domain as well as unconstrained linear bandits.

Via

Haipeng Luo, Mengxiao Zhang, Peng Zhao

We consider the problem of adversarial bandit convex optimization, that is, online learning over a sequence of arbitrary convex loss functions with only one function evaluation for each of them. While all previous works assume known and homogeneous curvature on these loss functions, we study a heterogeneous setting where each function has its own curvature that is only revealed after the learner makes a decision. We develop an efficient algorithm that is able to adapt to the curvature on the fly. Specifically, our algorithm not only recovers or \emph{even improves} existing results for several homogeneous settings, but also leads to surprising results for some heterogeneous settings -- for example, while Hazan and Levy (2014) showed that $\widetilde{O}(d^{3/2}\sqrt{T})$ regret is achievable for a sequence of $T$ smooth and strongly convex $d$-dimensional functions, our algorithm reveals that the same is achievable even if $T^{3/4}$ of them are not strongly convex, and sometimes even if a constant fraction of them are not strongly convex. Our approach is inspired by the framework of Bartlett et al. (2007) who studied a similar heterogeneous setting but with stronger gradient feedback. Extending their framework to the bandit feedback setting requires novel ideas such as lifting the feasible domain and using a logarithmically homogeneous self-concordant barrier regularizer.

Via