
Vianney Perchet

CREST, ENSAE Paris

Optimal last-iterate convergence in matrix games with bandit feedback using the log-barrier

Apr 16, 2026

Côme Fiegel, Pierre Menard, Tadashi Kozuno, Michal Valko, Vianney Perchet

Abstract: We study the problem of learning minimax policies in zero-sum matrix games. Fiegel et al. (2025) recently showed that achieving last-iterate convergence in this setting is harder when the players are uncoupled, by proving a lower bound of $\Omega(t^{-1/4})$ on the exploitability gap. Some online mirror descent algorithms have been proposed in the literature for this problem, but none has attained this rate so far. We show that log-barrier regularization, combined with a dual-focused analysis, achieves this $\widetilde{O}(t^{-1/4})$ convergence with high probability. We additionally extend our idea to the setting of extensive-form games, proving a bound with the same rate.
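To make the quantity bounded in this abstract concrete, here is a minimal sketch of the exploitability gap of a strategy pair in a zero-sum matrix game. It assumes the standard duality-gap definition (best-response gain of the row player minus best-response value of the column player); the paper's exact normalization may differ. The function name `exploitability` and the matching-pennies example are illustrative choices, not taken from the paper.

```python
def exploitability(payoff, x, y):
    """Exploitability gap of the strategy pair (x, y) in a zero-sum matrix game.

    payoff[i][j] is the row player's gain (and the column player's loss)
    when the row player picks action i and the column player picks action j.
    Assumption: this is the usual duality gap, not necessarily the paper's
    exact definition.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    # Expected payoff of each pure row action against the column strategy y.
    row_values = [sum(payoff[i][j] * y[j] for j in range(n_cols)) for i in range(n_rows)]
    # Expected payoff conceded to each pure column action by the row strategy x.
    col_values = [sum(x[i] * payoff[i][j] for i in range(n_rows)) for j in range(n_cols)]
    # The gap is zero exactly when (x, y) is a minimax (Nash) pair.
    return max(row_values) - min(col_values)


# Matching pennies: the uniform pair is the unique equilibrium.
matching_pennies = [[1.0, -1.0], [-1.0, 1.0]]
gap_uniform = exploitability(matching_pennies, [0.5, 0.5], [0.5, 0.5])
gap_pure = exploitability(matching_pennies, [1.0, 0.0], [1.0, 0.0])
```

Last-iterate convergence means that this gap, evaluated at the strategies actually played at round $t$ (rather than at their time average), goes to zero.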

Covariance-adapting algorithm for semi-bandits with application to sparse rewards

Apr 15, 2026

Pierre Perrault, Vianney Perchet, Michal Valko

Abstract: We investigate stochastic combinatorial semi-bandits, where the entire joint distribution of outcomes impacts the complexity of the problem instance (unlike in standard bandits). Typical distributions considered depend on specific parameter values, whose prior knowledge is required in theory but quite difficult to estimate in practice; an example is the commonly assumed sub-Gaussian family. We alleviate this issue by instead considering a new general family of sub-exponential distributions, which contains bounded and Gaussian ones. We prove a new lower bound on the expected regret over this family, parameterized by the unknown covariance matrix of outcomes, a tighter quantity than the sub-Gaussian matrix. We then construct an algorithm that uses covariance estimates, and provide a tight asymptotic analysis of the regret. Finally, we apply and extend our results to the family of sparse outcomes, which has applications in many recommender systems.

* Proceedings of the 33rd Annual Conference on Learning Theory (COLT 2020), PMLR 125, 2020

Learning in Prophet Inequalities with Noisy Observations

Apr 02, 2026

Jung-hun Kim, Vianney Perchet

Abstract: We study the prophet inequality, a fundamental problem in online decision-making and optimal stopping, in a practical setting where rewards are observed only through noisy realizations and reward distributions are unknown. At each stage, the decision-maker receives a noisy reward whose true value follows a linear model with an unknown latent parameter, and observes a feature vector drawn from a distribution. To address this challenge, we propose algorithms that integrate learning and decision-making via lower-confidence-bound (LCB) thresholding. In the i.i.d. setting, we establish that both an Explore-then-Decide strategy and an $\varepsilon$-Greedy variant achieve the sharp competitive ratio of $1 - 1/e$, under a mild condition on the optimal value. For non-identical distributions, we show that a competitive ratio of $1/2$ can be guaranteed against a relaxed benchmark. Moreover, with limited window access to past rewards, the tight ratio of $1/2$ against the optimal benchmark is achieved.

* ICLR 2026

Learning to Allocate Resources with Censored Feedback

Feb 06, 2026

Giovanni Montanari, Côme Fiegel, Corentin Pla, Aadirupa Saha, Vianney Perchet

Abstract: We study the online resource allocation problem in which, at each round, a budget $B$ must be allocated across $K$ arms under censored feedback. An arm yields a reward if and only if two conditions are satisfied: (i) the arm is activated according to an arm-specific Bernoulli random variable with unknown parameter, and (ii) the allocated budget exceeds a random threshold drawn from a parametric distribution with unknown parameter. Over $T$ rounds, the learner must jointly estimate the unknown parameters and allocate the budget so as to maximize cumulative reward, facing the exploration-exploitation trade-off. We prove an information-theoretic regret lower bound of $\Omega(T^{1/3})$, demonstrating the intrinsic difficulty of the problem. We then propose RA-UCB, an optimistic algorithm that leverages non-trivial parameter estimation and confidence bounds. When the budget $B$ is known at the beginning of each round, RA-UCB achieves a regret of order $\widetilde{\mathcal{O}}(\sqrt{T})$, and even $\mathcal{O}(\mathrm{poly}\text{-}\log T)$ under stronger assumptions. For an unknown, round-dependent budget, we introduce MG-UCB, which allows within-round switching and infinitesimal allocations, and matches the regret guarantees of RA-UCB. We then validate our theoretical results through experiments on real-world datasets.

Multi-Armed Bandits with Minimum Aggregated Revenue Constraints

Oct 14, 2025

Ahmed Ben Yahmed, Hafedh El Ferchichi, Marc Abeille, Vianney Perchet

Abstract: We examine a multi-armed bandit problem with contextual information, where the objective is to ensure that each arm receives a minimum aggregated reward across contexts while simultaneously maximizing the total cumulative reward. This framework captures a broad class of real-world applications where fair revenue allocation is critical and contextual variation is inherent. The cross-context aggregation of minimum reward constraints, while enabling better performance and easier feasibility, introduces significant technical challenges, particularly the absence of closed-form optimal allocations typically available in standard MAB settings. We design and analyze algorithms that either optimistically prioritize performance or pessimistically enforce constraint satisfaction. For each algorithm, we derive problem-dependent upper bounds on both regret and constraint violations. Furthermore, we establish a lower bound demonstrating that the dependence on the time horizon in our results is optimal in general and revealing fundamental limitations of the free exploration principle leveraged in prior work.

Pareto-Optimality, Smoothness, and Stochasticity in Learning-Augmented One-Max-Search

Feb 08, 2025

Ziyad Benomar, Lorenzo Croissant, Vianney Perchet, Spyros Angelopoulos

Abstract: One-max search is a classic problem in online decision-making, in which a trader acts on a sequence of revealed prices and accepts one of them irrevocably to maximise its profit. The problem has been studied both in probabilistic and in worst-case settings, notably through competitive analysis, and more recently in learning-augmented settings in which the trader has access to a prediction on the sequence. However, existing approaches either lack smoothness, or do not achieve optimal worst-case guarantees: they do not attain the best possible trade-off between the consistency and the robustness of the algorithm. We close this gap by presenting the first algorithm that simultaneously achieves both of these important objectives. Furthermore, we show how to leverage the obtained smoothness to provide an analysis of one-max search in stochastic learning-augmented settings which capture randomness in both the observed prices and the prediction.
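For readers unfamiliar with the problem, the following sketch shows the classical prediction-free baseline from competitive analysis, not the learning-augmented algorithm of this paper. It assumes prices are known in advance to lie in an interval `[low, high]` with `0 < low <= high`; the function name and the example price sequences are illustrative only.

```python
import math


def reservation_price_search(prices, low, high):
    """Classical worst-case rule: accept the first price >= sqrt(low * high).

    Assumes every price lies in [low, high] with 0 < low <= high, both known
    beforehand. If no price reaches the threshold, the trader is forced to
    accept the last price of the sequence.
    """
    threshold = math.sqrt(low * high)
    for price in prices:
        if price >= threshold:
            return price  # irrevocable acceptance
    return prices[-1]


# With low=1 and high=100 the reservation price is 10.
accepted_early = reservation_price_search([3.0, 12.0, 80.0], low=1.0, high=100.0)
accepted_forced = reservation_price_search([3.0, 5.0, 2.0], low=1.0, high=100.0)
```

This rule guarantees a competitive ratio of $\sqrt{\text{high}/\text{low}}$ in the worst case. Learning-augmented variants, as studied in the abstract above, adjust the acceptance threshold using a prediction, trading off consistency (performance when the prediction is accurate) against robustness (performance when it is not).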

Strategic Multi-Armed Bandit Problems Under Debt-Free Reporting

Jan 27, 2025

Ahmed Ben Yahmed, Clément Calauzènes, Vianney Perchet

Abstract: We consider the classical multi-armed bandit problem, but with strategic arms. In this context, each arm is characterized by a bounded support reward distribution and strategically aims to maximize its own utility by potentially retaining a portion of its reward, and disclosing only a fraction of it to the learning agent. This scenario unfolds as a game over $T$ rounds, leading to a competition of objectives between the learning agent, aiming to minimize their regret, and the arms, motivated by the desire to maximize their individual utilities. To address these dynamics, we introduce a new mechanism that establishes an equilibrium wherein each arm behaves truthfully and discloses as much of its rewards as possible. With this mechanism, the agent can attain the second-highest average (true) reward among arms, with a cumulative regret bounded by $O(\log(T)/\Delta)$ (problem-dependent) or $O(\sqrt{T\log(T)})$ (worst-case).

Improved learning rates in multi-unit uniform price auctions

Jan 17, 2025

Marius Potfer, Dorian Baudry, Hugo Richard, Vianney Perchet, Cheng Wan

Abstract: Motivated by the strategic participation of electricity producers in the electricity day-ahead market, we study the problem of online learning in repeated multi-unit uniform price auctions, focusing on the adversarial opposing bid setting. The main contribution of this paper is the introduction of a new modeling of the bid space. Indeed, we prove that a learning algorithm leveraging the structure of this problem achieves a regret of $\tilde{O}(K^{4/3}T^{2/3})$ under bandit feedback, improving over the bound of $\tilde{O}(K^{7/4}T^{3/4})$ previously obtained in the literature. This improved regret rate is tight up to logarithmic terms. Inspired by electricity reserve markets, we further introduce a different feedback model under which all winning bids are revealed. This feedback interpolates between the full-information and bandit scenarios depending on the auctions' results. We prove that, under this feedback, the algorithm that we propose achieves regret $\tilde{O}(K^{5/2}\sqrt{T})$.

* NeurIPS 2024

Stable Matching with Ties: Approximation Ratios and Learning

Nov 05, 2024

Shiyun Lin, Simon Mauras, Nadav Merlis, Vianney Perchet

Abstract: We study the problem of matching markets with ties, where one side of the market does not necessarily have strict preferences over members of the other side. For example, workers do not always have strict preferences over jobs, students can give the same ranking to different schools, and so on. In particular, assume w.l.o.g. that workers' preferences are determined by their utility from being matched to each job, which might admit ties. Notably, in contrast to classical two-sided markets with strict preferences, there is no longer a single stable matching that simultaneously maximizes the utility for all workers. We aim to guarantee each worker the largest possible share of the utility in her best possible stable matching. We call the ratio between the worker's best possible stable utility and her assigned utility the Optimal Stable Share (OSS)-ratio. We first prove that distributions over stable matchings cannot guarantee an OSS-ratio that is sublinear in the number of workers. Instead, randomizing over possibly non-stable matchings, we show how to achieve a tight logarithmic OSS-ratio. Then, we analyze the case where the real utility is not necessarily known and can only be approximated. In particular, we provide an algorithm that guarantees a similar fraction of the utility compared to the best possible utility. Finally, we move to a bandit setting, where we select a matching at each round and only observe the utilities for matches we perform. We show how to utilize our results for approximate utilities to gracefully interpolate between problems without ties and problems with statistical ties (small suboptimality gaps).

Strategic Arms with Side Communication Prevail Over Low-Regret MAB Algorithms

Aug 30, 2024

Ahmed Ben Yahmed, Clément Calauzènes, Vianney Perchet

Abstract: In the strategic multi-armed bandit setting, when arms possess perfect information about the player's behavior, they can establish an equilibrium where (1) they retain almost all of their value, and (2) they leave the player with a substantial (linear) regret. This study illustrates that, even if complete information is not publicly available to all arms but is shared among them, it is possible to achieve a similar equilibrium. The primary challenge lies in designing a communication protocol that incentivizes the arms to communicate truthfully.

* ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp.7435-7439
