Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Milind Tambe

Generative Frontier Planning for Adaptive Peer-Referral Recruitment under Covariate-Dependent Arrivals

Jun 06, 2026

Lingkai Kong, Hezi Jiang, Andrew Ma, Keyu Wang, Akseli Kangaslahti, Milind Tambe

Abstract:Peer-referral recruitment systems such as respondent-driven sampling are critical for studying and intervening on hidden populations affected by infectious diseases. To accelerate recruitment, public health agencies must adaptively allocate limited referral resources across multiple rounds, where current decisions shape both the number and the covariates of future recruits. Prior work makes this problem tractable by assuming that referrals are drawn i.i.d.\ from a homogeneous population, an assumption that ignores the homophily and shared context that drive real peer recruitment. We instead consider a more realistic model in which both referral capacity and the covariates of newly referred individuals are conditioned on the referrer, learned from data with a censored count model and a conditional generative model. The resulting planning problem is challenging because each candidate allocation induces a different distribution over future recruits. We propose \emph{Generative Frontier Planning} (GFP), a model-based planner that replaces per-step Monte-Carlo sampling with a deterministic backup over a latent covariate-coverage value surrogate. The surrogate is designed so that the expected value of the next frontier depends on the offspring generative model only through finite-dimensional summaries that are amortized offline, and so that the resulting per-round objective is monotone with diminishing returns. Together, these two properties make planning tractable: the deterministic backup eliminates Monte-Carlo sampling, and the diminishing-returns structure lets a marginal greedy allocation achieve a $(1-1/e)$-approximation for the per-round problem. On a simulation environment calibrated to a real respondent-driven sampling dataset, GFP outperforms random, reinforcement-learning, and i.i.d.\ dynamic-programming baselines across four discount factors.

Via

Access Paper or Ask Questions

Generating Robust Portfolios of Optimization Models using Large Language Models

May 26, 2026

Eleni Straitouri, Cheol Woo Kim, Milind Tambe

Abstract:Mathematical optimization is a powerful tool for structured decision-making across domains such as resource allocation and planning. Formulating optimization models faithful to reality, though, remains a significant bottleneck as it typically demands both domain expertise and optimization knowledge that are often scarce. Recent advances in large language models (LLMs) promise to bridge this gap, enabling the generation of candidate optimization models from natural language descriptions. However, there is no guarantee that any single LLM-generated model is reliable, and existing approaches that output only one model are therefore risky. In this work, we propose a novel algorithm that generates a portfolio of optimization models, designed to be robust to the limitations of LLMs. Our method exploits the observation that a single LLM can play two distinct roles $\unicode{x2014}$ as a stochastic generator and as a reasoning evaluator $\unicode{x2014}$ and proposes a unified framework that leverages both capabilities in a complementary manner. We provide theoretical guarantees showing that, as long as either the generator or the evaluator is well-aligned with human preferences, the portfolio is guaranteed to contain high-quality candidates, enabling a principled human-in-the-loop process in which a decision-maker can review multiple candidates before committing to one. We further validate our approach empirically, demonstrating strong performance across a range of optimization modeling tasks.

* Accepted at the ICML 2026 LM4Plan Workshop

Via

Access Paper or Ask Questions

Bilevel Optimization of Synthetic Trajectories for Multi-Turn LLM Fine-Tuning

May 23, 2026

Shresth Verma, Mauricio Tec, Cheol Woo Kim, Kai Wang, Milind Tambe

Abstract:While LLMs excel at single-turn generation, they struggle with long-horizon, multi-turn interactions. Offline reinforcement learning (RL) offers a scalable approach, yet its performance hinges on the availability and quality of multi-turn trajectory data. A common remedy is to augment training with synthetic trajectories generated by LLMs or simulators, but synthetic data is highly heterogeneous in quality, and naively treating all trajectories as equally informative can degrade performance. We propose BOOST, a bilevel optimization framework where the inner level trains the LLM on reweighted data and the outer level trains a lightweight reweighting head on held-out real validation tasks, assigning continuous trajectory-level weights without requiring an external judge. To ground this approach, we derive a PAC-Bayesian bound revealing a three-way trade-off: synthetic data increases diversity but risks task-shift, while concentrating weight on high-quality trajectories improves empirical performance at the cost of effective sample size. Empirically, our method consistently outperforms multiple baselines. Analysis reveals it upweights synthetic trajectories that align with the real data distribution and exhibit higher qualitative merit.

Via

Access Paper or Ask Questions

Adaptive Multi-Round Allocation with Stochastic Arrivals

May 12, 2026

Yuqi Pan, Davin Choo, Haichuan Wang, Milind Tambe, Alastair van Heerden, Cheryl Johnson

Abstract:We study a sequential resource allocation problem motivated by adaptive network recruitment, in which a limited budget of identical resources must be allocated over multiple rounds to individuals with stochastic referral capacity. Successful referrals endogenously generate future decision opportunities while allocating additional resources to an individual exhibits diminishing returns. We first show that the single-round allocation problem admits an exact greedy solution based on marginal survival probabilities. In the multi-round setting, the resulting Bellman recursion is intractable due to the stochastic, high-dimensional evolution of the frontier. To address this, we introduce a population-level surrogate value function that depends only on the remaining budget and frontier size. This surrogate enables an exact dynamic program via truncated probability generating functions, yielding a planning algorithm with polynomial complexity in the total budget. We further analyze robustness under model misspecification, proving a multi-round error bound that decomposes into a tight single-round frontier error and a population-level transition error. Finally, we evaluate our method on real-world inspired recruitment scenarios.

* Accepted into ICML 2026

Via

Access Paper or Ask Questions

Efficient Ensemble Selection from Binary and Pairwise Feedback

May 10, 2026

Tzeh Yuan Neoh, Nicholas Teh, Je Qin Chooi, Paul W. Goldberg, Milind Tambe

Abstract:Organizations increasingly deploy multiple AI systems across task domains, but selecting a small, high-performing ensemble can require costly model calls, benchmark runs, and human evaluation. We study this selection problem as a distributional variant of multiwinner voting: tasks are drawn from an unknown domain distribution, each task induces feedback over candidate experts, and a committee's value on a task is determined by its best-performing member. We analyze both binary feedback, for tasks with correct/incorrect outcomes, and pairwise feedback, for tasks where candidate outputs are compared by preference. In the binary setting, the induced objective is coverage. We give exhaustive-elicitation baselines and matching worst-case query lower bounds, and we design a failure-conditioned greedy algorithm that preserves the standard $(1-1/e)$ guarantee while obtaining instance-dependent query savings. In the pairwise setting, we study $θ$-winning committees. We show that full-information optimization admits a PTAS but no EPTAS under Gap-ETH, and that the objective is monotone but not submodular. This motivates a weighted ordinal coverage relaxation, which is submodular and supports a failure-conditioned greedy oracle under pairwise feedback. We then convert this oracle back into $θ$-type guarantees through finite-family auditing or a minimax wrapper. We also provide small-scale LLM experiments illustrating the predicted query savings and the role of complementarity in committee selection.

Via

Access Paper or Ask Questions

Decisions and Deployment: The Five-Year SAHELI Project (2020-2025) on Restless Multi-Armed Bandits for Improving Maternal and Child Health

Apr 08, 2026

Shresth Verma, Arpan Dasgupta, Neha Madhiwalla, Aparna Taneja, Milind Tambe

Abstract:Maternal and child health is a critical concern around the world. In many global health programs disseminating preventive care and health information, limited healthcare worker resources prevent continuous, personalised engagement with vulnerable beneficiaries. In such scenarios, it becomes crucial to optimally schedule limited live-service resources to maximise long-term engagement. To address this fundamental challenge, the multi-year SAHELI project (2020-2025), in collaboration with partner NGO ARMMAN, leverages AI to allocate scarce resources in a maternal and child health program in India. The SAHELI system solves this sequential resource allocation problem using a Restless Multi-Armed Bandit (RMAB) framework. A key methodological innovation is the transition from a traditional Two-Stage "predict-then-optimize" approach to Decision-Focused Learning (DFL), which directly aligns the framework's learning method with the ultimate goal of maximizing beneficiary engagement. Empirical evaluation through large-scale randomized controlled trials demonstrates that the DFL policy reduced cumulative engagement drops by 31% relative to the current standard of care, significantly outperforming the Two-Stage model. Crucially, the studies also confirmed that this increased program engagement translates directly into statistically significant improvements in real-world health behaviors, notably the continued consumption of vital iron and calcium supplements by new mothers. Ultimately, the SAHELI project provides a scalable blueprint for applying sequential decision-making AI to optimize resource allocation in health programs.

Via

Access Paper or Ask Questions

Many Preferences, Few Policies: Towards Scalable Language Model Personalization

Apr 05, 2026

Cheol Woo Kum, Jai Moondra, Roozbeh Nahavandi, Andrew Perrault, Milind Tambe, Swati Gupta

Abstract:The holy grail of LLM personalization is a single LLM for each user, perfectly aligned with that user's preferences. However, maintaining a separate LLM per user is impractical due to constraints on compute, memory, and system complexity. We address this challenge by developing a principled method for selecting a small portfolio of LLMs that captures representative behaviors across heterogeneous users. We model user preferences across multiple traits (e.g., safety, humor, brevity) through a multi-dimensional weight vector. Given reward functions across these dimensions, our algorithm PALM (Portfolio of Aligned LLMs) generates a small portfolio of LLMs such that, for any weight vector, the portfolio contains a near-optimal LLM for the corresponding scalarized objective. To the best of our knowledge, this is the first result that provides theoretical guarantees on both the size and approximation quality of LLM portfolios for personalization. It characterizes the trade-off between system cost and personalization, as well as the diversity of LLMs required to cover the landscape of user preferences. We provide empirical results that validate these guarantees and demonstrate greater output diversity over common baselines.

Via

Access Paper or Ask Questions

LLM Active Alignment: A Nash Equilibrium Perspective

Feb 06, 2026

Tonghan Wang, Yuqi Pan, Xinyi Yang, Yanchen Jiang, Milind Tambe, David C. Parkes

Abstract:We develop a game-theoretic framework for predicting and steering the behavior of populations of large language models (LLMs) through Nash equilibrium (NE) analysis. To avoid the intractability of equilibrium computation in open-ended text spaces, we model each agent's action as a mixture over human subpopulations. Agents choose actively and strategically which groups to align with, yielding an interpretable and behaviorally substantive policy class. We derive closed-form NE characterizations, adopting standard concave-utility assumptions to enable analytical system-level predictions and give explicit, actionable guidance for shifting alignment targets toward socially desirable outcomes. The method functions as an active alignment layer on top of existing alignment pipelines such as RLHF. In a social-media setting, we show that a population of LLMs, especially reasoning-based models, may exhibit political exclusion, pathologies where some subpopulations are ignored by all LLM agents, which can be avoided by our method, illustrating the promise of applying the method to regulate multi-agent LLM dynamics across domains.

Via

Access Paper or Ask Questions

Incentive-Aware AI Safety via Strategic Resource Allocation: A Stackelberg Security Games Perspective

Feb 06, 2026

Cheol Woo Kim, Davin Choo, Tzeh Yuan Neoh, Milind Tambe

Abstract:As AI systems grow more capable and autonomous, ensuring their safety and reliability requires not only model-level alignment but also strategic oversight of the humans and institutions involved in their development and deployment. Existing safety frameworks largely treat alignment as a static optimization problem (e.g., tuning models to desired behavior) while overlooking the dynamic, adversarial incentives that shape how data are collected, how models are evaluated, and how they are ultimately deployed. We propose a new perspective on AI safety grounded in Stackelberg Security Games (SSGs): a class of game-theoretic models designed for adversarial resource allocation under uncertainty. By viewing AI oversight as a strategic interaction between defenders (auditors, evaluators, and deployers) and attackers (malicious actors, misaligned contributors, or worst-case failure modes), SSGs provide a unifying framework for reasoning about incentive design, limited oversight capacity, and adversarial uncertainty across the AI lifecycle. We illustrate how this framework can inform (1) training-time auditing against data/feedback poisoning, (2) pre-deployment evaluation under constrained reviewer resources, and (3) robust multi-model deployment in adversarial environments. This synthesis bridges algorithmic alignment and institutional oversight design, highlighting how game-theoretic deterrence can make AI oversight proactive, risk-aware, and resilient to manipulation.

Via

Access Paper or Ask Questions

Reward Shaping for Inference-Time Alignment: A Stackelberg Game Perspective

Jan 31, 2026

Haichuan Wang, Tao Lin, Lingkai Kong, Ce Li, Hezi Jiang, Milind Tambe

Abstract:Existing alignment methods directly use the reward model learned from user preference data to optimize an LLM policy, subject to KL regularization with respect to the base policy. This practice is suboptimal for maximizing user's utility because the KL regularization may cause the LLM to inherit the bias in the base policy that conflicts with user preferences. While amplifying rewards for preferred outputs can mitigate this bias, it also increases the risk of reward hacking. This tradeoff motivates the problem of optimally designing reward models under KL regularization. We formalize this reward model optimization problem as a Stackelberg game, and show that a simple reward shaping scheme can effectively approximate the optimal reward model. We empirically evaluate our method in inference-time alignment settings and demonstrate that it integrates seamlessly into existing alignment methods with minimal overhead. Our method consistently improves average reward and achieves win-tie rates exceeding 66% against all baselines, averaged across evaluation settings.

Via

Access Paper or Ask Questions