Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Asuman Ozdaglar

Post-Training LLMs as Better Decision-Making Agents: A Regret-Minimization Approach

Nov 06, 2025

Chanwoo Park, Ziyang Chen, Asuman Ozdaglar, Kaiqing Zhang

Abstract:Large language models (LLMs) are increasingly deployed as "agents" for decision-making (DM) in interactive and dynamic environments. Yet, since they were not originally designed for DM, recent studies show that LLMs can struggle even in basic online DM problems, failing to achieve low regret or an effective exploration-exploitation tradeoff. To address this, we introduce Iterative Regret-Minimization Fine-Tuning (Iterative RMFT), a post-training procedure that repeatedly distills low-regret decision trajectories back into the base model. At each iteration, the model rolls out multiple decision trajectories, selects the k-lowest regret ones, and fine-tunes itself on them. Unlike prior methods that (a) distill action sequences from known DM algorithms or (b) rely on manually crafted chain-of-thought templates, our approach leverages the regret metric to elicit the model's own DM ability and reasoning rationales. This reliance on model-generated reasoning avoids rigid output engineering and provides more flexible, natural-language training signals. Empirical results show that Iterative RMFT improves LLMs' DM performance across diverse models - from Transformers with numerical input/output, to open-weight LLMs, and advanced closed-weight models like GPT-4o mini. Its flexibility in output and reasoning formats enables generalization across tasks with varying horizons, action spaces, reward processes, and natural-language contexts. Finally, we provide theoretical insight showing that a single-layer Transformer under this paradigm can act as a no-regret learner in a simplified setting. Overall, Iterative RMFT offers a principled and general post-training framework for enhancing LLMs' decision-making capabilities.

Via

Access Paper or Ask Questions

Population-Proportional Preference Learning from Human Feedback: An Axiomatic Approach

Jun 05, 2025

Kihyun Kim, Jiawei Zhang, Asuman Ozdaglar, Pablo A. Parrilo

Figure 1 for Population-Proportional Preference Learning from Human Feedback: An Axiomatic Approach

Figure 2 for Population-Proportional Preference Learning from Human Feedback: An Axiomatic Approach

Figure 3 for Population-Proportional Preference Learning from Human Feedback: An Axiomatic Approach

Figure 4 for Population-Proportional Preference Learning from Human Feedback: An Axiomatic Approach

Abstract:Conventional preference learning methods often prioritize opinions held more widely when aggregating preferences from multiple evaluators. This may result in policies that are biased in favor of some types of opinions or groups. The objective of this paper is to develop a novel preference learning framework capable of aligning aggregate opinions and policies proportionally with the true population distribution of evaluator preferences. Our approach infers the feasible set of evaluator population distributions directly from pairwise comparison data. Using these estimates, the algorithm constructs a policy that satisfies foundational axioms from social choice theory, namely monotonicity and Pareto efficiency, as well as our newly-introduced axioms of population-proportional representation and population-bounded robustness. We propose a soft-max relaxation method that smoothly trade-offs population-proportional representation with the selection of the Condorcet winner (which beats all other options in pairwise comparisons). Finally, we validate the effectiveness and scalability of our approach through experiments on both tabular recommendation tasks and large-scale language model alignment.

Via

Access Paper or Ask Questions

What Data Enables Optimal Decisions? An Exact Characterization for Linear Optimization

May 27, 2025

Omar Bennouna, Amine Bennouna, Saurabh Amin, Asuman Ozdaglar

Figure 1 for What Data Enables Optimal Decisions? An Exact Characterization for Linear Optimization

Figure 2 for What Data Enables Optimal Decisions? An Exact Characterization for Linear Optimization

Abstract:We study the fundamental question of how informative a dataset is for solving a given decision-making task. In our setting, the dataset provides partial information about unknown parameters that influence task outcomes. Focusing on linear programs, we characterize when a dataset is sufficient to recover an optimal decision, given an uncertainty set on the cost vector. Our main contribution is a sharp geometric characterization that identifies the directions of the cost vector that matter for optimality, relative to the task constraints and uncertainty set. We further develop a practical algorithm that, for a given task, constructs a minimal or least-costly sufficient dataset. Our results reveal that small, well-chosen datasets can often fully determine optimal decisions -- offering a principled foundation for task-aware data selection.

Via

Access Paper or Ask Questions

UFT: Unifying Supervised and Reinforcement Fine-Tuning

May 22, 2025

Mingyang Liu, Gabriele Farina, Asuman Ozdaglar

Figure 1 for UFT: Unifying Supervised and Reinforcement Fine-Tuning

Figure 2 for UFT: Unifying Supervised and Reinforcement Fine-Tuning

Figure 3 for UFT: Unifying Supervised and Reinforcement Fine-Tuning

Figure 4 for UFT: Unifying Supervised and Reinforcement Fine-Tuning

Abstract:Post-training has demonstrated its importance in enhancing the reasoning capabilities of large language models (LLMs). The primary post-training methods can be categorized into supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). SFT is efficient and well-suited for small language models, but it may lead to overfitting and limit the reasoning abilities of larger models. In contrast, RFT generally yields better generalization but depends heavily on the strength of the base model. To address the limitations of SFT and RFT, we propose Unified Fine-Tuning (UFT), a novel post-training paradigm that unifies SFT and RFT into a single, integrated process. UFT enables the model to effectively explore solutions while incorporating informative supervision signals, bridging the gap between memorizing and thinking underlying existing methods. Notably, UFT outperforms both SFT and RFT in general, regardless of model sizes. Furthermore, we theoretically prove that UFT breaks RFT's inherent exponential sample complexity bottleneck, showing for the first time that unified training can exponentially accelerate convergence on long-horizon reasoning tasks.

Via

Access Paper or Ask Questions

Differentially Private Equilibrium Finding in Polymatrix Games

Mar 12, 2025

Mingyang Liu, Gabriele Farina, Asuman Ozdaglar

Figure 1 for Differentially Private Equilibrium Finding in Polymatrix Games

Figure 2 for Differentially Private Equilibrium Finding in Polymatrix Games

Figure 3 for Differentially Private Equilibrium Finding in Polymatrix Games

Figure 4 for Differentially Private Equilibrium Finding in Polymatrix Games

Abstract:We study equilibrium finding in polymatrix games under differential privacy constraints. To start, we show that high accuracy and asymptotically vanishing differential privacy budget (as the number of players goes to infinity) cannot be achieved simultaneously under either of the two settings: (i) We seek to establish equilibrium approximation guarantees in terms of Euclidean distance to the equilibrium set, and (ii) the adversary has access to all communication channels. Then, assuming the adversary has access to a constant number of communication channels, we develop a novel distributed algorithm that recovers strategies with simultaneously vanishing Nash gap (in expected utility, also referred to as exploitability and privacy budget as the number of players increases.

Via

Access Paper or Ask Questions

MAPoRL: Multi-Agent Post-Co-Training for Collaborative Large Language Models with Reinforcement Learning

Feb 25, 2025

Chanwoo Park, Seungju Han, Xingzhi Guo, Asuman Ozdaglar, Kaiqing Zhang, Joo-Kyung Kim

Figure 1 for MAPoRL: Multi-Agent Post-Co-Training for Collaborative Large Language Models with Reinforcement Learning

Figure 2 for MAPoRL: Multi-Agent Post-Co-Training for Collaborative Large Language Models with Reinforcement Learning

Figure 3 for MAPoRL: Multi-Agent Post-Co-Training for Collaborative Large Language Models with Reinforcement Learning

Figure 4 for MAPoRL: Multi-Agent Post-Co-Training for Collaborative Large Language Models with Reinforcement Learning

Abstract:Leveraging multiple large language models (LLMs) to build collaborative multi-agentic workflows has demonstrated significant potential. However, most previous studies focus on prompting the out-of-the-box LLMs, relying on their innate capability for collaboration, which may not improve LLMs' performance as shown recently. In this paper, we introduce a new post-training paradigm MAPoRL (Multi-Agent Post-co-training for collaborative LLMs with Reinforcement Learning), to explicitly elicit the collaborative behaviors and further unleash the power of multi-agentic LLM frameworks. In MAPoRL, multiple LLMs first generate their own responses independently and engage in a multi-turn discussion to collaboratively improve the final answer. In the end, a MAPoRL verifier evaluates both the answer and the discussion, by assigning a score that verifies the correctness of the answer, while adding incentives to encourage corrective and persuasive discussions. The score serves as the co-training reward, and is then maximized through multi-agent RL. Unlike existing LLM post-training paradigms, MAPoRL advocates the co-training of multiple LLMs together using RL for better generalization. Accompanied by analytical insights, our experiments demonstrate that training individual LLMs alone is insufficient to induce effective collaboration. In contrast, multi-agent co-training can boost the collaboration performance across benchmarks, with generalization to unseen domains.

Via

Access Paper or Ask Questions

Last-Iterate Convergence of Payoff-Based Independent Learning in Zero-Sum Stochastic Games

Sep 02, 2024

Zaiwei Chen, Kaiqing Zhang, Eric Mazumdar, Asuman Ozdaglar, Adam Wierman

Abstract:In this paper, we consider two-player zero-sum matrix and stochastic games and develop learning dynamics that are payoff-based, convergent, rational, and symmetric between the two players. Specifically, the learning dynamics for matrix games are based on the smoothed best-response dynamics, while the learning dynamics for stochastic games build upon those for matrix games, with additional incorporation of the minimax value iteration. To our knowledge, our theoretical results present the first finite-sample analysis of such learning dynamics with last-iterate guarantees. In the matrix game setting, the results imply a sample complexity of $O(\epsilon^{-1})$ to find the Nash distribution and a sample complexity of $O(\epsilon^{-8})$ to find a Nash equilibrium. In the stochastic game setting, the results also imply a sample complexity of $O(\epsilon^{-8})$ to find a Nash equilibrium. To establish these results, the main challenge is to handle stochastic approximation algorithms with multiple sets of coupled and stochastic iterates that evolve on (possibly) different time scales. To overcome this challenge, we developed a coupled Lyapunov-based approach, which may be of independent interest to the broader community studying the convergence behavior of stochastic approximation algorithms.

* arXiv admin note: substantial text overlap with arXiv:2303.03100; text overlap with arXiv:2312.08008 by other authors

Via

Access Paper or Ask Questions

A Policy-Gradient Approach to Solving Imperfect-Information Games with Iterate Convergence

Aug 01, 2024

Mingyang Liu, Gabriele Farina, Asuman Ozdaglar

Figure 1 for A Policy-Gradient Approach to Solving Imperfect-Information Games with Iterate Convergence

Figure 2 for A Policy-Gradient Approach to Solving Imperfect-Information Games with Iterate Convergence

Figure 3 for A Policy-Gradient Approach to Solving Imperfect-Information Games with Iterate Convergence

Abstract:Policy gradient methods have become a staple of any single-agent reinforcement learning toolbox, due to their combination of desirable properties: iterate convergence, efficient use of stochastic trajectory feedback, and theoretically-sound avoidance of importance sampling corrections. In multi-agent imperfect-information settings (extensive-form games), however, it is still unknown whether the same desiderata can be guaranteed while retaining theoretical guarantees. Instead, sound methods for extensive-form games rely on approximating counterfactual values (as opposed to Q values), which are incompatible with policy gradient methodologies. In this paper, we investigate whether policy gradient can be safely used in two-player zero-sum imperfect-information extensive-form games (EFGs). We establish positive results, showing for the first time that a policy gradient method leads to provable best-iterate convergence to a regularized Nash equilibrium in self-play.

Via

Access Paper or Ask Questions

LiteEFG: An Efficient Python Library for Solving Extensive-form Games

Jul 29, 2024

Mingyang Liu, Gabriele Farina, Asuman Ozdaglar

Abstract:LiteEFG is an efficient library with easy-to-use Python bindings, which can solve multiplayer extensive-form games (EFGs). LiteEFG enables the user to express computation graphs in Python to define updates on the game tree structure. The graph is then executed by the C++ backend, leading to significant speedups compared to running the algorithm in Python. Moreover, in LiteEFG, the user needs to only specify the computation graph of the update rule in a decision node of the game, and LiteEFG will automatically distribute the update rule to each decision node and handle the structure of the imperfect-information game.

Via

Access Paper or Ask Questions

Finite-Sample Guarantees for Best-Response Learning Dynamics in Zero-Sum Matrix Games

Jul 29, 2024

Fathima Zarin Faizal, Asuman Ozdaglar, Martin J. Wainwright

Abstract:We study best-response type learning dynamics for two player zero-sum matrix games. We consider two settings that are distinguished by the type of information that each player has about the game and their opponent's strategy. The first setting is the full information case, in which each player knows their own and the opponent's payoff matrices and observes the opponent's mixed strategy. The second setting is the minimal information case, where players do not observe the opponent's strategy and are not aware of either of the payoff matrices (instead they only observe their realized payoffs). For this setting, also known as the radically uncoupled case in the learning in games literature, we study a two-timescale learning dynamics that combine smoothed best-response type updates for strategy estimates with a TD-learning update to estimate a local payoff function. For these dynamics, without additional exploration, we provide polynomial-time finite-sample guarantees for convergence to an $\epsilon$-Nash equilibrium.

* 35 pages; under review

Via

Access Paper or Ask Questions