Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Zhaoran Wang

What and How does In-Context Learning Learn? Bayesian Model Averaging, Parameterization, and Generalization

May 30, 2023

Yufeng Zhang, Fengzhuo Zhang, Zhuoran Yang, Zhaoran Wang

Figure 1 for What and How does In-Context Learning Learn? Bayesian Model Averaging, Parameterization, and Generalization

Abstract:In this paper, we conduct a comprehensive study of In-Context Learning (ICL) by addressing several open questions: (a) What type of ICL estimator is learned within language models? (b) What are suitable performance metrics to evaluate ICL accurately and what are the error rates? (c) How does the transformer architecture enable ICL? To answer (a), we take a Bayesian view and demonstrate that ICL implicitly implements the Bayesian model averaging algorithm. This Bayesian model averaging algorithm is proven to be approximately parameterized by the attention mechanism. For (b), we analyze the ICL performance from an online learning perspective and establish a regret bound $\mathcal{O}(1/T)$, where $T$ is the ICL input sequence length. To address (c), in addition to the encoded Bayesian model averaging algorithm in attention, we show that during pertaining, the total variation distance between the learned model and the nominal model is bounded by a sum of an approximation error and a generalization error of $\tilde{\mathcal{O}}(1/\sqrt{N_{\mathrm{p}}T_{\mathrm{p}}})$, where $N_{\mathrm{p}}$ and $T_{\mathrm{p}}$ are the number of token sequences and the length of each sequence in pretraining, respectively. Our results provide a unified understanding of the transformer and its ICL ability with bounds on ICL regret, approximation, and generalization, which deepens our knowledge of these essential aspects of modern language models.

Via

Access Paper or Ask Questions

One Objective to Rule Them All: A Maximization Objective Fusing Estimation and Planning for Exploration

May 29, 2023

Zhihan Liu, Miao Lu, Wei Xiong, Han Zhong, Hao Hu, Shenao Zhang, Sirui Zheng, Zhuoran Yang, Zhaoran Wang

Figure 1 for One Objective to Rule Them All: A Maximization Objective Fusing Estimation and Planning for Exploration

Figure 2 for One Objective to Rule Them All: A Maximization Objective Fusing Estimation and Planning for Exploration

Figure 3 for One Objective to Rule Them All: A Maximization Objective Fusing Estimation and Planning for Exploration

Figure 4 for One Objective to Rule Them All: A Maximization Objective Fusing Estimation and Planning for Exploration

Abstract:In online reinforcement learning (online RL), balancing exploration and exploitation is crucial for finding an optimal policy in a sample-efficient way. To achieve this, existing sample-efficient online RL algorithms typically consist of three components: estimation, planning, and exploration. However, in order to cope with general function approximators, most of them involve impractical algorithmic components to incentivize exploration, such as optimization within data-dependent level-sets or complicated sampling procedures. To address this challenge, we propose an easy-to-implement RL framework called \textit{Maximize to Explore} (\texttt{MEX}), which only needs to optimize \emph{unconstrainedly} a single objective that integrates the estimation and planning components while balancing exploration and exploitation automatically. Theoretically, we prove that \texttt{MEX} achieves a sublinear regret with general function approximations for Markov decision processes (MDP) and is further extendable to two-player zero-sum Markov games (MG). Meanwhile, we adapt deep RL baselines to design practical versions of \texttt{MEX}, in both model-free and model-based manners, which can outperform baselines by a stable margin in various MuJoCo environments with sparse rewards. Compared with existing sample-efficient online RL algorithms with general function approximations, \texttt{MEX} achieves similar sample efficiency while enjoying a lower computational cost and is more compatible with modern deep RL methods.

Via

Access Paper or Ask Questions

Local Optimization Achieves Global Optimality in Multi-Agent Reinforcement Learning

May 08, 2023

Yulai Zhao, Zhuoran Yang, Zhaoran Wang, Jason D. Lee

Figure 1 for Local Optimization Achieves Global Optimality in Multi-Agent Reinforcement Learning

Abstract:Policy optimization methods with function approximation are widely used in multi-agent reinforcement learning. However, it remains elusive how to design such algorithms with statistical guarantees. Leveraging a multi-agent performance difference lemma that characterizes the landscape of multi-agent policy optimization, we find that the localized action value function serves as an ideal descent direction for each local policy. Motivated by the observation, we present a multi-agent PPO algorithm in which the local policy of each agent is updated similarly to vanilla PPO. We prove that with standard regularity conditions on the Markov game and problem-dependent quantities, our algorithm converges to the globally optimal policy at a sublinear rate. We extend our algorithm to the off-policy setting and introduce pessimism to policy evaluation, which aligns with experiments. To our knowledge, this is the first provably convergent multi-agent PPO algorithm in cooperative Markov games.

* ICML 2023

Via

Access Paper or Ask Questions

Dynamic Datasets and Market Environments for Financial Reinforcement Learning

Apr 25, 2023

Xiao-Yang Liu, Ziyi Xia, Hongyang Yang, Jiechao Gao, Daochen Zha, Ming Zhu, Christina Dan Wang, Zhaoran Wang, Jian Guo

Figure 1 for Dynamic Datasets and Market Environments for Financial Reinforcement Learning

Figure 2 for Dynamic Datasets and Market Environments for Financial Reinforcement Learning

Figure 3 for Dynamic Datasets and Market Environments for Financial Reinforcement Learning

Figure 4 for Dynamic Datasets and Market Environments for Financial Reinforcement Learning

Abstract:The financial market is a particularly challenging playground for deep reinforcement learning due to its unique feature of dynamic datasets. Building high-quality market environments for training financial reinforcement learning (FinRL) agents is difficult due to major factors such as the low signal-to-noise ratio of financial data, survivorship bias of historical data, and model overfitting. In this paper, we present FinRL-Meta, a data-centric and openly accessible library that processes dynamic datasets from real-world markets into gym-style market environments and has been actively maintained by the AI4Finance community. First, following a DataOps paradigm, we provide hundreds of market environments through an automatic data curation pipeline. Second, we provide homegrown examples and reproduce popular research papers as stepping stones for users to design new trading strategies. We also deploy the library on cloud platforms so that users can visualize their own results and assess the relative performance via community-wise competitions. Third, we provide dozens of Jupyter/Python demos organized into a curriculum and a documentation website to serve the rapidly growing community. The open-source codes for the data curation pipeline are available at https://github.com/AI4Finance-Foundation/FinRL-Meta

* 49 pages, 15 figures. arXiv admin note: substantial text overlap with arXiv:2211.03107

Via

Access Paper or Ask Questions

Offline RL with No OOD Actions: In-Sample Learning via Implicit Value Regularization

Mar 28, 2023

Haoran Xu, Li Jiang, Jianxiong Li, Zhuoran Yang, Zhaoran Wang, Victor Wai Kin Chan, Xianyuan Zhan

Figure 1 for Offline RL with No OOD Actions: In-Sample Learning via Implicit Value Regularization

Figure 2 for Offline RL with No OOD Actions: In-Sample Learning via Implicit Value Regularization

Figure 3 for Offline RL with No OOD Actions: In-Sample Learning via Implicit Value Regularization

Figure 4 for Offline RL with No OOD Actions: In-Sample Learning via Implicit Value Regularization

Abstract:Most offline reinforcement learning (RL) methods suffer from the trade-off between improving the policy to surpass the behavior policy and constraining the policy to limit the deviation from the behavior policy as computing $Q$-values using out-of-distribution (OOD) actions will suffer from errors due to distributional shift. The recently proposed \textit{In-sample Learning} paradigm (i.e., IQL), which improves the policy by quantile regression using only data samples, shows great promise because it learns an optimal policy without querying the value function of any unseen actions. However, it remains unclear how this type of method handles the distributional shift in learning the value function. In this work, we make a key finding that the in-sample learning paradigm arises under the \textit{Implicit Value Regularization} (IVR) framework. This gives a deeper understanding of why the in-sample learning paradigm works, i.e., it applies implicit value regularization to the policy. Based on the IVR framework, we further propose two practical algorithms, Sparse $Q$-learning (SQL) and Exponential $Q$-learning (EQL), which adopt the same value regularization used in existing works, but in a complete in-sample manner. Compared with IQL, we find that our algorithms introduce sparsity in learning the value function, making them more robust in noisy data regimes. We also verify the effectiveness of SQL and EQL on D4RL benchmark datasets and show the benefits of in-sample learning by comparing them with CQL in small data regimes.

* ICLR 2023 notable top 5%

Via

Access Paper or Ask Questions

A Unified Framework of Policy Learning for Contextual Bandit with Confounding Bias and Missing Observations

Mar 20, 2023

Siyu Chen, Yitan Wang, Zhaoran Wang, Zhuoran Yang

Abstract:We study the offline contextual bandit problem, where we aim to acquire an optimal policy using observational data. However, this data usually contains two deficiencies: (i) some variables that confound actions are not observed, and (ii) missing observations exist in the collected data. Unobserved confounders lead to a confounding bias and missing observations cause bias and inefficiency problems. To overcome these challenges and learn the optimal policy from the observed dataset, we present a new algorithm called Causal-Adjusted Pessimistic (CAP) policy learning, which forms the reward function as the solution of an integral equation system, builds a confidence set, and greedily takes action with pessimism. With mild assumptions on the data, we develop an upper bound to the suboptimality of CAP for the offline contextual bandit problem.

* 76 page, 5 figures

Via

Access Paper or Ask Questions

Finding Regularized Competitive Equilibria of Heterogeneous Agent Macroeconomic Models with Reinforcement Learning

Feb 24, 2023

Ruitu Xu, Yifei Min, Tianhao Wang, Zhaoran Wang, Michael I. Jordan, Zhuoran Yang

Figure 1 for Finding Regularized Competitive Equilibria of Heterogeneous Agent Macroeconomic Models with Reinforcement Learning

Figure 2 for Finding Regularized Competitive Equilibria of Heterogeneous Agent Macroeconomic Models with Reinforcement Learning

Abstract:We study a heterogeneous agent macroeconomic model with an infinite number of households and firms competing in a labor market. Each household earns income and engages in consumption at each time step while aiming to maximize a concave utility subject to the underlying market conditions. The households aim to find the optimal saving strategy that maximizes their discounted cumulative utility given the market condition, while the firms determine the market conditions through maximizing corporate profit based on the household population behavior. The model captures a wide range of applications in macroeconomic studies, and we propose a data-driven reinforcement learning framework that finds the regularized competitive equilibrium of the model. The proposed algorithm enjoys theoretical guarantees in converging to the equilibrium of the market at a sub-linear rate.

* 44 pages

Via

Access Paper or Ask Questions

Differentiable Arbitrating in Zero-sum Markov Games

Feb 20, 2023

Jing Wang, Meichen Song, Feng Gao, Boyi Liu, Zhaoran Wang, Yi Wu

Figure 1 for Differentiable Arbitrating in Zero-sum Markov Games

Figure 2 for Differentiable Arbitrating in Zero-sum Markov Games

Figure 3 for Differentiable Arbitrating in Zero-sum Markov Games

Figure 4 for Differentiable Arbitrating in Zero-sum Markov Games

Abstract:We initiate the study of how to perturb the reward in a zero-sum Markov game with two players to induce a desirable Nash equilibrium, namely arbitrating. Such a problem admits a bi-level optimization formulation. The lower level requires solving the Nash equilibrium under a given reward function, which makes the overall problem challenging to optimize in an end-to-end way. We propose a backpropagation scheme that differentiates through the Nash equilibrium, which provides the gradient feedback for the upper level. In particular, our method only requires a black-box solver for the (regularized) Nash equilibrium (NE). We develop the convergence analysis for the proposed framework with proper black-box NE solvers and demonstrate the empirical successes in two multi-agent reinforcement learning (MARL) environments.

Via

Access Paper or Ask Questions

An Analysis of Attention via the Lens of Exchangeability and Latent Variable Models

Dec 30, 2022

Yufeng Zhang, Boyi Liu, Qi Cai, Lingxiao Wang, Zhaoran Wang

Figure 1 for An Analysis of Attention via the Lens of Exchangeability and Latent Variable Models

Figure 2 for An Analysis of Attention via the Lens of Exchangeability and Latent Variable Models

Figure 3 for An Analysis of Attention via the Lens of Exchangeability and Latent Variable Models

Figure 4 for An Analysis of Attention via the Lens of Exchangeability and Latent Variable Models

Abstract:With the attention mechanism, transformers achieve significant empirical successes. Despite the intuitive understanding that transformers perform relational inference over long sequences to produce desirable representations, we lack a rigorous theory on how the attention mechanism achieves it. In particular, several intriguing questions remain open: (a) What makes a desirable representation? (b) How does the attention mechanism infer the desirable representation within the forward pass? (c) How does a pretraining procedure learn to infer the desirable representation through the backward pass? We observe that, as is the case in BERT and ViT, input tokens are often exchangeable since they already include positional encodings. The notion of exchangeability induces a latent variable model that is invariant to input sizes, which enables our theoretical analysis. - To answer (a) on representation, we establish the existence of a sufficient and minimal representation of input tokens. In particular, such a representation instantiates the posterior distribution of the latent variable given input tokens, which plays a central role in predicting output labels and solving downstream tasks. - To answer (b) on inference, we prove that attention with the desired parameter infers the latent posterior up to an approximation error, which is decreasing in input sizes. In detail, we quantify how attention approximates the conditional mean of the value given the key, which characterizes how it performs relational inference over long sequences. - To answer (c) on learning, we prove that both supervised and self-supervised objectives allow empirical risk minimization to learn the desired parameter up to a generalization error, which is independent of input sizes. Particularly, in the self-supervised setting, we identify a condition number that is pivotal to solving downstream tasks.

* 76 pages

Via

Access Paper or Ask Questions

Offline Policy Optimization in RL with Variance Regularizaton

Dec 29, 2022

Riashat Islam, Samarth Sinha, Homanga Bharadhwaj, Samin Yeasar Arnob, Zhuoran Yang, Animesh Garg, Zhaoran Wang, Lihong Li, Doina Precup

Figure 1 for Offline Policy Optimization in RL with Variance Regularizaton

Figure 2 for Offline Policy Optimization in RL with Variance Regularizaton

Figure 3 for Offline Policy Optimization in RL with Variance Regularizaton

Figure 4 for Offline Policy Optimization in RL with Variance Regularizaton

Abstract:Learning policies from fixed offline datasets is a key challenge to scale up reinforcement learning (RL) algorithms towards practical applications. This is often because off-policy RL algorithms suffer from distributional shift, due to mismatch between dataset and the target policy, leading to high variance and over-estimation of value functions. In this work, we propose variance regularization for offline RL algorithms, using stationary distribution corrections. We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer. The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithms. We show that the regularizer leads to a lower bound to the offline policy optimization objective, which can help avoid over-estimation errors, and explains the benefits of our approach across a range of continuous control domains when compared to existing state-of-the-art algorithms.

* Old Draft, Offline RL Workshop, NeurIPS'20;

Via

Access Paper or Ask Questions