Zhengling Qi

Off-policy Evaluation in Doubly Inhomogeneous Environments

Jun 14, 2023
Zeyu Bian, Chengchun Shi, Zhengling Qi, Lan Wang

This work studies off-policy evaluation (OPE) in scenarios where two key reinforcement learning (RL) assumptions, temporal stationarity and individual homogeneity, are both violated. To handle these "double inhomogeneities", we propose a class of latent factor models for the reward and observation transition functions, under which we develop a general OPE framework that consists of both model-based and model-free approaches. To our knowledge, this is the first paper to develop statistically sound OPE methods in offline RL with double inhomogeneities. It contributes to a deeper understanding of OPE in environments where standard RL assumptions are not met and provides several practical approaches for these settings. We establish the theoretical properties of the proposed value estimators and empirically show that our approach outperforms competing methods that ignore either temporal nonstationarity or individual heterogeneity. Finally, we illustrate our method on a data set from the Medical Information Mart for Intensive Care.
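
The latent factor idea can be illustrated with a small numerical sketch. The snippet below is a minimal, hypothetical illustration rather than the authors' implementation: it fits a rank-r two-way factor model, reward[i, t] ≈ u_i·v_t over individuals i and time points t, by alternating least squares; the rank, the ridge penalty, and all variable names are assumptions made for illustration.

```python
import numpy as np

# Minimal sketch (not the paper's implementation): fit a rank-r latent factor
# model R[i, t] ~ u_i . v_t to an individual-by-time reward matrix by
# alternating least squares. Rows index individuals, columns index time.
rng = np.random.default_rng(0)
n_individuals, n_times, rank = 50, 40, 3

# Simulated rewards with a low-rank signal plus noise (purely illustrative).
U_true = rng.normal(size=(n_individuals, rank))
V_true = rng.normal(size=(n_times, rank))
R = U_true @ V_true.T + 0.1 * rng.normal(size=(n_individuals, n_times))

# Alternating least squares: each update is a ridge-regularized regression.
U = rng.normal(size=(n_individuals, rank))
V = rng.normal(size=(n_times, rank))
lam = 1e-3
for _ in range(100):
    U = R @ V @ np.linalg.inv(V.T @ V + lam * np.eye(rank))
    V = R.T @ U @ np.linalg.inv(U.T @ U + lam * np.eye(rank))

print("reconstruction RMSE:", np.sqrt(np.mean((R - U @ V.T) ** 2)))
```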

A Policy Gradient Method for Confounded POMDPs

May 26, 2023
Mao Hong, Zhengling Qi, Yanxun Xu

In this paper, we propose a policy gradient method for confounded partially observable Markov decision processes (POMDPs) with continuous state and observation spaces in the offline setting. We first establish a novel identification result for non-parametrically estimating any history-dependent policy gradient under POMDPs using offline data. The identification enables us to solve a sequence of conditional moment restrictions and adopt the min-max learning procedure with general function approximation for estimating the policy gradient. We then provide a finite-sample non-asymptotic bound for estimating the gradient uniformly over a pre-specified policy class in terms of the sample size, length of horizon, concentrability coefficient, and the measure of ill-posedness in solving the conditional moment restrictions. Lastly, by deploying the proposed gradient estimation in a gradient ascent algorithm, we show the global convergence of the proposed algorithm in finding the history-dependent optimal policy under some technical conditions. To the best of our knowledge, this is the first work studying the policy gradient method for POMDPs in the offline setting.

* 84 pages, 1 figure 
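
The min-max learning step for conditional moment restrictions can be sketched in a much simpler static setting. The toy example below is not the paper's estimator: with linear function classes for both the learner and the adversarial test functions, and an L2 penalty on the adversary, the inner maximization has a closed form and the outer minimization reduces to a least-squares solve; all model choices here are assumptions for illustration.

```python
import numpy as np

# Toy sketch (not the paper's estimator): min-max estimation for a linear
# conditional moment restriction E[Y - theta' psi(X) | Z] = 0, with linear
# adversarial test functions omega' phi(Z). With an L2 penalty on the
# adversary, max_omega omega' m(theta) - lam * ||omega||^2 equals
# ||m(theta)||^2 / (4 * lam), so minimizing over theta is a GMM-style
# least-squares problem in the empirical moments.
rng = np.random.default_rng(1)
n = 2000
Z = rng.normal(size=(n, 2))                              # conditioning variables
X = Z @ np.array([[1.0, 0.5], [0.2, 1.0]]) + 0.3 * rng.normal(size=(n, 2))
theta_true = np.array([2.0, -1.0])
Y = X @ theta_true + 0.5 * rng.normal(size=n)

psi = lambda x: x                                        # learner's features of X
phi = lambda z: np.column_stack([z, z ** 2])             # adversary's test functions of Z

# Empirical moments m(theta) = mean(phi(Z) * (Y - psi(X) theta)); driving them
# to zero in the least-squares sense means solving A theta ~= b.
A = phi(Z).T @ psi(X) / n
b = phi(Z).T @ Y / n
theta_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
print("estimated theta:", theta_hat, "true theta:", theta_true)
```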

Sequential Knockoffs for Variable Selection in Reinforcement Learning

Mar 24, 2023
Tao Ma, Hengrui Cai, Zhengling Qi, Chengchun Shi, Eric B. Laber

In real-world applications of reinforcement learning, it is often challenging to obtain a state representation that is parsimonious and satisfies the Markov property without prior knowledge. Consequently, it is common practice to construct a state which is larger than necessary, e.g., by concatenating measurements over contiguous time points. However, needlessly increasing the dimension of the state can slow learning and obfuscate the learned policy. We introduce the notion of a minimal sufficient state in a Markov decision process (MDP) as the smallest subvector of the original state under which the process remains an MDP and shares the same optimal policy as the original process. We propose a novel sequential knockoffs (SEEK) algorithm that estimates the minimal sufficient state in a system with high-dimensional complex nonlinear dynamics. In large samples, the proposed method controls the false discovery rate and selects all sufficient variables with probability approaching one. As the method is agnostic to the reinforcement learning algorithm being applied, it benefits downstream tasks such as policy optimization. Empirical experiments verify the theoretical results and show that the proposed approach outperforms several competing methods in terms of variable selection accuracy and regret.
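
SEEK builds on the knockoff filter, and its selection step can be sketched once feature importance statistics are available. The snippet below shows only the generic knockoff+ thresholding rule at a target false discovery rate q; the construction of the knockoff variables and of the statistics W_j (where W_j > 0 favors the original variable over its knockoff) is omitted, and the example statistics are simulated.

```python
import numpy as np

# Generic sketch of the knockoff+ selection step (knockoff construction and
# the importance statistics W_j are omitted; W_j > 0 favors the original
# variable over its knockoff).
def knockoff_plus_select(W, q=0.1):
    """Return indices of selected variables at target FDR level q."""
    W = np.asarray(W, dtype=float)
    thresholds = np.sort(np.abs(W[W != 0]))
    tau = np.inf
    for t in thresholds:
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            tau = t
            break
    return np.where(W >= tau)[0]

# Example: strong positive statistics for the first 10 variables, noise after.
rng = np.random.default_rng(2)
W = np.concatenate([rng.uniform(2.0, 4.0, size=10), rng.normal(0.0, 0.5, size=20)])
print("selected variables:", knockoff_plus_select(W, q=0.1))
```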

Personalized Pricing with Invalid Instrumental Variables: Identification, Estimation, and Policy Learning

Feb 24, 2023
Rui Miao, Zhengling Qi, Cong Shi, Lin Lin

Pricing based on individual customer characteristics is widely used to maximize sellers' revenues. This work studies offline personalized pricing under endogeneity using an instrumental variable approach. Standard instrumental variable methods in causal inference/econometrics either focus on a discrete treatment space or require the exclusion restriction that instruments have no direct effect on the outcome, which limits their applicability in personalized pricing. In this paper, we propose a new policy learning method for Personalized pRicing using Invalid iNsTrumental variables (PRINT) for continuous treatments, allowing the instruments to have direct effects on the outcome. Specifically, relying on structural models of revenue and price, we establish the identifiability condition for an optimal pricing strategy under endogeneity with the help of invalid instrumental variables. Based on this new identification result, which leads to solving conditional moment restrictions with generalized residual functions, we construct an adversarial min-max estimator and learn an optimal pricing strategy. Furthermore, we establish an asymptotic regret bound for the learned pricing strategy. Finally, we demonstrate the effectiveness of the proposed method via extensive simulation studies as well as a real data application from a US online auto loan company.
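
Once a revenue or demand model has been estimated, the personalized pricing rule itself is a pointwise maximization over candidate prices. The sketch below uses a made-up logistic demand function purely for illustration; it is not the PRINT estimator, and all names and parameter values are assumptions.

```python
import numpy as np

# Hypothetical sketch: given an estimated purchase-probability function
# d_hat(x, p), a personalized pricing rule picks, for each customer x, the
# price maximizing estimated revenue p * d_hat(x, p) over a grid. The demand
# model below is made up for illustration; it is not the PRINT estimator.
def d_hat(x, p):
    # Logistic purchase probability, decreasing in price, shifted by features.
    return 1.0 / (1.0 + np.exp(-(1.5 + x @ np.array([0.8, -0.4]) - 0.6 * p)))

def personalized_price(x, price_grid):
    revenues = np.array([p * d_hat(x, p) for p in price_grid])
    return price_grid[int(np.argmax(revenues))]

price_grid = np.linspace(0.5, 10.0, 100)
customers = np.random.default_rng(3).normal(size=(5, 2))
print([round(personalized_price(x, price_grid), 2) for x in customers])
```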

PASTA: Pessimistic Assortment Optimization

Feb 08, 2023
Juncheng Dong, Weibin Mo, Zhengling Qi, Cong Shi, Ethan X. Fang, Vahid Tarokh

We consider a class of assortment optimization problems in an offline, data-driven setting. A firm does not know the underlying customer choice model but has access to an offline dataset consisting of the historically offered assortments, customer choices, and revenues. The objective is to use the offline dataset to find an optimal assortment. Due to the combinatorial nature of assortment optimization, the problem of insufficient data coverage is likely to occur in the offline dataset. Therefore, designing a provably efficient offline learning algorithm becomes a significant challenge. To this end, we propose an algorithm, referred to as Pessimistic ASsortment opTimizAtion (PASTA for short), designed based on the principle of pessimism, which can correctly identify the optimal assortment by requiring only that the offline data cover the optimal assortment under general settings. In particular, we establish a regret bound for the offline assortment optimization problem under the celebrated multinomial logit model. We also propose an efficient computational procedure to solve our pessimistic assortment optimization problem. Numerical studies demonstrate the superiority of the proposed method over the existing baseline method.
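
A stylized version of the pessimistic selection rule can be written down directly under the multinomial logit (MNL) model: score each candidate assortment by its estimated expected revenue minus an uncertainty penalty and pick the highest-scoring one. The penalty below, based on how often each item was offered in the offline data, is an assumption made purely for illustration and is not the PASTA construction.

```python
import numpy as np
from itertools import combinations

# Hypothetical sketch of pessimistic assortment selection under an MNL model:
# estimated expected revenue minus an uncertainty penalty that shrinks with
# how often the assortment's items appear in the offline data. The penalty
# form is illustrative only, not the PASTA construction.
def mnl_revenue(S, v_hat, r):
    expv = np.exp(v_hat[S])
    return float(np.sum(r[S] * expv) / (1.0 + np.sum(expv)))

def pessimistic_best_assortment(v_hat, r, counts, max_size=3, c=1.0):
    items = range(len(v_hat))
    best, best_score = None, -np.inf
    for k in range(1, max_size + 1):
        for S in combinations(items, k):
            S = list(S)
            penalty = c * np.sqrt(1.0 / max(1, counts[S].min()))
            score = mnl_revenue(S, v_hat, r) - penalty
            if score > best_score:
                best, best_score = S, score
    return best, best_score

v_hat = np.array([0.5, 0.2, -0.1, 0.4])   # estimated MNL utilities
r = np.array([3.0, 5.0, 4.0, 2.5])        # per-item revenues
counts = np.array([200, 15, 120, 80])     # offline offer counts per item
print(pessimistic_best_assortment(v_hat, r, counts))
```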

STEEL: Singularity-aware Reinforcement Learning

Jan 31, 2023
Xiaohong Chen, Zhengling Qi, Runzhe Wan

Batch reinforcement learning (RL) aims to find an optimal policy in a dynamic environment that maximizes the expected total rewards by leveraging pre-collected data. A fundamental challenge behind this task is the distributional mismatch between the batch data-generating process and the distribution induced by target policies. Nearly all existing algorithms rely on the assumption that the distribution induced by target policies is absolutely continuous with respect to the data distribution, so that the batch data can be used to calibrate target policies via the change of measure. However, the absolute continuity assumption could be violated in practice, especially when the state-action space is large or continuous. In this paper, we propose a new batch RL algorithm that does not require absolute continuity, in the setting of an infinite-horizon Markov decision process with continuous states and actions. We call our algorithm STEEL: SingulariTy-awarE rEinforcement Learning. Our algorithm is motivated by a new error analysis of off-policy evaluation, where we use maximum mean discrepancy, together with distributionally robust optimization, to characterize the error of off-policy evaluation caused by possible singularity and to enable the power of model extrapolation. By leveraging the idea of pessimism and under some mild conditions, we derive a finite-sample regret guarantee for our proposed algorithm without imposing absolute continuity. Compared with existing algorithms, STEEL requires only a minimal data-coverage assumption and thus greatly enhances the applicability and robustness of batch RL. Extensive simulation studies and one real experiment on personalized pricing demonstrate the superior performance of our method when facing possible singularity in batch RL.
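
The error characterization in STEEL is built around the maximum mean discrepancy (MMD). A generic, standalone computation of the (biased, V-statistic) empirical MMD with an RBF kernel between two samples looks like the following; this is a textbook formula rather than the paper's code, and the two samples are simulated placeholders.

```python
import numpy as np

# Generic sketch: biased (V-statistic) empirical maximum mean discrepancy
# between two samples under an RBF kernel. Not the paper's implementation.
def rbf_kernel(A, B, bandwidth=1.0):
    sq_dists = (np.sum(A ** 2, axis=1)[:, None]
                + np.sum(B ** 2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd_squared(X, Y, bandwidth=1.0):
    return (rbf_kernel(X, X, bandwidth).mean()
            + rbf_kernel(Y, Y, bandwidth).mean()
            - 2.0 * rbf_kernel(X, Y, bandwidth).mean())

rng = np.random.default_rng(4)
X = rng.normal(0.0, 1.0, size=(500, 2))   # e.g., batch state-action sample
Y = rng.normal(0.5, 1.0, size=(500, 2))   # e.g., target-policy sample
print("empirical MMD^2:", mmd_squared(X, Y))
```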

Value Enhancement of Reinforcement Learning via Efficient and Robust Trust Region Optimization

Jan 05, 2023
Chengchun Shi, Zhengling Qi, Jianing Wang, Fan Zhou

Reinforcement learning (RL) is a powerful machine learning technique that enables an intelligent agent to learn an optimal policy that maximizes the cumulative rewards in sequential decision making. Most methods in the existing literature are developed in online settings, where the data are easy to collect or simulate. Motivated by high-stakes domains such as mobile health studies with limited and pre-collected data, in this paper we study offline reinforcement learning methods. To efficiently use these datasets for policy optimization, we propose a novel value enhancement method to improve the performance of a given initial policy computed by existing state-of-the-art RL algorithms. Specifically, when the initial policy is not consistent, our method will output a policy whose value is no worse and often better than that of the initial policy. When the initial policy is consistent, under some mild conditions, our method will yield a policy whose value converges to the optimal one at a faster rate than the initial policy, achieving the desired "value enhancement" property. The proposed method is generally applicable to any parametrized policy that belongs to a certain pre-specified function class (e.g., deep neural networks). Extensive numerical studies are conducted to demonstrate the superior performance of our method.

Offline Reinforcement Learning for Human-Guided Human-Machine Interaction with Private Information

Dec 23, 2022
Zuyue Fu, Zhengling Qi, Zhuoran Yang, Zhaoran Wang, Lan Wang

Motivated by human-machine interactions such as training chatbots to improve customer satisfaction, we study human-guided human-machine interaction involving private information. We model this interaction as a two-player turn-based game, where one player (Alice, a human) guides the other player (Bob, a machine) towards a common goal. Specifically, we focus on offline reinforcement learning (RL) in this game, where the goal is to find a policy pair for Alice and Bob that maximizes their expected total rewards based on an offline dataset collected a priori. The offline setting presents two challenges: (i) we cannot collect Bob's private information, which leads to a confounding bias when standard RL methods are applied, and (ii) there is a distributional mismatch between the behavior policy used to collect data and the desired policy we aim to learn. To tackle the confounding bias, we treat Bob's previous action as an instrumental variable for Alice's current decision making so as to adjust for the unmeasured confounding. We develop a novel identification result and use it to propose a new off-policy evaluation (OPE) method for evaluating policy pairs in this two-player turn-based game. To tackle the distributional mismatch, we leverage the idea of pessimism and use our OPE method to develop an off-policy learning algorithm for finding a desirable policy pair for both Alice and Bob. Finally, we prove that under mild assumptions, such as partial coverage of the offline data, the policy pair obtained through our method converges to the optimal one at a satisfactory rate.
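
The confounding adjustment above rests on a classical instrumental-variable argument. A single-step, static analogue (not the paper's OPE estimator) is two-stage least squares with simulated data, where an unobserved confounder biases the naive regression and the instrument recovers the causal effect; all coefficients below are made up.

```python
import numpy as np

# Simplified static illustration of the instrumental-variable idea (not the
# paper's OPE estimator): an unobserved confounder U drives both the action A
# and the reward R, so OLS of R on A is biased; an instrument Z that affects
# A but has no direct effect on R recovers the causal effect via 2SLS.
rng = np.random.default_rng(5)
n = 5000
U = rng.normal(size=n)                       # unobserved confounder
Z = rng.normal(size=n)                       # instrument (e.g., Bob's previous action)
A = 0.8 * Z + 0.9 * U + 0.3 * rng.normal(size=n)
R = 1.5 * A - 1.2 * U + 0.3 * rng.normal(size=n)   # true effect of A on R is 1.5

# Naive OLS slope (biased because U is omitted).
ols = np.sum(A * R) / np.sum(A * A)

# Two-stage least squares: regress A on Z, then R on the fitted values of A.
A_hat = Z * (np.sum(Z * A) / np.sum(Z * Z))
tsls = np.sum(A_hat * R) / np.sum(A_hat * A_hat)
print("OLS slope:", round(ols, 3), " 2SLS slope:", round(tsls, 3))
```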

RISE: Robust Individualized Decision Learning with Sensitive Variables

Nov 12, 2022
Xiaoqing Tan, Zhengling Qi, Christopher W. Seymour, Lu Tang

This paper introduces RISE, a robust individualized decision learning framework with sensitive variables, where sensitive variables are collectible data that are important to the intervention decision but whose inclusion in decision making is prohibited for reasons such as delayed availability or fairness concerns. A naive baseline is to ignore these sensitive variables in learning decision rules, leading to significant uncertainty and bias. To address this, we propose a decision learning framework that incorporates sensitive variables during offline training but does not include them in the input of the learned decision rule at deployment. Specifically, from a causal perspective, the proposed framework aims to improve the worst-case outcomes of individuals caused by sensitive variables that are unavailable at the time of decision. Unlike most existing literature, which uses mean-optimal objectives, we propose a robust learning framework based on a newly defined quantile- or infimum-optimal decision rule. The reliable performance of the proposed method is demonstrated through synthetic experiments and three real-world applications.

* Accepted at NeurIPS 2022 
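
The robust objective can be mimicked in a toy setting: at deployment the sensitive variable is unavailable, so instead of maximizing the mean predicted outcome over its plausible values, one maximizes a low quantile (or the infimum). The outcome model and the grid of sensitive-variable values below are made up for illustration and are not the RISE estimator.

```python
import numpy as np

# Toy sketch of the robust decision idea: the sensitive variable s is
# unavailable at decision time, so each action is scored by a low quantile of
# predicted outcomes across plausible values of s rather than by the mean.
# The outcome model here is made up for illustration.
def predicted_outcome(x, a, s):
    return 1.0 + 1.0 * x * a - 0.8 * s * a      # outcome under action a in {0, 1}

def robust_action(x, s_values, tau=0.1):
    scores = []
    for a in (0, 1):
        outcomes = np.array([predicted_outcome(x, a, s) for s in s_values])
        scores.append(np.quantile(outcomes, tau))   # tau = 0 recovers the infimum
    return int(np.argmax(scores))

s_values = np.linspace(0.0, 1.0, 50)        # plausible sensitive-variable values
for x in (-1.0, 0.0, 2.0):
    print("x =", x, "-> robust action", robust_action(x, s_values))
```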

Optimizing Pessimism in Dynamic Treatment Regimes: A Bayesian Learning Approach

Oct 26, 2022
Yunzhe Zhou, Zhengling Qi, Chengchun Shi, Lexin Li

In this article, we propose a novel pessimism-based Bayesian learning method for optimal dynamic treatment regimes in the offline setting. When the coverage condition does not hold, which is common for offline data, existing solutions would produce sub-optimal policies. The pessimism principle addresses this issue by discouraging recommendation of actions that are less explored conditional on the state. However, nearly all pessimism-based methods rely on a key hyper-parameter that quantifies the degree of pessimism, and the performance of these methods can be highly sensitive to the choice of this parameter. We propose to integrate the pessimism principle with Thompson sampling and Bayesian machine learning to optimize the degree of pessimism. We derive a credible set whose boundary uniformly lower bounds the optimal Q-function, and thus no additional tuning of the degree of pessimism is required. We develop a general Bayesian learning method that works with a range of models, from Bayesian linear basis models to Bayesian neural networks. We develop a computational algorithm based on variational inference, which is highly efficient and scalable. We establish the theoretical guarantees of the proposed method and show empirically that it outperforms existing state-of-the-art solutions through both simulations and a real data example.

* 42 pages, 6 figures 
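
The credible-set lower bound can be illustrated schematically with a conjugate Bayesian linear model of the Q-function: draw posterior samples of the weights, take a pointwise lower quantile of the sampled Q-values as the pessimistic Q, and act greedily with respect to it. This is a simplified sketch under a linear-basis assumption, not the paper's variational algorithm; all names and values are illustrative.

```python
import numpy as np

# Schematic sketch (not the paper's algorithm): pessimism via a conjugate
# Bayesian linear model of the Q-function. Draw posterior samples of the
# weights, take a pointwise lower quantile of the sampled Q-values as the
# pessimistic Q, and act greedily with respect to it.
rng = np.random.default_rng(6)
n, d, sigma2, tau2 = 500, 4, 0.25, 1.0

Phi = rng.normal(size=(n, d))                 # features phi(state, action)
w_true = rng.normal(size=d)
q_targets = Phi @ w_true + np.sqrt(sigma2) * rng.normal(size=n)

# Posterior for Bayesian linear regression with known noise variance sigma2
# and a N(0, tau2 * I) prior on the weights: N(mu, Sigma).
Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(d) / tau2)
mu = Sigma @ Phi.T @ q_targets / sigma2

def pessimistic_q(phi_candidates, n_draws=500, alpha=0.05):
    draws = rng.multivariate_normal(mu, Sigma, size=n_draws)  # posterior samples
    q_samples = draws @ phi_candidates.T                      # n_draws x n_actions
    return np.quantile(q_samples, alpha, axis=0)              # lower credible bound

phi_candidates = rng.normal(size=(2, d))      # phi(s, a) for candidate actions 0 and 1
print("pessimistic action:", int(np.argmax(pessimistic_q(phi_candidates))))
```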