Abstract:We study reinforcement learning for revenue management with delayed feedback, where a substantial fraction of value is determined by customer cancellations and modifications observed days after booking. We propose \emph{choice-model-assisted RL}: a calibrated discrete choice model is used as a fixed partial world model to impute the delayed component of the learning target at decision time. In the fixed-model deployment regime, we prove that tabular Q-learning with model-imputed targets converges to an $O(\varepsilon/(1-γ))$ neighborhood of the optimal Q-function, where $\varepsilon$ summarizes partial-model error, with an additional $O(t^{-1/2})$ sampling term. Experiments in a simulator calibrated from 61{,}619 hotel bookings (1{,}088 independent runs) show: (i) no statistically detectable difference from a maturity-buffer DQN baseline in stationary settings; (ii) positive effects under in-family parameter shifts, with significant gains in 5 of 10 shift scenarios after Holm--Bonferroni correction (up to 12.4\%); and (iii) consistent degradation under structural misspecification, where the choice model assumptions are violated (1.4--2.6\% lower revenue). These results characterize when partial behavioral models improve robustness under shift and when they introduce harmful bias.




Abstract:We study a type of Online Linear Programming (OLP) problem that maximizes the objective function with stochastic inputs. The performance of various algorithms that analyze this type of OLP is well studied when the stochastic inputs follow some i.i.d distribution. The two central questions to ask are: (i) can the algorithms achieve the same efficiency if the stochastic inputs are not i.i.d but still stationary, and (ii) how can we modify our algorithms if we know the stochastic inputs are trendy, hence not stationary. We answer the first question by analyzing a regenerative type of input and show the regrets of two popular algorithms are bounded by the same orders as their i.i.d counterparts. We discuss the second question in the context of linearly growing inputs and propose a trend-adaptive algorithm. We provide numerical simulations to illustrate the performance of our algorithms under both regenerative and trendy inputs.