Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

We consider the problems of estimation and optimization of utility-based shortfall risk (UBSR), which is a popular risk measure in finance. In the context of UBSR estimation, we derive a non-asymptotic bound on the mean-squared error of the classical sample average approximation (SAA) of UBSR. Next, in the context of UBSR optimization, we derive an expression for the UBSR gradient under a smooth parameterization. This expression is a ratio of expectations, both of which involve the UBSR. We use SAA for the numerator as well as denominator in the UBSR gradient expression to arrive at a biased gradient estimator. We derive non-asymptotic bounds on the estimation error, which show that our gradient estimator is asymptotically unbiased. We incorporate the aforementioned gradient estimator into a stochastic gradient (SG) algorithm for UBSR optimization. Finally, we derive non-asymptotic bounds that quantify the rate of convergence of our SG algorithm for UBSR optimization.

Via

We tackle the problem of estimating the Value-at-Risk (VaR) and the Conditional Value-at-Risk (CVaR) of the infinite-horizon discounted cost within a Markov cost process. First, we derive a minimax lower bound of $\Omega(1/\sqrt{n})$ that holds both in an expected and in a probabilistic sense. Then, using a finite-horizon truncation scheme, we derive an upper bound for the error in CVaR estimation, which matches our lower bound up to constant factors. Finally, we discuss an extension of our estimation scheme that covers more general risk measures satisfying a certain continuity criterion, e.g., spectral risk measures, utility-based shortfall risk. To the best of our knowledge, our work is the first to provide lower and upper bounds on the estimation error for any risk measure within Markovian settings. We remark that our lower bounds also extend to the infinite-horizon discounted costs' mean. Even in that case, our result $\Omega(1/\sqrt{n}) $ improves upon the existing result $\Omega(1/n)$[13].

Via

We consider the problem of control in the setting of reinforcement learning (RL), where model information is not available. Policy gradient algorithms are a popular solution approach for this problem and are usually shown to converge to a stationary point of the value function. In this paper, we propose two policy Newton algorithms that incorporate cubic regularization. Both algorithms employ the likelihood ratio method to form estimates of the gradient and Hessian of the value function using sample trajectories. The first algorithm requires an exact solution of the cubic regularized problem in each iteration, while the second algorithm employs an efficient gradient descent-based approximation to the cubic regularized problem. We establish convergence of our proposed algorithms to a second-order stationary point (SOSP) of the value function, which results in the avoidance of traps in the form of saddle points. In particular, the sample complexity of our algorithms to find an $\epsilon$-SOSP is $O(\epsilon^{-3.5})$, which is an improvement over the state-of-the-art sample complexity of $O(\epsilon^{-4.5})$.

Via

We study the finite-time behaviour of the popular temporal difference (TD) learning algorithm when combined with tail-averaging. We derive finite time bounds on the parameter error of the tail-averaged TD iterate under a step-size choice that does not require information about the eigenvalues of the matrix underlying the projected TD fixed point. Our analysis shows that tail-averaged TD converges at the optimal $O\left(1/t\right)$ rate, both in expectation and with high probability. In addition, our bounds exhibit a sharper rate of decay for the initial error (bias), which is an improvement over averaging all iterates. We also propose and analyse a variant of TD that incorporates regularisation. From analysis, we conclude that the regularised version of TD is useful for problems with ill-conditioned features.

Via

In this paper, we present a stochastic gradient algorithm for minimizing a smooth objective function that is an expectation over noisy cost samples and only the latter are observed for any given parameter. Our algorithm employs a gradient estimation scheme with random perturbations, which are formed using the truncated Cauchy distribution from the unit sphere. We analyze the bias and variance of the proposed gradient estimator. Our algorithm is found to be particularly useful in the case when the objective function is non-convex, and the parameter dimension is high. From an asymptotic convergence analysis, we establish that our algorithm converges almost surely to the set of stationary points of the objective function and obtain the asymptotic convergence rate. We also show that our algorithm avoids unstable equilibria, implying convergence to local minima. Further, we perform a non-asymptotic convergence analysis of our algorithm. In particular, we establish here a non-asymptotic bound for finding an $\epsilon$-stationary point of the non-convex objective function. Finally, we demonstrate numerically through simulations that the performance of our algorithm outperforms GSF, SPSA and RDSA by a significant margin over a few non-convex settings and further validate its performance over convex (noisy) objectives.

Via

In several applications such as clinical trials and financial portfolio optimization, the expected value (or the average reward) does not satisfactorily capture the merits of a drug or a portfolio. In such applications, risk plays a crucial role, and a risk-aware performance measure is preferable, so as to capture losses in the case of adverse events. This survey aims to consolidate and summarise the existing research on risk measures, specifically in the context of multi-armed bandits. We review various risk measures of interest, and comment on their properties. Next, we review existing concentration inequalities for various risk measures. Then, we proceed to defining risk-aware bandit problems, We consider algorithms for the regret minimization setting, where the exploration-exploitation trade-off manifests, as well as the best-arm identification setting, which is a pure exploration problem -- both in the context of risk-sensitive measures. We conclude by commenting on persisting challenges and fertile areas for future research.

Via

We consider the problem of sequentially learning to estimate, in the mean squared error (MSE) sense, a Gaussian $K$-vector of unknown covariance by observing only $m < K$ of its entries in each round. This reduces to learning an optimal subset for estimating the entire vector. Towards this, we first establish an exponential concentration bound for an estimate of the MSE for each observable subset. We then frame the estimation problem with bandit feedback in the best-subset identification setting. We propose a variant of the successive elimination algorithm to cater to the adaptive estimation problem, and we derive an upper bound on the sample complexity of this algorithm. In addition, to understand the fundamental limit on the sample complexity of this adaptive estimation bandit problem, we derive a minimax lower bound.

Via

Utility-Based Shortfall Risk (UBSR) is a risk metric that is increasingly popular in financial applications, owing to certain desirable properties that it enjoys. We consider the problem of estimating UBSR in a recursive setting, where samples from the underlying loss distribution are available one-at-a-time. We cast the UBSR estimation problem as a root finding problem, and propose stochastic approximation-based estimations schemes. We derive non-asymptotic bounds on the estimation error in the number of samples. We also consider the problem of UBSR optimization within a parameterized class of random variables. We propose a stochastic gradient descent based algorithm for UBSR optimization, and derive non-asymptotic bounds on its convergence.

Via

We consider the problem of estimating a spectral risk measure (SRM) from i.i.d. samples, and propose a novel method that is based on numerical integration. We show that our SRM estimate concentrates exponentially, when the underlying distribution has bounded support. Further, we also consider the case when the underlying distribution is either Gaussian or exponential, and derive a concentration bound for our estimation scheme. We validate the theoretical findings on a synthetic setup, and in a vehicular traffic routing application.

Via

The classic objective in a reinforcement learning (RL) problem is to find a policy that minimizes, in expectation, a long-run objective such as the infinite-horizon discounted or long-run average cost. In many practical applications, optimizing the expected value alone is not sufficient, and it may be necessary to include a risk measure in the optimization process, either as the objective or as a constraint. Various risk measures have been proposed in the literature, e.g., mean-variance tradeoff, exponential utility, the percentile performance, value at risk, conditional value at risk, prospect theory and its later enhancement, cumulative prospect theory. In this article, we focus on the combination of risk criteria and reinforcement learning in a constrained optimization framework, i.e., a setting where the goal to find a policy that optimizes the usual objective of infinite-horizon discounted/average cost, while ensuring that an explicit risk constraint is satisfied. We introduce the risk-constrained RL framework, cover popular risk measures based on variance, conditional value-at-risk and cumulative prospect theory, and present a template for a risk-sensitive RL algorithm. We survey some of our recent work on this topic, covering problems encompassing discounted cost, average cost, and stochastic shortest path settings, together with the aforementioned risk measures in a constrained framework. This non-exhaustive survey is aimed at giving a flavor of the challenges involved in solving a risk-sensitive RL problem, and outlining some potential future research directions.

Via