Vivek S. Borkar

A dynamical systems view of training generative models and the memorization phenomenon

May 19, 2026

Siva Athreya, Chiranjib Bhattacharya, Vivek S. Borkar

Abstract: Using recent works of one of the authors (VSB) on collapse in generative models and two time scale dynamics in stochastic gradient descent in high dimensions, we give a system theoretic explanation of the memorization phenomenon in generative models. This relies purely on the dynamic aspects of the training phase. Specifically, we use a result of Austin [2016] to motivate a stylized model for the loss function for stochastic gradient descent (SGD) wherein the loss function has a strong dependence on some variables and weak dependence on the rest in a precise sense. This naturally leads to two distinct time scales in the constant step size SGD that is commonly used in machine learning. This fact has been used to explain the double descent phenomenon in SGD in Borkar [2026]. In conjunction with a mathematical model for the collapse phenomenon in SGD developed in Borkar [2025a], we analyze the constant step size SGD using the recent results of Azizian et al. [2024] in order to explain the phenomenon of memorization wherein a generative model that is concurrently being tuned yields the same or similar outputs for significant stretches of time. This gives a novel perspective on the aforementioned phenomena reported in the machine learning literature and their interrelationships, using a dynamical systems viewpoint.

* 12 pages

Regret and Sample Complexity of Online Q-Learning via Concentration of Stochastic Approximation with Time-Inhomogeneous Markov Chains

Feb 18, 2026

Rahul Singh, Siddharth Chandak, Eric Moulines, Vivek S. Borkar, Nicholas Bambos

Abstract: We present the first high-probability regret bound for classical online Q-learning in infinite-horizon discounted Markov decision processes, without relying on optimism or bonus terms. We first analyze Boltzmann Q-learning with decaying temperature and show that its regret depends critically on the suboptimality gap of the MDP: for sufficiently large gaps, the regret is sublinear, while for small gaps it deteriorates and can approach linear growth. To address this limitation, we study a Smoothed $ε_n$-Greedy exploration scheme that combines $ε_n$-greedy and Boltzmann exploration, for which we prove a gap-robust regret bound of near-$\tilde{O}(N^{9/10})$. To analyze these algorithms, we develop a high-probability concentration bound for contractive Markovian stochastic approximation with iterate- and time-dependent transition dynamics. This bound may be of independent interest as the contraction factor in our bound is governed by the mixing time and is allowed to converge to one asymptotically.
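
For readers unfamiliar with Boltzmann exploration, the following is a minimal sketch of the generic textbook rule, in which an action is chosen with probability proportional to exp(Q/T). It is not taken from the paper: the function names and the logarithmic temperature schedule are assumptions made purely for illustration, and the paper's own schedule and its Smoothed $ε_n$-Greedy scheme are not reproduced here.

```python
import math


def boltzmann_probs(q_values, temperature):
    """Return action probabilities proportional to exp(Q / temperature)."""
    # Subtract the maximum before exponentiating for numerical stability;
    # this does not change the resulting probabilities.
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def temperature_schedule(n, t0=1.0):
    """Hypothetical decaying temperature (an assumption, not the paper's)."""
    # Strictly positive and decreasing in the step index n >= 0.
    return t0 / math.log(n + 2)
```

As the temperature decays, the distribution concentrates on the actions with the largest Q-values, so exploration of apparently suboptimal actions becomes rarer over time.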

An Actor-Critic Algorithm with Function Approximation for Risk Sensitive Cost Markov Decision Processes

Feb 17, 2025

Soumyajit Guin, Vivek S. Borkar, Shalabh Bhatnagar

Abstract: In this paper, we consider the risk-sensitive cost criterion with exponentiated costs for Markov decision processes and develop a model-free policy gradient algorithm in this setting. Unlike additive cost criteria such as average or discounted cost, the risk-sensitive cost criterion is less studied due to the complexity resulting from the multiplicative structure of the resulting Bellman equation. We develop an actor-critic algorithm with function approximation in this setting and provide its asymptotic convergence analysis. We also show the results of numerical experiments that demonstrate the superiority in performance of our algorithm over other recent algorithms in the literature.

Lagrangian Index Policy for Restless Bandits with Average Reward

Dec 17, 2024

Konstantin Avrachenkov, Vivek S. Borkar, Pratik Shah

Abstract: We study the Lagrangian Index Policy (LIP) for restless multi-armed bandits with long-run average reward. In particular, we compare the performance of LIP with the performance of the Whittle Index Policy (WIP), both heuristic policies known to be asymptotically optimal under certain natural conditions. Even though in most cases their performances are very similar, in the cases when WIP shows bad performance, LIP continues to perform very well. We then propose reinforcement learning algorithms, both tabular and NN-based, to obtain online learning schemes for LIP in the model-free setting. The proposed reinforcement learning schemes for LIP require significantly less memory than the analogous scheme for WIP. We calculate analytically the Lagrangian index for the restart model, which describes the optimal web crawling and the minimization of the weighted age of information. We also give a new proof of asymptotic optimality in the case of homogeneous bandits as the number of arms goes to infinity, based on exchangeability and de Finetti's theorem.

A Concentration Bound for TD with Function Approximation

Dec 16, 2023

Siddharth Chandak, Vivek S. Borkar

Abstract: We derive a concentration bound of the type "for all $n \geq n_0$ for some $n_0$" for TD(0) with linear function approximation. We work with online TD learning with samples from a single sample path of the underlying Markov chain. This makes our analysis significantly different from offline TD learning or TD learning with access to independent samples from the stationary distribution of the Markov chain. We treat TD(0) as a contractive stochastic approximation algorithm, with both martingale and Markov noises. Markov noise is handled using the Poisson equation and the lack of almost sure guarantees on boundedness of iterates is handled using the concept of relaxed concentration inequalities.

* Submitted to Stochastic Systems
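
As background for this entry, the sketch below shows the standard TD(0) update with linear function approximation, run online along a single sample path of a Markov chain. It illustrates only the textbook algorithm the abstract refers to, not the paper's concentration analysis. The function name, the argument layout, the constant step size, and the fixed starting state 0 are all assumptions made for illustration.

```python
import random


def td0_linear(features, transition, reward, gamma, steps, step_size, seed=0):
    """Run TD(0) with linear value estimates along one sample path.

    features[s]   : feature vector of state s (list of floats)
    transition[s] : row of next-state probabilities from state s
    reward[s]     : reward collected on leaving state s
    """
    rng = random.Random(seed)
    num_states = len(transition)
    theta = [0.0] * len(features[0])
    state = 0  # assumed initial state
    for _ in range(steps):
        # Sample the next state of the chain from the current row.
        next_state = rng.choices(range(num_states), weights=transition[state])[0]
        # Linear value estimates V(s) = theta . phi(s).
        v = sum(t * f for t, f in zip(theta, features[state]))
        v_next = sum(t * f for t, f in zip(theta, features[next_state]))
        # Temporal-difference error, then a step along phi(state).
        delta = reward[state] + gamma * v_next - v
        theta = [t + step_size * delta * f for t, f in zip(theta, features[state])]
        state = next_state
    return theta
```

Because every update uses the next state actually visited on the same trajectory, consecutive samples are correlated; this is the Markov noise that the abstract contrasts with independent sampling from the stationary distribution.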

Approximation of Convex Envelope Using Reinforcement Learning

Nov 24, 2023

Vivek S. Borkar, Adit Akarsh

Abstract: Oberman gave a stochastic control formulation of the problem of estimating the convex envelope of a non-convex function. Based on this, we develop a reinforcement learning scheme to approximate the convex envelope, using a variant of Q-learning for controlled optimal stopping. The scheme shows very promising results on a standard library of test problems.

Decentralised Q-Learning for Multi-Agent Markov Decision Processes with a Satisfiability Criterion

Nov 21, 2023

Keshav P. Keval, Vivek S. Borkar

Abstract: In this paper, we propose a reinforcement learning algorithm to solve a multi-agent Markov decision process (MMDP). The goal, inspired by Blackwell's Approachability Theorem, is to lower the time average cost of each agent to below a pre-specified agent-specific bound. For the MMDP, we assume the state dynamics to be controlled by the joint actions of agents, but the per-stage costs to only depend on the individual agent's actions. We combine the Q-learning algorithm for a weighted combination of the costs of each agent, obtained by a gossip algorithm with the Metropolis-Hastings or Multiplicative Weights formalisms to modulate the averaging matrix of the gossip. We use multiple timescales in our algorithm and prove that under mild conditions, it approximately achieves the desired bounds for each of the agents. We also demonstrate the empirical performance of this algorithm in the more general setting of MMDPs having jointly controlled per-stage costs.

Actor-Critic or Critic-Actor? A Tale of Two Time Scales

Oct 10, 2022

Shalabh Bhatnagar, Vivek S. Borkar, Soumyajit Guin

Abstract: We revisit the standard formulation of the tabular actor-critic algorithm as a two time-scale stochastic approximation with the value function computed on a faster time-scale and the policy computed on a slower time-scale. This emulates policy iteration. We begin by observing that reversal of the time scales will in fact emulate value iteration and is a legitimate algorithm. We compare the two empirically with and without function approximation (with both linear and nonlinear function approximators) and observe that our proposed critic-actor algorithm performs better empirically, though with a marginal increase in the computational cost.

A Concentration Bound for LSPE($λ$)

Nov 04, 2021

Vivek S. Borkar, Siddharth Chandak, Harsh Dolhare

Abstract: The popular LSPE($\lambda$) algorithm for policy evaluation is revisited to derive a concentration bound that gives high probability performance guarantees from some time on.

* 12 pages, submitted to JMLR

Concentration of Contractive Stochastic Approximation and Reinforcement Learning

Jun 27, 2021

Siddharth Chandak, Vivek S. Borkar

Abstract: Using a martingale concentration inequality, concentration bounds "from time $n_0$ on" are derived for stochastic approximation algorithms with contractive maps and both martingale difference and Markov noises. These are applied to reinforcement learning algorithms, in particular to asynchronous Q-learning and TD(0).

* 15 pages, Submitted to Stochastic Systems
