Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jayakumar Subramanian

Learning to Clarify by Reinforcement Learning Through Reward-Weighted Fine-Tuning

Jun 08, 2025

Subhojyoti Mukherjee, Viet Dac Lai, Raghavendra Addanki, Ryan Rossi, Seunghyun Yoon, Trung Bui, Anup Rao, Jayakumar Subramanian, Branislav Kveton

Abstract:Question answering (QA) agents automatically answer questions posed in natural language. In this work, we learn to ask clarifying questions in QA agents. The key idea in our method is to simulate conversations that contain clarifying questions and learn from them using reinforcement learning (RL). To make RL practical, we propose and analyze offline RL objectives that can be viewed as reward-weighted supervised fine-tuning (SFT) and easily optimized in large language models. Our work stands in a stark contrast to recently proposed methods, based on SFT and direct preference optimization, which have additional hyper-parameters and do not directly optimize rewards. We compare to these methods empirically and report gains in both optimized rewards and language quality.

* 39 pages

Via

Access Paper or Ask Questions

Behavior Optimized Image Generation

Nov 18, 2023

Varun Khurana, Yaman K Singla, Jayakumar Subramanian, Rajiv Ratn Shah, Changyou Chen, Zhiqiang Xu, Balaji Krishnamurthy

Figure 1 for Behavior Optimized Image Generation

Figure 2 for Behavior Optimized Image Generation

Figure 3 for Behavior Optimized Image Generation

Figure 4 for Behavior Optimized Image Generation

Abstract:The last few years have witnessed great success on image generation, which has crossed the acceptance thresholds of aesthetics, making it directly applicable to personal and commercial applications. However, images, especially in marketing and advertising applications, are often created as a means to an end as opposed to just aesthetic concerns. The goal can be increasing sales, getting more clicks, likes, or image sales (in the case of stock businesses). Therefore, the generated images need to perform well on these key performance indicators (KPIs), in addition to being aesthetically good. In this paper, we make the first endeavor to answer the question of "How can one infuse the knowledge of the end-goal within the image generation process itself to create not just better-looking images but also "better-performing'' images?''. We propose BoigLLM, an LLM that understands both image content and user behavior. BoigLLM knows how an image should look to get a certain required KPI. We show that BoigLLM outperforms 13x larger models such as GPT-3.5 and GPT-4 in this task, demonstrating that while these state-of-the-art models can understand images, they lack information on how these images perform in the real world. To generate actual pixels of behavior-conditioned images, we train a diffusion-based model (BoigSD) to align with a proposed BoigLLM-defined reward. We show the performance of the overall pipeline on two datasets covering two different behaviors: a stock dataset with the number of forward actions as the KPI and a dataset containing tweets with the total likes as the KPI, denoted as BoigBench. To advance research in the direction of utility-driven image generation and understanding, we release BoigBench, a benchmark dataset containing 168 million enterprise tweets with their media, brand account names, time of post, and total likes.

Via

Access Paper or Ask Questions

Counterfactual Explanation Policies in RL

Jul 25, 2023

Shripad V. Deshmukh, Srivatsan R, Supriti Vijay, Jayakumar Subramanian, Chirag Agarwal

Abstract:As Reinforcement Learning (RL) agents are increasingly employed in diverse decision-making problems using reward preferences, it becomes important to ensure that policies learned by these frameworks in mapping observations to a probability distribution of the possible actions are explainable. However, there is little to no work in the systematic understanding of these complex policies in a contrastive manner, i.e., what minimal changes to the policy would improve/worsen its performance to a desired level. In this work, we present COUNTERPOL, the first framework to analyze RL policies using counterfactual explanations in the form of minimal changes to the policy that lead to the desired outcome. We do so by incorporating counterfactuals in supervised learning in RL with the target outcome regulated using desired return. We establish a theoretical connection between Counterpol and widely used trust region-based policy optimization methods in RL. Extensive empirical analysis shows the efficacy of COUNTERPOL in generating explanations for (un)learning skills while keeping close to the original policy. Our results on five different RL environments with diverse state and action spaces demonstrate the utility of counterfactual explanations, paving the way for new frontiers in designing and developing counterfactual policies.

* ICML Workshop on Counterfactuals in Minds and Machines, 2023

Via

Access Paper or Ask Questions

SARC: Soft Actor Retrospective Critic

Jun 28, 2023

Sukriti Verma, Ayush Chopra, Jayakumar Subramanian, Mausoom Sarkar, Nikaash Puri, Piyush Gupta, Balaji Krishnamurthy

Figure 1 for SARC: Soft Actor Retrospective Critic

Figure 2 for SARC: Soft Actor Retrospective Critic

Figure 3 for SARC: Soft Actor Retrospective Critic

Figure 4 for SARC: Soft Actor Retrospective Critic

Abstract:The two-time scale nature of SAC, which is an actor-critic algorithm, is characterised by the fact that the critic estimate has not converged for the actor at any given time, but since the critic learns faster than the actor, it ensures eventual consistency between the two. Various strategies have been introduced in literature to learn better gradient estimates to help achieve better convergence. Since gradient estimates depend upon the critic, we posit that improving the critic can provide a better gradient estimate for the actor at each time. Utilizing this, we propose Soft Actor Retrospective Critic (SARC), where we augment the SAC critic loss with another loss term - retrospective loss - leading to faster critic convergence and consequently, better policy gradient estimates for the actor. An existing implementation of SAC can be easily adapted to SARC with minimal modifications. Through extensive experimentation and analysis, we show that SARC provides consistent improvement over SAC on benchmark environments. We plan to open-source the code and all experiment data at: https://github.com/sukritiverma1996/SARC.

* Accepted at RLDM 2022

Via

Access Paper or Ask Questions

Explaining RL Decisions with Trajectories

May 06, 2023

Shripad Vilasrao Deshmukh, Arpan Dasgupta, Balaji Krishnamurthy, Nan Jiang, Chirag Agarwal, Georgios Theocharous, Jayakumar Subramanian

Figure 1 for Explaining RL Decisions with Trajectories

Figure 2 for Explaining RL Decisions with Trajectories

Figure 3 for Explaining RL Decisions with Trajectories

Figure 4 for Explaining RL Decisions with Trajectories

Abstract:Explanation is a key component for the adoption of reinforcement learning (RL) in many real-world decision-making problems. In the literature, the explanation is often provided by saliency attribution to the features of the RL agent's state. In this work, we propose a complementary approach to these explanations, particularly for offline RL, where we attribute the policy decisions of a trained RL agent to the trajectories encountered by it during training. To do so, we encode trajectories in offline training data individually as well as collectively (encoding a set of trajectories). We then attribute policy decisions to a set of trajectories in this encoded space by estimating the sensitivity of the decision with respect to that set. Further, we demonstrate the effectiveness of the proposed approach in terms of quality of attributions as well as practical scalability in diverse environments that involve both discrete and continuous state and action spaces such as grid-worlds, video games (Atari) and continuous control (MuJoCo). We also conduct a human study on a simple navigation task to observe how their understanding of the task compares with data attributed for a trained RL policy. Keywords -- Explainable AI, Verifiability of AI Decisions, Explainable RL.

* Published at International Conference on Learning Representations (ICLR), 2023

Via

Access Paper or Ask Questions

Differentiable Agent-based Epidemiology

Jul 20, 2022

Ayush Chopra, Alexander Rodríguez, Jayakumar Subramanian, Balaji Krishnamurthy, B. Aditya Prakash, Ramesh Raskar

Figure 1 for Differentiable Agent-based Epidemiology

Figure 2 for Differentiable Agent-based Epidemiology

Figure 3 for Differentiable Agent-based Epidemiology

Figure 4 for Differentiable Agent-based Epidemiology

Abstract:Mechanistic simulators are an indispensable tool for epidemiology to explore the behavior of complex, dynamic infections under varying conditions and navigate uncertain environments. ODE-based models are the dominant paradigm that enable fast simulations and are tractable to gradient-based optimization, but make simplifying assumptions about population homogeneity. Agent-based models (ABMs) are an increasingly popular alternative paradigm that can represent the heterogeneity of contact interactions with granular detail and agency of individual behavior. However, conventional ABM frameworks are not differentiable and present challenges in scalability; due to which it is non-trivial to connect them to auxiliary data sources easily. In this paper we introduce GradABM which is a new scalable, fast and differentiable design for ABMs. GradABM runs simulations in few seconds on commodity hardware and enables fast forward and differentiable inverse simulations. This makes it amenable to be merged with deep neural networks and seamlessly integrate heterogeneous data sources to help with calibration, forecasting and policy evaluation. We demonstrate the efficacy of GradABM via extensive experiments with real COVID-19 and influenza datasets. We are optimistic this work will bring ABM and AI communities closer together.

* 14 pages, 5 figures

Via

Access Paper or Ask Questions

DeepABM: Scalable, efficient and differentiable agent-based simulations via graph neural networks

Oct 09, 2021

Ayush Chopra, Esma Gel, Jayakumar Subramanian, Balaji Krishnamurthy, Santiago Romero-Brufau, Kalyan S. Pasupathy, Thomas C. Kingsley, Ramesh Raskar

Figure 1 for DeepABM: Scalable, efficient and differentiable agent-based simulations via graph neural networks

Figure 2 for DeepABM: Scalable, efficient and differentiable agent-based simulations via graph neural networks

Figure 3 for DeepABM: Scalable, efficient and differentiable agent-based simulations via graph neural networks

Figure 4 for DeepABM: Scalable, efficient and differentiable agent-based simulations via graph neural networks

Abstract:We introduce DeepABM, a framework for agent-based modeling that leverages geometric message passing of graph neural networks for simulating action and interactions over large agent populations. Using DeepABM allows scaling simulations to large agent populations in real-time and running them efficiently on GPU architectures. To demonstrate the effectiveness of DeepABM, we build DeepABM-COVID simulator to provide support for various non-pharmaceutical interventions (quarantine, exposure notification, vaccination, testing) for the COVID-19 pandemic, and can scale to populations of representative size in real-time on a GPU. Specifically, DeepABM-COVID can model 200 million interactions (over 100,000 agents across 180 time-steps) in 90 seconds, and is made available online to help researchers with modeling and analysis of various interventions. We explain various components of the framework and discuss results from one research study to evaluate the impact of delaying the second dose of the COVID-19 vaccine in collaboration with clinical and public health experts. While we simulate COVID-19 spread, the ideas introduced in the paper are generic and can be easily extend to other forms of agent-based simulations. Furthermore, while beyond scope of this document, DeepABM enables inverse agent-based simulations which can be used to learn physical parameters in the (micro) simulations using gradient-based optimization with large-scale real-world (macro) data. We are optimistic that the current work can have interesting implications for bringing ABM and AI communities closer.

* Accepted at Winter Simulation Conference 2021

Via

Access Paper or Ask Questions

Medical Dead-ends and Learning to Identify High-risk States and Treatments

Oct 08, 2021

Mehdi Fatemi, Taylor W. Killian, Jayakumar Subramanian, Marzyeh Ghassemi

Figure 1 for Medical Dead-ends and Learning to Identify High-risk States and Treatments

Figure 2 for Medical Dead-ends and Learning to Identify High-risk States and Treatments

Figure 3 for Medical Dead-ends and Learning to Identify High-risk States and Treatments

Figure 4 for Medical Dead-ends and Learning to Identify High-risk States and Treatments

Abstract:Machine learning has successfully framed many sequential decision making problems as either supervised prediction, or optimal decision-making policy identification via reinforcement learning. In data-constrained offline settings, both approaches may fail as they assume fully optimal behavior or rely on exploring alternatives that may not exist. We introduce an inherently different approach that identifies possible ``dead-ends'' of a state space. We focus on the condition of patients in the intensive care unit, where a ``medical dead-end'' indicates that a patient will expire, regardless of all potential future treatment sequences. We postulate ``treatment security'' as avoiding treatments with probability proportional to their chance of leading to dead-ends, present a formal proof, and frame discovery as an RL problem. We then train three independent deep neural models for automated state construction, dead-end discovery and confirmation. Our empirical results discover that dead-ends exist in real clinical data among septic patients, and further reveal gaps between secure treatments and those that were administered.

Via

Access Paper or Ask Questions

An Empirical Study of Representation Learning for Reinforcement Learning in Healthcare

Nov 23, 2020

Taylor W. Killian, Haoran Zhang, Jayakumar Subramanian, Mehdi Fatemi, Marzyeh Ghassemi

Figure 1 for An Empirical Study of Representation Learning for Reinforcement Learning in Healthcare

Figure 2 for An Empirical Study of Representation Learning for Reinforcement Learning in Healthcare

Figure 3 for An Empirical Study of Representation Learning for Reinforcement Learning in Healthcare

Figure 4 for An Empirical Study of Representation Learning for Reinforcement Learning in Healthcare

Abstract:Reinforcement Learning (RL) has recently been applied to sequential estimation and prediction problems identifying and developing hypothetical treatment strategies for septic patients, with a particular focus on offline learning with observational data. In practice, successful RL relies on informative latent states derived from sequential observations to develop optimal treatment strategies. To date, how best to construct such states in a healthcare setting is an open question. In this paper, we perform an empirical study of several information encoding architectures using data from septic patients in the MIMIC-III dataset to form representations of a patient state. We evaluate the impact of representation dimension, correlations with established acuity scores, and the treatment policies derived from them. We find that sequentially formed state representations facilitate effective policy learning in batch settings, validating a more thoughtful approach to representation learning that remains faithful to the sequential and partial nature of healthcare data.

* To appear in proceedings of the 2020 Machine Learning for Health workshop at NeurIPS

Via

Access Paper or Ask Questions

Approximate information state for approximate planning and reinforcement learning in partially observed systems

Oct 17, 2020

Jayakumar Subramanian, Amit Sinha, Raihan Seraj, Aditya Mahajan

Figure 1 for Approximate information state for approximate planning and reinforcement learning in partially observed systems

Figure 2 for Approximate information state for approximate planning and reinforcement learning in partially observed systems

Figure 3 for Approximate information state for approximate planning and reinforcement learning in partially observed systems

Figure 4 for Approximate information state for approximate planning and reinforcement learning in partially observed systems

Abstract:We propose a theoretical framework for approximate planning and learning in partially observed systems. Our framework is based on the fundamental notion of information state. We provide two equivalent definitions of information state---i) a function of history which is sufficient to compute the expected reward and predict its next value; ii) equivalently, a function of the history which can be recursively updated and is sufficient to compute the expected reward and predict the next observation. An information state always leads to a dynamic programming decomposition. Our key result is to show that if a function of the history (called approximate information state (AIS)) approximately satisfies the properties of the information state, then there is a corresponding approximate dynamic program. We show that the policy computed using this is approximately optimal with bounded loss of optimality. We show that several approximations in state, observation and action spaces in literature can be viewed as instances of AIS. In some of these cases, we obtain tighter bounds. A salient feature of AIS is that it can be learnt from data. We present AIS based multi-time scale policy gradient algorithms. and detailed numerical experiments with low, moderate and high dimensional environments.

* 78 pages

Via

Access Paper or Ask Questions