Parameswaran Kamalaruban

Proximal Curriculum for Reinforcement Learning Agents

Apr 25, 2023
Georgios Tzannetos, Bárbara Gomes Ribeiro, Parameswaran Kamalaruban, Adish Singla

We consider the problem of curriculum design for reinforcement learning (RL) agents in contextual multi-task settings. Existing techniques for automatic curriculum design typically require domain-specific hyperparameter tuning or have limited theoretical underpinnings. To tackle these limitations, we design our curriculum strategy, ProCuRL, inspired by the pedagogical concept of Zone of Proximal Development (ZPD). ProCuRL captures the intuition that learning progress is maximized when picking tasks that are neither too hard nor too easy for the learner. We mathematically derive ProCuRL by analyzing two simple learning settings. We also present a practical variant of ProCuRL that can be directly integrated with deep RL frameworks with minimal hyperparameter tuning. Experimental results on a variety of domains demonstrate the effectiveness of our curriculum strategy over state-of-the-art baselines in accelerating the training process of deep RL agents.
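
To make the ZPD intuition concrete, here is a minimal sketch of scoring and sampling tasks so that those the learner solves roughly half the time are preferred. The quadratic score and the softmax sampler are illustrative assumptions; the exact ProCuRL criterion is derived in the paper.

```python
import numpy as np

def zpd_task_scores(success_prob: np.ndarray) -> np.ndarray:
    """Score tasks by the 'neither too hard nor too easy' intuition.

    success_prob[i] is the learner's estimated probability of solving task i
    with its current policy (assumed to be estimated from rollouts or a value
    function). The score peaks at intermediate success probabilities and
    vanishes when a task is trivially easy or hopelessly hard.
    """
    return success_prob * (1.0 - success_prob)

def sample_next_task(success_prob: np.ndarray, temperature: float = 0.1) -> int:
    """Pick the next training task via a softmax over the ZPD scores."""
    scores = zpd_task_scores(success_prob)
    logits = scores / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(np.random.choice(len(success_prob), p=probs))

# Example: tasks the learner solves 5%, 50%, and 95% of the time.
print(zpd_task_scores(np.array([0.05, 0.5, 0.95])))  # the middle task scores highest
```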

* Published in Transactions on Machine Learning Research (TMLR) 2023 

Learning Personalized Decision Support Policies

Apr 13, 2023
Umang Bhatt, Valerie Chen, Katherine M. Collins, Parameswaran Kamalaruban, Emma Kallina, Adrian Weller, Ameet Talwalkar

Individual human decision-makers may benefit from different forms of support to improve decision outcomes. However, a key question is which form of support will lead to accurate decisions at a low cost. In this work, we propose learning a decision support policy that, for a given input, chooses which form of support, if any, to provide. We consider decision-makers for whom we have no prior information and formalize learning their respective policies as a multi-objective optimization problem that trades off accuracy and cost. Using techniques from stochastic contextual bandits, we propose $\texttt{THREAD}$, an online algorithm to personalize a decision support policy for each decision-maker, and devise a hyperparameter tuning strategy to identify a cost-performance trade-off using simulated human behavior. We provide computational experiments to demonstrate the benefits of $\texttt{THREAD}$ compared to offline baselines. We then introduce $\texttt{Modiste}$, an interactive tool that provides an interface for $\texttt{THREAD}$. We conduct human subject experiments to show how $\texttt{Modiste}$ learns policies personalized to each decision-maker and discuss the nuances of learning decision support policies online for real users.
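
As a rough illustration of the online personalization recipe, the sketch below implements a generic epsilon-greedy contextual bandit over forms of support, with a per-arm linear accuracy model and an explicit accuracy-cost trade-off. The arm set, cost vector, and trade-off weight `lam` are hypothetical; this is not the THREAD estimator itself.

```python
import numpy as np

class LinearSupportBandit:
    """Epsilon-greedy contextual bandit over forms of decision support.

    The utility of arm a on context x is modeled as (predicted accuracy
    - lam * cost[a]), with accuracy predicted by a per-arm ridge regression.
    This is a generic sketch of the bandit recipe, not the paper's algorithm.
    """

    def __init__(self, n_arms, dim, cost, lam=0.5, eps=0.1, ridge=1.0):
        self.A = [ridge * np.eye(dim) for _ in range(n_arms)]  # X^T X + ridge * I
        self.b = [np.zeros(dim) for _ in range(n_arms)]        # X^T y
        self.cost, self.lam, self.eps = np.asarray(cost), lam, eps

    def choose(self, x):
        if np.random.rand() < self.eps:                        # explore
            return np.random.randint(len(self.b))
        theta = [np.linalg.solve(A, b) for A, b in zip(self.A, self.b)]
        utility = [t @ x - self.lam * c for t, c in zip(theta, self.cost)]
        return int(np.argmax(utility))                         # exploit

    def update(self, arm, x, correct):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += float(correct) * x

# Hypothetical arms: no support, show an explanation, defer to an expert model.
bandit = LinearSupportBandit(n_arms=3, dim=4, cost=[0.0, 0.2, 1.0])
x = np.random.randn(4)
arm = bandit.choose(x)
bandit.update(arm, x, correct=1)  # observed decision outcome for this context
```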

* Working paper 

Robust Learning from Observation with Model Misspecification

Feb 15, 2022
Luca Viano, Yu-Ting Huang, Parameswaran Kamalaruban, Craig Innes, Subramanian Ramamoorthy, Adrian Weller

Imitation learning (IL) is a popular paradigm for training policies in robotic systems when specifying the reward function is difficult. However, despite the success of IL algorithms, they impose the somewhat unrealistic requirement that the expert demonstrations must come from the same domain in which a new imitator policy is to be learned. We consider a practical setting, where (i) state-only expert demonstrations from the real (deployment) environment are given to the learner, (ii) the imitation learner policy is trained in a simulation (training) environment whose transition dynamics is slightly different from the real environment, and (iii) the learner does not have any access to the real environment during the training phase beyond the batch of demonstrations given. Most current IL methods, such as generative adversarial imitation learning and its state-only variants, fail to imitate the optimal expert behavior under the above setting. By leveraging insights from the robust reinforcement learning (RL) literature and building on recent adversarial imitation approaches, we propose a robust IL algorithm to learn policies that can effectively transfer to the real environment without fine-tuning. Furthermore, we empirically demonstrate on continuous-control benchmarks that our method outperforms the state-of-the-art state-only IL method in terms of zero-shot transfer performance in the real environment and robust performance under different testing conditions.
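
The robust-RL ingredient the method borrows can be illustrated in the tabular case: evaluate a policy under the worst-case dynamics in an uncertainty set around the simulator. The finite set of perturbed kernels below is an assumption made for illustration; the paper's algorithm works with adversarial imitation objectives rather than explicit tabular sets.

```python
import numpy as np

def robust_policy_value(P_set, R, policy, gamma=0.95, iters=500):
    """Worst-case policy evaluation over a finite uncertainty set of dynamics.

    P_set: list of transition tensors, each of shape (S, A, S), treated as an
    uncertainty set around the simulator dynamics (illustrative assumption).
    R: (S, A) reward table; policy: (S, A) action probabilities.
    Returns the worst-case state values of the policy.
    """
    V = np.zeros(P_set[0].shape[0])
    for _ in range(iters):
        # Worst-case Bellman backup: per (s, a), take the minimizing kernel.
        Q_worst = np.min([R + gamma * P @ V for P in P_set], axis=0)  # (S, A)
        V = np.sum(policy * Q_worst, axis=1)
    return V
```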

* accepted to AAMAS 2022 (camera-ready version) 

Curriculum Design for Teaching via Demonstrations: Theory and Applications

Jun 08, 2021
Gaurav Yengera, Rati Devidze, Parameswaran Kamalaruban, Adish Singla

We consider the problem of teaching via demonstrations in sequential decision-making settings. In particular, we study how to design a personalized curriculum over demonstrations to speed up the learner's convergence. We provide a unified curriculum strategy for two popular learner models: Maximum Causal Entropy Inverse Reinforcement Learning (MaxEnt-IRL) and Cross-Entropy Behavioral Cloning (CrossEnt-BC). Our unified strategy induces a ranking over demonstrations based on a notion of difficulty scores computed w.r.t. the teacher's optimal policy and the learner's current policy. Compared to the state of the art, our strategy does not require access to the learner's internal dynamics and still enjoys similar convergence guarantees under mild technical conditions. Furthermore, we adapt our curriculum strategy to teach a learner using domain knowledge in the form of task-specific difficulty scores when the teacher's optimal policy is unknown. Experiments on a car driving simulator environment and shortest path problems in a grid-world environment demonstrate the effectiveness of our proposed curriculum strategy.
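
A minimal sketch of difficulty-based ranking, assuming the score compares how likely the teacher's optimal policy and the learner's current policy are to reproduce a demonstration; the exact score and ordering used by the unified strategy are specified in the paper.

```python
import numpy as np

def demo_difficulty(demo, learner_policy, teacher_policy):
    """Illustrative difficulty score for one demonstration.

    demo: list of (state, action) pairs; *_policy[s][a] gives the probability
    of taking action a in state s. The score is the log-likelihood gap between
    the teacher's optimal policy and the learner's current policy, which is
    large when the learner still lags far behind the teacher on this demo.
    """
    log_teacher = sum(np.log(teacher_policy[s][a]) for s, a in demo)
    log_learner = sum(np.log(learner_policy[s][a]) for s, a in demo)
    return log_teacher - log_learner

def curriculum_order(demos, learner_policy, teacher_policy):
    """Rank demonstrations from smallest to largest difficulty score."""
    scores = [demo_difficulty(d, learner_policy, teacher_policy) for d in demos]
    return [demos[i] for i in np.argsort(scores)]
```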


Robust Inverse Reinforcement Learning under Transition Dynamics Mismatch

Jul 02, 2020
Luca Viano, Yu-Ting Huang, Parameswaran Kamalaruban, Volkan Cevher

We study the inverse reinforcement learning (IRL) problem under the \emph{transition dynamics mismatch} between the expert and the learner. In particular, we consider the Maximum Causal Entropy (MCE) IRL learner model and provide an upper bound on the learner's performance degradation based on the $\ell_1$-distance between the two transition dynamics of the expert and the learner. Then, by leveraging insights from the robust RL literature, we propose a robust MCE IRL algorithm, a principled approach to mitigating this mismatch. Finally, we empirically demonstrate the stable performance of our algorithm compared to the standard MCE IRL algorithm under transition mismatches in finite MDP problems.
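
The quantity driving the bound is easy to compute directly; the sketch below evaluates the largest per-state-action $\ell_1$ distance between two tabular transition kernels. The exact constants of the degradation bound are given in the paper.

```python
import numpy as np

def dynamics_l1_mismatch(P_expert, P_learner):
    """Largest per-(state, action) l1 distance between two transition kernels.

    Both arguments have shape (S, A, S). A quantity of this form drives the
    performance-degradation bound described in the abstract; the precise
    dependence on it is derived in the paper.
    """
    return np.abs(P_expert - P_learner).sum(axis=-1).max()
```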


Interaction-limited Inverse Reinforcement Learning

Jul 01, 2020
Martin Troussard, Emmanuel Pignat, Parameswaran Kamalaruban, Sylvain Calinon, Volkan Cevher

This paper proposes an inverse reinforcement learning (IRL) framework to accelerate learning when the learner-teacher \textit{interaction} is \textit{limited} during training. Our setting is motivated by realistic scenarios where a helpful teacher is not available or cannot access the learning dynamics of the student. We present two different training strategies: Curriculum Inverse Reinforcement Learning (CIRL) covering the teacher's perspective, and Self-Paced Inverse Reinforcement Learning (SPIRL) focusing on the learner's perspective. Using experiments in simulations and experiments with a real robot learning a task from a human demonstrator, we show that our training strategies enable faster training than a random teacher for CIRL and than a batch learner for SPIRL.
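
As a rough sketch of the learner-side (SPIRL) idea, generic self-paced selection admits demonstrations whose current loss falls below a threshold that grows over training. The loss, threshold schedule, and fallback rule here are illustrative assumptions rather than the paper's concrete procedure.

```python
import numpy as np

def self_paced_batch(demo_losses, threshold):
    """Pick the demonstrations the learner currently finds 'easy enough'.

    demo_losses[i] is the learner's current loss (e.g. negative log-likelihood)
    on demonstration i; the threshold grows over training so that harder
    demonstrations are admitted later. If nothing passes, fall back to the
    single easiest demonstration.
    """
    losses = np.asarray(demo_losses)
    selected = np.where(losses <= threshold)[0]
    return selected if len(selected) > 0 else np.array([int(np.argmin(losses))])

# Example schedule: admit more demonstrations as training progresses.
losses = [0.2, 1.5, 0.7, 3.0]
for step, tau in enumerate([0.5, 1.0, 2.0, 4.0]):
    print(step, self_paced_batch(losses, tau))
```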


Environment Shaping in Reinforcement Learning using State Abstraction

Jun 23, 2020
Parameswaran Kamalaruban, Rati Devidze, Volkan Cevher, Adish Singla

One of the central challenges faced by a reinforcement learning (RL) agent is to effectively learn a (near-)optimal policy in environments with large state spaces having sparse and noisy feedback signals. In real-world applications, an expert with additional domain knowledge can help in speeding up the learning process via \emph{shaping the environment}, i.e., making the environment more learner-friendly. A popular paradigm in the literature is \emph{potential-based reward shaping}, where the environment's reward function is augmented with additional local rewards using a potential function. However, the applicability of potential-based reward shaping is limited in settings where (i) the state space is very large, and it is challenging to compute an appropriate potential function, (ii) the feedback signals are noisy, and even with shaped rewards the agent could be trapped in local optima, and (iii) changing the rewards alone is not sufficient, and effective shaping requires changing the dynamics. We address these limitations of potential-based shaping methods and propose a novel framework of \emph{environment shaping using state abstraction}. Our key idea is to compress the environment's large state space with noisy signals to an abstracted space, and to use this abstraction in creating smoother and more effective feedback signals for the agent. We study the theoretical underpinnings of our abstraction-based environment shaping, and show that the agent's policy learnt in the shaped environment preserves near-optimal behavior in the original environment.
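
A minimal sketch of the reward-shaping ingredient, assuming a potential defined through an abstraction map over states; the paper's environment-shaping framework goes further and also modifies the dynamics through the abstraction.

```python
def shaped_reward(r, s, s_next, alpha, V_abs, gamma=0.99):
    """Potential-based shaping with a potential defined on abstract states.

    alpha maps a concrete state to an abstract state; V_abs is a value table
    on the (much smaller) abstract space. Defining the potential through the
    abstraction yields smoother feedback than the raw sparse reward. This is
    only the reward-shaping ingredient of the broader framework described in
    the abstract.
    """
    phi = lambda state: V_abs[alpha(state)]
    return r + gamma * phi(s_next) - phi(s)
```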


Robust Reinforcement Learning via Adversarial training with Langevin Dynamics

Feb 14, 2020
Parameswaran Kamalaruban, Yu-Ting Huang, Ya-Ping Hsieh, Paul Rolland, Cheng Shi, Volkan Cevher

We introduce a sampling perspective to tackle the challenging task of training robust Reinforcement Learning (RL) agents. Leveraging Stochastic Gradient Langevin Dynamics, we present a novel, scalable two-player RL algorithm, which is a sampling variant of the two-player policy gradient method. Our algorithm consistently outperforms existing baselines, in terms of generalization across different training and testing conditions, on several MuJoCo environments. Our experiments also show that, even for objective functions that entirely ignore potential environmental shifts, our sampling approach remains highly robust in comparison to standard RL algorithms.
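
The core update can be sketched as a standard SGLD step: a gradient step plus Gaussian exploration noise scaled by the step size and an inverse temperature. The two-player algorithm alternates updates of this form for the protagonist and the adversary; the parameters and gradient below are placeholders, not the paper's full procedure.

```python
import numpy as np

def sgld_step(params, grad, lr, inv_temp, rng):
    """One Stochastic Gradient Langevin Dynamics ascent step.

    Adds Gaussian noise scaled by sqrt(2 * lr / inv_temp) to a plain gradient
    step, so the iterates sample from (rather than only maximize) the
    objective, which is the sampling perspective described in the abstract.
    """
    noise = rng.normal(size=params.shape)
    return params + lr * grad + np.sqrt(2.0 * lr / inv_temp) * noise

rng = np.random.default_rng(0)
theta = np.zeros(8)   # protagonist parameters (illustrative)
grad = np.ones(8)     # policy-gradient estimate (placeholder)
theta = sgld_step(theta, grad, lr=1e-2, inv_temp=100.0, rng=rng)
```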


Optimization for Reinforcement Learning: From Single Agent to Cooperative Agents

Dec 01, 2019
Donghwan Lee, Niao He, Parameswaran Kamalaruban, Volkan Cevher

This article reviews recent advances in multi-agent reinforcement learning algorithms for large-scale control systems and communication networks, which learn to communicate and cooperate. We provide an overview of this emerging field, with an emphasis on the decentralized setting under different coordination protocols. We highlight the evolution of reinforcement learning algorithms from single-agent to multi-agent systems, from a distributed optimization perspective, and conclude with future directions and challenges, in the hope of catalyzing the growing synergy among the distributed optimization, signal processing, and reinforcement learning communities.


Interactive Teaching Algorithms for Inverse Reinforcement Learning

Jun 05, 2019
Parameswaran Kamalaruban, Rati Devidze, Volkan Cevher, Adish Singla

We study the problem of inverse reinforcement learning (IRL) with the added twist that the learner is assisted by a helpful teacher. More formally, we tackle the following algorithmic question: How could a teacher provide an informative sequence of demonstrations to an IRL learner to speed up the learning process? We present an interactive teaching framework where a teacher adaptively chooses the next demonstration based on the learner's current policy. In particular, we design teaching algorithms for two concrete settings: an omniscient setting where a teacher has full knowledge about the learner's dynamics and a blackbox setting where the teacher has minimal knowledge. Then, we study a sequential variant of the popular MCE-IRL learner and prove convergence guarantees of our teaching algorithm in the omniscient setting. Extensive experiments with a car driving simulator environment show that learning progress can be sped up drastically compared to an uninformative teacher.
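
As an illustrative sketch of greedy teaching, assume the teacher picks the candidate demonstration whose feature counts best align with the gap between the teacher's and the learner's feature expectations; the omniscient algorithm in the paper exploits the learner's exact update rule, which this sketch abstracts away.

```python
import numpy as np

def pick_next_demo(demo_features, learner_feature_exp, teacher_feature_exp):
    """Greedy teacher choice of the next demonstration (illustrative).

    demo_features[i] is the feature-count vector of candidate demonstration i.
    The teacher selects the demonstration most aligned with the remaining gap
    between the learner's current feature expectation and the teacher's,
    i.e. the direction in which the learner still has the most to correct.
    """
    gap = teacher_feature_exp - learner_feature_exp
    scores = np.asarray(demo_features) @ gap
    return int(np.argmax(scores))
```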

* IJCAI'19 paper (extended version) 