Joelle Pineau

Novelty Search in Representational Space for Sample Efficient Exploration

Oct 21, 2020

Ruo Yu Tao, Vincent François-Lavet, Joelle Pineau

Abstract: We present a new approach for efficient exploration which leverages a low-dimensional encoding of the environment learned with a combination of model-based and model-free objectives. Our approach uses intrinsic rewards that are based on the distance to nearest neighbors in the low-dimensional representational space to gauge novelty. We then leverage these intrinsic rewards for sample-efficient exploration with planning routines in representational space for hard exploration tasks with sparse rewards. One key element of our approach is the use of information-theoretic principles to shape our representations so that our novelty reward goes beyond pixel similarity. We test our approach on a number of maze tasks, as well as a control problem, and show that our exploration approach is more sample-efficient than strong baselines.

* 10 pages + references + appendix. Oral presentation at NeurIPS 2020
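
The nearest-neighbor novelty signal described in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `knn_novelty`, the choice of Euclidean distance, the averaging over `k` neighbors, and the toy 2-D encodings are all assumptions made for illustration, and the learned encoder is left out entirely.

```python
import math


def knn_novelty(query, visited, k=3):
    """Mean Euclidean distance from `query` to its k nearest visited encodings.

    Hypothetical helper: a larger value means the encoded state lies farther
    from anything seen so far, so it would earn a larger intrinsic reward.
    """
    if not visited:
        raise ValueError("visited must contain at least one encoding")
    distances = sorted(math.dist(query, v) for v in visited)
    nearest = distances[:k]  # fewer than k entries if the buffer is small
    return sum(nearest) / len(nearest)


# Toy 2-D "representations" of states visited so far (made-up values).
visited = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]

near = knn_novelty((0.5, 0.5), visited, k=2)  # inside the visited region
far = knn_novelty((5.0, 5.0), visited, k=2)   # well outside it
print(near < far)
```

A state encoded far from the visited set scores higher than one surrounded by visited encodings, which is the property an exploration bonus of this kind relies on.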

Regularized Inverse Reinforcement Learning

Oct 07, 2020

Wonseok Jeon, Chen-Yang Su, Paul Barde, Thang Doan, Derek Nowrouzezahrai, Joelle Pineau

Abstract: Inverse Reinforcement Learning (IRL) aims to facilitate a learner's ability to imitate expert behavior by acquiring reward functions that explain the expert's decisions. Regularized IRL applies convex regularizers to the learner's policy in order to avoid the expert's behavior being rationalized by arbitrary constant rewards, also known as degenerate solutions. We propose analytical solutions, and practical methods to obtain them, for regularized IRL. Current methods are restricted to the maximum-entropy IRL framework, which limits them to Shannon-entropy regularizers, and they propose functional-form solutions that are generally intractable. We present theoretical backing for our proposed IRL method's applicability to both discrete and continuous controls and empirically validate its performance on a variety of tasks.

* 22 pages, 7 figures

Constrained Markov Decision Processes via Backward Value Functions

Aug 26, 2020

Harsh Satija, Philip Amortila, Joelle Pineau

Abstract: Although Reinforcement Learning (RL) algorithms have found tremendous success in simulated domains, they often cannot directly be applied to physical systems, especially in cases where there are hard constraints to satisfy (e.g. on safety or resources). In standard RL, the agent is incentivized to explore any behavior as long as it maximizes rewards, but in the real world, undesired behavior can damage either the system or the agent in a way that breaks the learning process itself. In this work, we model the problem of learning with constraints as a Constrained Markov Decision Process and provide a new on-policy formulation for solving it. A key contribution of our approach is to translate cumulative cost constraints into state-based constraints. Through this, we define a safe policy improvement method which maximizes returns while ensuring that the constraints are satisfied at every step. We provide theoretical guarantees under which the agent converges while ensuring safety over the course of training. We also highlight the computational advantages of this approach. The effectiveness of our approach is demonstrated on safe navigation tasks and in safety-constrained versions of MuJoCo environments, with deep neural networks.

How To Evaluate Your Dialogue System: Probe Tasks as an Alternative for Token-level Evaluation Metrics

Aug 24, 2020

Prasanna Parthasarathi, Joelle Pineau, Sarath Chandar

Abstract: Though generative dialogue modeling is widely seen as a language modeling task, the task demands that an agent have a complex natural language understanding of its input text to carry on a meaningful interaction with a user. The automatic metrics in use evaluate the quality of the generated text as a proxy for the holistic interaction of the agent. Such metrics were earlier shown not to correlate with human judgement. In this work, we observe that human evaluation of dialogue agents can be inconclusive due to the lack of sufficient information for appropriate evaluation. The automatic metrics are deterministic yet shallow, and human evaluation can be relevant yet inconclusive. To bridge this gap in evaluation, we propose designing a set of probing tasks to evaluate dialogue models. The hand-crafted tasks are aimed at quantitatively evaluating a generative dialogue model's understanding beyond the token-level evaluation of the generated text. The probing tasks are deterministic like automatic metrics and require human judgement in their design, benefiting from the best of both worlds. With experiments on probe tasks we observe that, unlike RNN-based architectures, a transformer model may not be learning to comprehend the input text despite its generated text having higher overlap with the target text.

Multi-Task Reinforcement Learning as a Hidden-Parameter Block MDP

Jul 28, 2020

Amy Zhang, Shagun Sodhani, Khimya Khetarpal, Joelle Pineau

Abstract: Multi-task reinforcement learning is a rich paradigm where information from previously seen environments can be leveraged for better performance and improved sample-efficiency in new environments. In this work, we leverage ideas of common structure underlying a family of Markov decision processes (MDPs) to improve performance in the few-shot regime. We use assumptions of structure from Hidden-Parameter MDPs and Block MDPs to propose a new framework, HiP-BMDP, and approach for learning a common representation and universal dynamics model. To this end, we provide transfer and generalization bounds based on task and state similarity, along with sample complexity bounds that depend on the aggregate number of samples across tasks, rather than the number of tasks, a significant improvement over prior work. To demonstrate the efficacy of the proposed method, we empirically compare and show improvements against other multi-task and meta-reinforcement learning baselines.

* Accepted at the ICML 2020 Workshop on Theoretical Foundations of Reinforcement Learning. 21 pages, 12 figures

TDprop: Does Jacobi Preconditioning Help Temporal Difference Learning?

Jul 06, 2020

Joshua Romoff, Peter Henderson, David Kanaa, Emmanuel Bengio, Ahmed Touati, Pierre-Luc Bacon, Joelle Pineau

Abstract: We investigate whether Jacobi preconditioning, accounting for the bootstrap term in temporal difference (TD) learning, can help boost performance of adaptive optimizers. Our method, TDprop, computes a per-parameter learning rate based on the diagonal preconditioning of the TD update rule. We show how this can be used in both $n$-step returns and TD($\lambda$). Our theoretical findings demonstrate that including this additional preconditioning information is, surprisingly, comparable to normal semi-gradient TD if the optimal learning rate is found for both via a hyperparameter search. In Deep RL experiments using Expected SARSA, TDprop meets or exceeds the performance of Adam in all tested games under near-optimal learning rates, but a well-tuned SGD can yield similar improvements -- matching our theory. Our findings suggest that Jacobi preconditioning may improve upon typical adaptive optimization methods in Deep RL, but despite incorporating additional information from the TD bootstrap term, may not always be better than SGD.

* Presented at the Theoretical Foundations of Reinforcement Learning workshop at ICML 2020

Deep interpretability for GWAS

Jul 03, 2020

Deepak Sharma, Audrey Durand, Marc-André Legault, Louis-Philippe Lemieux Perreault, Audrey Lemaçon, Marie-Pierre Dubé, Joelle Pineau

Abstract: Genome-Wide Association Studies are typically conducted using linear models to find genetic variants associated with common diseases. In these studies, association testing is done on a variant-by-variant basis, possibly missing out on non-linear interaction effects between variants. Deep networks can be used to model these interactions, but they are difficult to train and interpret on large genetic datasets. We propose a method that uses the gradient-based deep interpretability technique named DeepLIFT to show that known diabetes genetic risk factors can be identified using deep models, along with possibly novel associations.

* Accepted at ICML 2020 workshop on ML Interpretability for Scientific Discovery

Adversarial Soft Advantage Fitting: Imitation Learning without Policy Optimization

Jun 23, 2020

Paul Barde, Julien Roy, Wonseok Jeon, Joelle Pineau, Christopher Pal, Derek Nowrouzezahrai

Abstract: Adversarial imitation learning alternates between learning a discriminator -- which tells apart the expert's demonstrations from generated ones -- and a generator's policy to produce trajectories that can fool this discriminator. This alternated optimization is known to be delicate in practice since it compounds unstable adversarial training with brittle and sample-inefficient reinforcement learning. We propose to remove the burden of the policy optimization steps by leveraging a novel discriminator formulation. Specifically, our discriminator is explicitly conditioned on two policies: the one from the previous generator's iteration and a learnable policy. When optimized, this discriminator directly learns the optimal generator's policy. Consequently, our discriminator's update solves the generator's optimization problem for free: learning a policy that imitates the expert does not require an additional optimization loop. This formulation effectively cuts by half the implementation and computational burden of adversarial imitation learning algorithms by removing the reinforcement learning phase altogether. We show on a variety of tasks that our simpler approach is competitive with prevalent imitation learning methods.

Automated Personalized Feedback Improves Learning Gains in an Intelligent Tutoring System

May 07, 2020

Ekaterina Kochmar, Dung Do Vu, Robert Belfer, Varun Gupta, Iulian Vlad Serban, Joelle Pineau

Abstract: We investigate how automated, data-driven, personalized feedback in a large-scale intelligent tutoring system (ITS) improves student learning outcomes. We propose a machine learning approach to generate personalized feedback, which takes individual needs of students into account. We utilize state-of-the-art machine learning and natural language processing techniques to provide the students with personalized hints, Wikipedia-based explanations, and mathematical hints. Our model is used in Korbit, a large-scale dialogue-based ITS with thousands of students launched in 2019, and we demonstrate that the personalized feedback leads to considerable improvement in student learning outcomes and in the subjective evaluation of the feedback.

* To be published in Proceedings of the 21st International Conference on Artificial Intelligence in Education (AIED 2020)

Plan2Vec: Unsupervised Representation Learning by Latent Plans

May 07, 2020

Ge Yang, Amy Zhang, Ari S. Morcos, Joelle Pineau, Pieter Abbeel, Roberto Calandra

Abstract: In this paper we introduce plan2vec, an unsupervised representation learning approach that is inspired by reinforcement learning. Plan2vec constructs a weighted graph on an image dataset using near-neighbor distances, and then extrapolates this local metric to a global embedding by distilling the path integral over planned paths. When applied to control, plan2vec offers a compute- and sample-efficient way to learn goal-conditioned value estimates that are accurate over long horizons. We demonstrate the effectiveness of plan2vec on one simulated and two challenging real-world image datasets. Experimental results show that plan2vec successfully amortizes the planning cost, enabling reactive planning that is linear in memory and computation complexity rather than exhaustive over the entire state space.

* Proceedings of Machine Learning Research, the 2nd Annual Conference on Learning for Dynamics and Control (2020) Volume 120, 1-12
* code available at https://geyang.github.io/plan2vec
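
The first two steps named in the plan2vec abstract above (a weighted graph built from near-neighbor distances, then a global distance obtained by planning over that graph) can be sketched as follows. This is only an illustration under stated assumptions, not the plan2vec code: the helper names `build_neighbor_graph` and `shortest_path_length`, the fixed connection `radius`, the use of Dijkstra's algorithm as the planner, and the 2-D toy points standing in for images are all hypothetical, and the distillation of path lengths into a learned embedding is omitted.

```python
import heapq
import math


def build_neighbor_graph(points, radius):
    """Connect every pair of points closer than `radius`; edge weight = distance."""
    graph = {i: [] for i in range(len(points))}
    for i, p in enumerate(points):
        for j, q in enumerate(points):
            if i != j:
                d = math.dist(p, q)
                if d <= radius:
                    graph[i].append((j, d))
    return graph


def shortest_path_length(graph, source, target):
    """Dijkstra's algorithm; returns math.inf if `target` is unreachable."""
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == target:
            return cost
        if cost > best.get(node, math.inf):
            continue  # stale heap entry
        for neighbor, weight in graph[node]:
            new_cost = cost + weight
            if new_cost < best.get(neighbor, math.inf):
                best[neighbor] = new_cost
                heapq.heappush(heap, (new_cost, neighbor))
    return math.inf


# Toy U-shaped corridor: the two ends are close in a straight line
# but far apart along the corridor.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0),
          (2.0, 2.0), (1.0, 2.0), (0.0, 2.0)]
graph = build_neighbor_graph(points, radius=1.1)

straight = math.dist(points[0], points[6])
planned = shortest_path_length(graph, 0, 6)
print(straight, planned)
```

In this toy case the straight-line distance between the two ends understates how far apart they are along the corridor, while the graph distance reflects the detour. A global metric built from local neighbor distances is meant to capture exactly that gap.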
