Matthew Riemer

Finding Macro-Actions with Disentangled Effects for Efficient Planning with the Goal-Count Heuristic

Apr 28, 2020
Cameron Allen, Tim Klinger, George Konidaris, Matthew Riemer, Gerald Tesauro

The difficulty of classical planning increases exponentially with search-tree depth. Heuristic search can make planning more efficient, but good heuristics often require domain-specific assumptions and may not generalize to new problems. Rather than treating the planning problem as fixed and carefully designing a heuristic to match it, we instead construct macro-actions that support efficient planning with the simple and general-purpose "goal-count" heuristic. Our approach searches for macro-actions that modify only a small number of state variables (we call this measure "entanglement"). We show experimentally that reducing entanglement exponentially decreases planning time with the goal-count heuristic. Our method discovers macro-actions with disentangled effects that dramatically improve planning efficiency for 15-puzzle and Rubik's cube, reliably solving each domain without prior knowledge, and solving Rubik's cube with orders of magnitude less data than competing approaches.
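
For readers unfamiliar with the goal-count heuristic, the sketch below shows the two quantities the abstract refers to: the heuristic value of a state and the "entanglement" of a macro-action. The state representation and helper names are illustrative assumptions, not the released code linked below.

```python
# Minimal sketch, assuming states are fixed-length tuples of state variables.

def goal_count(state, goal):
    """Goal-count heuristic: number of state variables that differ from the goal."""
    # e.g. goal_count((1, 2, 3), (1, 3, 3)) == 1
    return sum(s != g for s, g in zip(state, goal))

def entanglement(macro, state, apply_action):
    """Number of state variables a macro-action's net effect modifies from `state`.

    `macro` is a sequence of primitive actions and `apply_action(state, action)`
    is a hypothetical transition function for the domain.
    """
    result = state
    for action in macro:
        result = apply_action(result, action)
    return sum(s != r for s, r in zip(state, result))
```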

* Code available at https://github.com/camall3n/skills-for-planning 

On the Role of Weight Sharing During Deep Option Learning

Feb 06, 2020
Matthew Riemer, Ignacio Cases, Clemens Rosenbaum, Miao Liu, Gerald Tesauro

The options framework is a popular approach for building temporally extended actions in reinforcement learning. In particular, the option-critic architecture provides general-purpose policy gradient theorems for learning temporally extended actions from scratch. However, past work makes the key assumption that each of the components of option-critic has independent parameters. In this work we note that while this assumption holds in the tabular case, it is always violated in practice in the deep function approximation setting. We therefore reconsider this assumption and propose more general extensions of option-critic and hierarchical option-critic training that optimize the full architecture with each update. It turns out that dropping the parameter-independence assumption challenges the belief in prior work that training the policy over options can be disentangled from the dynamics of the underlying options. In fact, learning can be sped up by focusing the policy over options on states where options are actually likely to terminate. We put our new algorithms to the test on sample-efficient learning of Atari games and demonstrate significantly improved stability and faster convergence when learning long options.
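
As a rough illustration of the weight sharing at issue, the sketch below shows a typical deep option-critic parameterization in which the policy over options, the intra-option policies, and the termination functions all branch off one shared feature trunk, so their parameters are coupled. The architecture and sizes are placeholder assumptions, not the paper's model.

```python
import torch
import torch.nn as nn

class SharedOptionCritic(nn.Module):
    """Illustrative only: every head's gradient flows through the shared trunk."""
    def __init__(self, obs_dim, n_options, n_actions):
        super().__init__()
        self.n_options, self.n_actions = n_options, n_actions
        self.trunk = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU())   # shared parameters
        self.policy_over_options = nn.Linear(64, n_options)
        self.intra_option_policies = nn.Linear(64, n_options * n_actions)
        self.terminations = nn.Linear(64, n_options)

    def forward(self, obs):
        h = self.trunk(obs)  # updating any head also updates these shared weights
        return (self.policy_over_options(h),
                self.intra_option_policies(h).view(-1, self.n_options, self.n_actions),
                torch.sigmoid(self.terminations(h)))
```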

* AAAI 2020 

Hierarchical Average Reward Policy Gradient Algorithms

Nov 20, 2019
Akshay Dharmavaram, Matthew Riemer, Shalabh Bhatnagar

Option-critic learning is a general-purpose reinforcement learning (RL) framework that aims to address the issue of long-term credit assignment by leveraging temporal abstractions. However, when dealing with extended timescales, discounting future rewards can lead to incorrect credit assignment. In this work, we address this issue by extending the hierarchical option-critic policy gradient theorem to the average reward criterion. Our proposed framework aims to maximize the long-term reward obtained in the steady state of the Markov chain defined by the agent's policy. Furthermore, we use an ordinary-differential-equation-based approach for our convergence analysis and prove that the parameters of the intra-option policies, termination functions, and value functions converge to their corresponding optimal values with probability one. Finally, we illustrate the competitive advantage of learning options, in the average reward setting, on a grid-world environment with sparse rewards.
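
For reference, the average reward criterion mentioned here is standardly written as follows (a schematic statement only; the paper's contribution is the hierarchical option-critic extension and its convergence analysis):

```latex
\rho(\theta)
  \;=\; \lim_{T \to \infty} \frac{1}{T}\,
        \mathbb{E}\!\left[\sum_{t=1}^{T} r_t \;\middle|\; \pi_\theta\right]
  \;=\; \sum_{s} d^{\pi_\theta}(s) \sum_{a} \pi_\theta(a \mid s)\, r(s, a),
```

where \(d^{\pi_\theta}\) denotes the steady-state distribution of the Markov chain induced by the policy \(\pi_\theta\).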

* 6 pages, 3 figures, to be published in Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence 

Routing Networks and the Challenges of Modular and Compositional Computation

Apr 29, 2019
Clemens Rosenbaum, Ignacio Cases, Matthew Riemer, Tim Klinger

Compositionality is a key strategy for addressing combinatorial complexity and the curse of dimensionality. Recent work has shown that compositional solutions can be learned and offer substantial gains across a variety of domains, including multi-task learning, language modeling, visual question answering, machine comprehension, and others. However, such models present unique challenges during training, when both the module parameters and their composition must be learned jointly. In this paper, we identify several of these issues and analyze their underlying causes. Our discussion focuses on routing networks, a general approach to this problem, and empirically examines the interplay between these challenges and a variety of design decisions. In particular, we consider the effects of how the algorithm decides on module composition, how the algorithm updates the modules, and whether the algorithm uses regularization.
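
As a toy illustration of the joint learning problem described here, the sketch below routes each input through one of several candidate modules, so the module parameters and the routing decision must be learned together. The names and the hard argmax routing are illustrative simplifications; routing networks are typically trained with reinforcement learning or a soft relaxation of the discrete choice.

```python
import torch
import torch.nn as nn

class RoutedLayer(nn.Module):
    """Toy routing-network layer (illustrative): one module is selected per input."""
    def __init__(self, dim, n_modules):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_modules)])
        self.router = nn.Linear(dim, n_modules)

    def forward(self, x):                        # x: (batch, dim)
        choices = self.router(x).argmax(dim=-1)  # hard routing decision per input
        return torch.stack([self.experts[int(c)](xi) for xi, c in zip(x, choices)])
```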


Continual Learning with Self-Organizing Maps

Apr 19, 2019
Pouya Bashivan, Martin Schrimpf, Robert Ajemian, Irina Rish, Matthew Riemer, Yuhai Tu

Despite remarkable successes achieved by modern neural networks in a wide range of applications, these networks perform best in domain-specific stationary environments where they are trained only once on large-scale controlled data repositories. When exposed to non-stationary learning environments, current neural networks tend to forget what they had previously learned, a phenomenon known as catastrophic forgetting. Most previous approaches to this problem rely on memory replay buffers, which store samples from previously learned tasks and use them to regularize learning on new ones. This approach has the important disadvantage of not scaling well to real-life problems, in which the memory requirements become enormous. We propose a memoryless method that combines standard supervised neural networks with self-organizing maps to solve the continual learning problem. The role of the self-organizing map is to adaptively cluster the inputs into appropriate task contexts, without explicit labels, and allocate network resources accordingly. It thus selectively routes the inputs in accord with previous experience, ensuring that past learning is maintained and does not interfere with current learning. Our method is intuitive, memoryless, and performs on par with current state-of-the-art approaches on standard benchmarks.
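
A minimal sketch of the clustering-based routing idea, assuming a plain online SOM update with Euclidean matching: the best-matching unit serves as the inferred task context that indexes which network resources to use. The class and its details are illustrative, not the authors' implementation.

```python
import numpy as np

class SOMRouter:
    """Assigns an input to a task context (best-matching unit) without task labels."""
    def __init__(self, n_units, dim, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.prototypes = rng.normal(size=(n_units, dim))
        self.lr = lr

    def assign(self, x, update=True):
        dists = np.linalg.norm(self.prototypes - x, axis=1)
        bmu = int(dists.argmin())          # best-matching unit = inferred task context
        if update:                         # online SOM step: nudge the winner toward the input
            self.prototypes[bmu] += self.lr * (x - self.prototypes[bmu])
        return bmu
```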

* Continual Learning Workshop - NeurIPS 2018 

Learning Hierarchical Teaching in Cooperative Multiagent Reinforcement Learning

Mar 07, 2019
Dong Ki Kim, Miao Liu, Shayegan Omidshafiei, Sebastian Lopez-Cot, Matthew Riemer, Golnaz Habibi, Gerald Tesauro, Sami Mourad, Murray Campbell, Jonathan P. How

Heterogeneous knowledge naturally arises among different agents in cooperative multiagent reinforcement learning. As such, learning can be greatly improved if agents can effectively pass their knowledge on to other agents. Existing work has demonstrated that peer-to-peer knowledge transfer, a process referred to as action advising, improves team-wide learning. In contrast to previous frameworks that advise at the level of primitive actions, we aim to learn high-level teaching policies that decide when and what high-level action (e.g., a sub-goal) to advise a teammate. We introduce a new learning-to-teach framework called hierarchical multiagent teaching (HMAT). The proposed framework overcomes difficulties faced by prior work on multiagent teaching in domains with long horizons, delayed rewards, and continuous states and actions by leveraging temporal abstraction and deep function approximation. Our empirical evaluations show that HMAT accelerates team-wide learning progress in difficult environments that are more complex than those explored in previous work. HMAT also learns teaching policies that can be transferred to different teammates and tasks, and can even teach teammates with heterogeneous action spaces.


Learning Abstract Options

Nov 06, 2018
Matthew Riemer, Miao Liu, Gerald Tesauro

Building systems that autonomously create temporal abstractions from data is a key challenge in scaling learning and planning in reinforcement learning. One popular approach for addressing this challenge is the options framework (Sutton et al., 1999). However, only recently was a policy gradient theorem derived for learning general-purpose options online in an end-to-end fashion (Bacon et al., 2017). In this work, we extend previous work on this topic, which focuses on learning a two-level hierarchy of options and primitive actions, to enable learning simultaneously at multiple resolutions in time. We achieve this by considering an arbitrarily deep hierarchy of options in which high-level, temporally extended options are composed of lower-level options with finer resolutions in time. We extend the results of Bacon et al. (2017) and derive policy gradient theorems for a deep hierarchy of options. Our proposed hierarchical option-critic architecture is capable of learning internal policies, termination conditions, and hierarchical compositions over options without the need for any intrinsic rewards or subgoals. Our empirical results in both discrete and continuous environments demonstrate the efficiency of our framework.
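
Schematically, an N-level hierarchy of options composes choices from coarse to fine: the top level selects an option, each intermediate level selects a lower-level option given its parent's choice, and the bottom level emits a primitive action. At a time step where every level selects anew (termination dynamics omitted), the action distribution factors as below. The notation is an illustrative simplification, not the paper's formal statement.

```latex
\pi(a \mid s)
  \;=\; \sum_{o^1} \pi^1(o^1 \mid s)
        \sum_{o^2} \pi^2(o^2 \mid s, o^1) \cdots
        \sum_{o^{N-1}} \pi^{N-1}(o^{N-1} \mid s, o^1, \dots, o^{N-2})\,
        \pi^{N}(a \mid s, o^1, \dots, o^{N-1}).
```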

* NIPS 2018 

Learning to Learn without Forgetting By Maximizing Transfer and Minimizing Interference

Oct 29, 2018
Matthew Riemer, Ignacio Cases, Robert Ajemian, Miao Liu, Irina Rish, Yuhai Tu, Gerald Tesauro

Poor performance at continual learning over non-stationary distributions of data remains a major challenge in scaling neural network learning to more human-realistic settings. In this work we propose a new conceptualization of the continual learning problem in terms of a trade-off between transfer and interference. We then propose a new algorithm, Meta-Experience Replay (MER), that directly exploits this view by combining experience replay with optimization-based meta-learning. This method learns parameters that make interference based on future gradients less likely and transfer based on future gradients more likely. We conduct experiments across continual lifelong supervised learning benchmarks and non-stationary reinforcement learning environments, demonstrating that our approach consistently outperforms recently proposed baselines for continual learning. Our experiments show that the gap between the performance of MER and baseline algorithms grows both as the environment becomes more non-stationary and as the fraction of the total experiences stored becomes smaller.
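
A minimal sketch of a Reptile-style meta-experience-replay update of the kind described here, assuming a classification loss and a simple list-based replay buffer of (inputs, labels) batches; reservoir sampling, within-batch interpolation, and the paper's exact batching scheme are omitted. Names and hyperparameters are illustrative.

```python
import copy
import random
import torch
import torch.nn.functional as F

def mer_style_update(model, optimizer, buffer, current_batch, gamma=0.1, steps=5):
    """One meta-replay step: SGD over interleaved old/new batches, then a Reptile move."""
    start = copy.deepcopy(model.state_dict())
    batches = random.sample(buffer, k=steps - 1) + [current_batch]
    for x, y in batches:                       # sequence of SGD steps on replayed + current data
        optimizer.zero_grad()
        F.cross_entropy(model(x), y).backward()
        optimizer.step()
    end = model.state_dict()
    # Reptile-style meta step: move the starting weights part-way toward the end weights,
    # which (to first order) encourages gradient alignment across the sampled batches.
    model.load_state_dict({k: start[k] + gamma * (end[k] - start[k]) for k in start})
```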


PepCVAE: Semi-Supervised Targeted Design of Antimicrobial Peptide Sequences

Oct 22, 2018
Payel Das, Kahini Wadhawan, Oscar Chang, Tom Sercu, Cicero Dos Santos, Matthew Riemer, Inkit Padhi, Vijil Chenthamarakshan, Aleksandra Mojsilovic

Given the emerging global threat of antimicrobial resistance, new methods for next-generation antimicrobial design are urgently needed. We report a peptide generation framework, PepCVAE, based on a semi-supervised variational autoencoder (VAE) model, for designing novel antimicrobial peptide (AMP) sequences. Our model learns a rich latent space of the biological peptide context by taking advantage of abundant, unlabeled peptide sequences. The model further learns a disentangled antimicrobial attribute space by using feedback from a jointly trained AMP classifier that uses a limited number of labeled instances. The disentangled representation allows for controllable generation of AMPs. Extensive analysis of the PepCVAE-generated sequences reveals superior performance of our model in comparison to a plain VAE, as PepCVAE generates novel AMP sequences with higher long-range diversity while staying closer to the training distribution of biological peptides. These features are highly desirable in next-generation antimicrobial design.
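
The kind of semi-supervised objective described here can be sketched as a sequence VAE ELBO plus an attribute-classification term that is only active on the labeled subset. The function below is a simplified illustration under those assumptions; it omits the annealing, decoding, and disentanglement details of the actual model.

```python
import torch
import torch.nn.functional as F

def semi_supervised_vae_loss(recon_logits, targets, mu, logvar,
                             attr_logits=None, attr_labels=None,
                             beta=1.0, lam=1.0):
    """recon_logits: (batch, seq_len, vocab); targets: (batch, seq_len) token ids;
    attr_logits/attr_labels: classifier outputs and float 0/1 labels on labeled AMP data."""
    recon = F.cross_entropy(recon_logits.transpose(1, 2), targets)        # sequence reconstruction
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())         # KL to the Gaussian prior
    loss = recon + beta * kl
    if attr_logits is not None:                                           # labeled instances only
        loss = loss + lam * F.binary_cross_entropy_with_logits(attr_logits, attr_labels)
    return loss
```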
