Doina Precup

McGill University, Mila - Quebec Artificial Intelligence Institute

Discrete Probabilistic Inference as Control in Multi-path Environments

Feb 15, 2024

Tristan Deleu, Padideh Nouri, Nikolay Malkin, Doina Precup, Yoshua Bengio

Abstract: We consider the problem of sampling from a discrete and structured distribution as a sequential decision problem, where the objective is to find a stochastic policy such that objects are sampled at the end of this sequential process proportionally to some predefined reward. While we could use maximum entropy Reinforcement Learning (MaxEnt RL) to solve this problem for some distributions, it has been shown that in general, the distribution over states induced by the optimal policy may be biased in cases where there are multiple ways to generate the same object. To address this issue, Generative Flow Networks (GFlowNets) learn a stochastic policy that samples objects proportionally to their reward by approximately enforcing a conservation of flows across the whole Markov Decision Process (MDP). In this paper, we extend recent methods correcting the reward in order to guarantee that the marginal distribution induced by the optimal MaxEnt RL policy is proportional to the original reward, regardless of the structure of the underlying MDP. We also prove that some flow-matching objectives found in the GFlowNet literature are in fact equivalent to well-established MaxEnt RL algorithms with a corrected reward. Finally, we study empirically the performance of multiple MaxEnt RL and GFlowNet algorithms on multiple problems involving sampling from discrete distributions.
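The multi-path bias mentioned in this abstract can be illustrated with a toy calculation. This is a sketch under stated assumptions, not the paper's method or experiments: it assumes deterministic transitions, a terminal reward of log R(x), and an entropy temperature of 1, in which case the optimal MaxEnt RL policy samples each complete trajectory with probability proportional to R(x). The object names, rewards, and path counts are made up, and splitting the reward evenly across trajectories is only a stand-in for a reward correction.

```python
# Toy example: hypothetical objects, rewards, and trajectory counts.
rewards = {"a": 1.0, "b": 2.0, "c": 1.0}   # target: P(x) proportional to R(x)
num_paths = {"a": 1, "b": 1, "c": 3}       # "c" can be generated in 3 different ways


def normalize(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights.values())
    return {key: value / total for key, value in weights.items()}


# Target distribution: proportional to the reward.
target = normalize(rewards)

# Under the assumptions above, each trajectory ending in x has weight R(x),
# so the marginal over objects is proportional to num_paths[x] * R(x).
maxent_marginal = normalize({x: num_paths[x] * rewards[x] for x in rewards})

# Stand-in correction: split each object's reward evenly over its trajectories,
# so the per-object weights sum back to R(x).
corrected_marginal = normalize(
    {x: num_paths[x] * (rewards[x] / num_paths[x]) for x in rewards}
)

print("target:   ", target)
print("MaxEnt RL:", maxent_marginal)
print("corrected:", corrected_marginal)
```

In this toy case the uncorrected marginal over-samples "c" because three trajectories lead to it, while the corrected weights recover the target distribution.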

Mixtures of Experts Unlock Parameter Scaling for Deep RL

Feb 13, 2024

Johan Obando-Ceron, Ghada Sokar, Timon Willi, Clare Lyle, Jesse Farebrother, Jakob Foerster, Gintare Karolina Dziugaite, Doina Precup, Pablo Samuel Castro

Abstract: The recent rapid progress in (self) supervised learning models is in large part predicted by empirical scaling laws: a model's performance scales proportionally to its size. Analogous scaling laws remain elusive for reinforcement learning domains, however, where increasing the parameter count of a model often hurts its final performance. In this paper, we demonstrate that incorporating Mixture-of-Expert (MoE) modules, and in particular Soft MoEs (Puigcerver et al., 2023), into value-based networks results in more parameter-scalable models, evidenced by substantial performance increases across a variety of training regimes and model sizes. This work thus provides strong empirical evidence towards developing scaling laws for reinforcement learning.

On the Privacy of Selection Mechanisms with Gaussian Noise

Feb 09, 2024

Jonathan Lebensold, Doina Precup, Borja Balle

Abstract: Report Noisy Max and Above Threshold are two classical differentially private (DP) selection mechanisms. Their output is obtained by adding noise to a sequence of low-sensitivity queries and reporting the identity of the query whose (noisy) answer satisfies a certain condition. Pure DP guarantees for these mechanisms are easy to obtain when Laplace noise is added to the queries. On the other hand, when instantiated using Gaussian noise, standard analyses only yield approximate DP guarantees despite the fact that the outputs of these mechanisms lie in a discrete space. In this work, we revisit the analysis of Report Noisy Max and Above Threshold with Gaussian noise and show that, under the additional assumption that the underlying queries are bounded, it is possible to provide pure ex-ante DP bounds for Report Noisy Max and pure ex-post DP bounds for Above Threshold. The resulting bounds are tight and depend on closed-form expressions that can be numerically evaluated using standard methods. Empirically we find these lead to tighter privacy accounting in the high privacy, low data regime. Further, we propose a simple privacy filter for composing pure ex-post DP guarantees, and use it to derive a fully adaptive Gaussian Sparse Vector Technique mechanism. Finally, we provide experiments on mobility and energy consumption datasets demonstrating that our Sparse Vector Technique is practically competitive with previous approaches and requires less hyper-parameter tuning.

* AISTATS 2024
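The two selection mechanisms named in this abstract can be sketched in their textbook form with Gaussian noise. This is a minimal illustration, not the paper's analysis: the function names and the `sigma` parameter are hypothetical, the noisy-threshold step follows the classical Above Threshold description and is an assumption here, and calibrating `sigma` to a privacy guarantee is out of scope.

```python
import random


def report_noisy_max(query_answers, sigma, rng=None):
    """Return the index of the largest answer after adding Gaussian noise.

    `sigma` is the noise standard deviation (an assumed, uncalibrated value).
    """
    if rng is None:
        rng = random.Random()
    noisy = [answer + rng.gauss(0.0, sigma) for answer in query_answers]
    return max(range(len(noisy)), key=lambda i: noisy[i])


def above_threshold(query_answers, threshold, sigma, rng=None):
    """Return the index of the first noisy answer that reaches a noisy threshold.

    Returns None if no query passes. Noise is added to the threshold once and
    to every query answer independently.
    """
    if rng is None:
        rng = random.Random()
    noisy_threshold = threshold + rng.gauss(0.0, sigma)
    for index, answer in enumerate(query_answers):
        if answer + rng.gauss(0.0, sigma) >= noisy_threshold:
            return index
    return None


if __name__ == "__main__":
    rng = random.Random(0)
    counts = [12.0, 30.0, 28.0, 5.0]  # hypothetical low-sensitivity query answers
    print(report_noisy_max(counts, sigma=2.0, rng=rng))
    print(above_threshold(counts, threshold=25.0, sigma=2.0, rng=rng))
```

Both functions output only an index (or None), which is the discrete output space the abstract refers to.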

QGFN: Controllable Greediness with Action Values

Feb 07, 2024

Elaine Lau, Stephen Zhewen Lu, Ling Pan, Doina Precup, Emmanuel Bengio

Abstract: Generative Flow Networks (GFlowNets; GFNs) are a family of reward/energy-based generative methods for combinatorial objects, capable of generating diverse and high-utility samples. However, biasing GFNs towards producing high-utility samples is non-trivial. In this work, we leverage connections between GFNs and reinforcement learning (RL) and propose to combine the GFN policy with an action-value estimate, $Q$, to create greedier sampling policies which can be controlled by a mixing parameter. We show that several variants of the proposed method, QGFN, are able to improve on the number of high-reward samples generated in a variety of tasks without sacrificing diversity.

* Under review
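One simple way to picture a sampling policy whose greediness is controlled by a mixing parameter is sketched below. This is an assumption-laden illustration and may not match any QGFN variant exactly: the function `mixed_action`, the parameter `p`, and the example probabilities and action values are all hypothetical. With probability `p` the sampler takes the action with the highest action value; otherwise it samples from the given policy probabilities.

```python
import random


def mixed_action(policy_probs, q_values, p, rng=None):
    """Pick an action index from a mixture of a stochastic policy and greedy-in-Q.

    policy_probs: per-action sampling probabilities from the stochastic policy.
    q_values:     per-action action-value estimates (same length).
    p:            mixing parameter in [0, 1]; larger means greedier.
    """
    if len(policy_probs) != len(q_values):
        raise ValueError("policy_probs and q_values must have the same length")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if rng is None:
        rng = random.Random()
    actions = list(range(len(policy_probs)))
    if rng.random() < p:
        # Greedy branch: action with the highest estimated value.
        return max(actions, key=lambda a: q_values[a])
    # Otherwise sample from the stochastic policy.
    return rng.choices(actions, weights=policy_probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    probs = [0.5, 0.3, 0.2]   # hypothetical policy probabilities
    q = [0.1, 0.9, 0.4]       # hypothetical action values
    samples = [mixed_action(probs, q, p=0.5, rng=rng) for _ in range(10)]
    print(samples)
```

Setting `p=0` recovers the original stochastic policy and `p=1` gives a fully greedy sampler, so the parameter trades diversity against reward.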

Code as Reward: Empowering Reinforcement Learning with VLMs

Feb 07, 2024

David Venuto, Sami Nur Islam, Martin Klissarov, Doina Precup, Sherry Yang, Ankit Anand

Abstract: Pre-trained Vision-Language Models (VLMs) are able to understand visual concepts, describe and decompose complex tasks into sub-tasks, and provide feedback on task completion. In this paper, we aim to leverage these capabilities to support the training of reinforcement learning (RL) agents. In principle, VLMs are well suited for this purpose, as they can naturally analyze image-based observations and provide feedback (reward) on learning progress. However, inference in VLMs is computationally expensive, so querying them frequently to compute rewards would significantly slow down the training of an RL agent. To address this challenge, we propose a framework named Code as Reward (VLM-CaR). VLM-CaR produces dense reward functions from VLMs through code generation, thereby significantly reducing the computational burden of querying the VLM directly. We show that the dense rewards generated through our approach are very accurate across a diverse set of discrete and continuous environments, and can be more effective in training RL policies than the original sparse environment rewards.

Effective Protein-Protein Interaction Exploration with PPIretrieval

Feb 06, 2024

Chenqing Hua, Connor Coley, Guy Wolf, Doina Precup, Shuangjia Zheng

Abstract: Protein-protein interactions (PPIs) are crucial in regulating numerous cellular functions, including signal transduction, transportation, and immune defense. As the accuracy of multi-chain protein complex structure prediction improves, the challenge has shifted towards effectively navigating the vast complex universe to identify potential PPIs. Herein, we propose PPIretrieval, the first deep learning-based model for protein-protein interaction exploration, which leverages existing PPI data to effectively search for potential PPIs in an embedding space, capturing rich geometric and chemical information of protein surfaces. When provided with an unseen query protein with its associated binding site, PPIretrieval effectively identifies a potential binding partner along with its corresponding binding site in an embedding space, facilitating the formation of protein-protein complexes.

Prediction and Control in Continual Reinforcement Learning

Dec 18, 2023

Nishanth Anand, Doina Precup

Abstract: Temporal difference (TD) learning is often used to update the estimate of the value function which is used by RL agents to extract useful policies. In this paper, we focus on value function estimation in continual reinforcement learning. We propose to decompose the value function into two components which update at different timescales: a permanent value function, which holds general knowledge that persists over time, and a transient value function, which allows quick adaptation to new situations. We establish theoretical results showing that our approach is well suited for continual learning and draw connections to the complementary learning systems (CLS) theory from neuroscience. Empirically, this approach improves performance significantly on both prediction and control problems.

* Published at the 37th Conference on Neural Information Processing Systems (NeurIPS 2023)
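The idea of splitting a value estimate into a slowly changing part and a quickly adapting part can be sketched in tabular form. This is a minimal illustration under assumptions, not the paper's algorithm: the class name, the learning rates, and the specific `consolidate` and `reset_transient` rules are hypothetical choices made to show two timescales.

```python
class PermanentTransientValue:
    """Tabular value estimate V(s) = permanent[s] + transient[s] (illustrative)."""

    def __init__(self, num_states, fast_lr=0.5, slow_lr=0.1, gamma=0.9):
        self.permanent = [0.0] * num_states  # slow component, persists over time
        self.transient = [0.0] * num_states  # fast component, adapts quickly
        self.fast_lr = fast_lr
        self.slow_lr = slow_lr
        self.gamma = gamma

    def value(self, state):
        return self.permanent[state] + self.transient[state]

    def td_update(self, state, reward, next_state, done):
        """Fast timescale: apply the TD error of the combined value to the transient part."""
        bootstrap = 0.0 if done else self.gamma * self.value(next_state)
        td_error = reward + bootstrap - self.value(state)
        self.transient[state] += self.fast_lr * td_error
        return td_error

    def consolidate(self):
        """Slow timescale (assumed rule): nudge the permanent part toward the combined value."""
        for state in range(len(self.permanent)):
            self.permanent[state] += self.slow_lr * self.transient[state]

    def reset_transient(self):
        """Assumed rule: clear the transient part, for example when the task changes."""
        self.transient = [0.0] * len(self.transient)


if __name__ == "__main__":
    v = PermanentTransientValue(num_states=2)
    v.td_update(state=0, reward=1.0, next_state=1, done=True)
    v.consolidate()
    v.reset_transient()
    print(v.permanent, v.transient)
```

Here the transient table absorbs new experience immediately, while the permanent table changes only through the occasional, small consolidation step.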

Nash Learning from Human Feedback

Dec 06, 2023

Rémi Munos, Michal Valko, Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Zhaohan Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mesnard, Andrea Michi(+7 more)

Abstract: Reinforcement learning from human feedback (RLHF) has emerged as the main paradigm for aligning large language models (LLMs) with human preferences. Typically, RLHF involves the initial step of learning a reward model from human feedback, often expressed as preferences between pairs of text generations produced by a pre-trained LLM. Subsequently, the LLM's policy is fine-tuned by optimizing it to maximize the reward model through a reinforcement learning algorithm. However, an inherent limitation of current reward models is their inability to fully represent the richness of human preferences and their dependency on the sampling distribution. In this study, we introduce an alternative pipeline for the fine-tuning of LLMs using pairwise human feedback. Our approach entails the initial learning of a preference model, which is conditioned on two inputs given a prompt, followed by the pursuit of a policy that consistently generates responses preferred over those generated by any competing policy, thus defining the Nash equilibrium of this preference model. We term this approach Nash learning from human feedback (NLHF). In the context of a tabular policy representation, we present a novel algorithmic solution, Nash-MD, founded on the principles of mirror descent. This algorithm produces a sequence of policies, with the last iteration converging to the regularized Nash equilibrium. Additionally, we explore parametric representations of policies and introduce gradient descent algorithms for deep-learning architectures. To demonstrate the effectiveness of our approach, we present experimental results involving the fine-tuning of an LLM for a text summarization task. We believe NLHF offers a compelling avenue for preference learning and policy optimization with the potential of advancing the field of aligning LLMs with human preferences.

Learning domain-invariant classifiers for infant cry sounds

Nov 30, 2023

Charles C. Onu, Hemanth K. Sheetha, Arsenii Gorin, Doina Precup

Abstract: The issue of domain shift remains a problematic phenomenon in most real-world datasets and clinical audio is no exception. In this work, we study the nature of domain shift in a clinical database of infant cry sounds acquired across different geographies. We find that though the pitches of infant cries are similarly distributed regardless of the place of birth, other characteristics introduce peculiar biases into the data. We explore methodologies for mitigating the impact of domain shift in a model for identifying neurological injury from cry sounds. We adapt unsupervised domain adaptation methods from computer vision which learn an audio representation that is domain-invariant to hospitals and is task discriminative. We also propose a new approach, target noise injection (TNI), for unsupervised domain adaptation which requires neither labels nor training data from the target domain. Our best-performing model significantly improves target accuracy by 7.2%, without negatively affecting the source domain.

Finding Increasingly Large Extremal Graphs with AlphaZero and Tabu Search

Nov 06, 2023

Abbas Mehrabian, Ankit Anand, Hyunjik Kim, Nicolas Sonnerat, Matej Balog, Gheorghe Comanici, Tudor Berariu, Andrew Lee, Anian Ruoss, Anna Bulanova(+9 more)

Abstract: This work studies a central extremal graph theory problem inspired by a 1975 conjecture of Erdős, which aims to find graphs with a given size (number of nodes) that maximize the number of edges without having 3- or 4-cycles. We formulate this problem as a sequential decision-making problem and compare AlphaZero, a neural network-guided tree search, with tabu search, a heuristic local search method. Using either method, by introducing a curriculum -- jump-starting the search for larger graphs using good graphs found at smaller sizes -- we improve the state-of-the-art lower bounds for several sizes. We also propose a flexible graph-generation environment and a permutation-invariant network architecture for learning to search in the space of graphs.

* Accepted at the MATH AI workshop at NeurIPS 2023. First three authors contributed equally; last two authors have equal senior contribution.
