Aaron Courville

Universite de Montreal

Scattered Mixture-of-Experts Implementation

Mar 13, 2024

Shawn Tan, Yikang Shen, Rameswar Panda, Aaron Courville

Abstract: We present ScatterMoE, an implementation of Sparse Mixture-of-Experts (SMoE) on GPUs. ScatterMoE builds upon existing implementations, overcoming some of their limitations to improve inference and training speed as well as memory footprint. It achieves this by avoiding padding and excessive copies of the input. We introduce ParallelLinear, the main component we use to build our implementation, and the various kernels used to speed up the operation. We benchmark our implementation against Megablocks and show that it enables higher throughput and a lower memory footprint. We also show how ParallelLinear enables extension of the Mixture-of-Experts concept by demonstrating an implementation of Mixture of Attention.
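The idea of avoiding padding and input copies can be illustrated with a small pure-Python sketch. It is not the authors' GPU kernels or the ParallelLinear API; the function name `scattered_moe_forward`, the scalar "tokens", and the toy experts are all assumptions made for illustration. Routed token-expert pairs are sorted by expert so each expert handles a contiguous run sized to its actual load, and results are scattered back by index.

```python
def scattered_moe_forward(tokens, expert_ids, gate_weights, experts):
    """Toy sparse MoE forward pass (hypothetical helper, not the paper's code).

    tokens:       list of scalar inputs, one per token
    expert_ids:   per token, the list of experts it is routed to (top-k)
    gate_weights: per token, the routing weight for each chosen expert
    experts:      list of callables, one per expert
    """
    # One (expert, token index, gate) entry per routed token-expert pair.
    routed = [
        (expert, token_idx, gate)
        for token_idx, (ids, gates) in enumerate(zip(expert_ids, gate_weights))
        for expert, gate in zip(ids, gates)
    ]
    # Sorting by expert id makes each expert's work a contiguous run whose
    # length equals its real load, so no padding to a fixed capacity is needed.
    routed.sort(key=lambda entry: entry[0])

    outputs = [0.0] * len(tokens)
    for expert, token_idx, gate in routed:
        # Inputs are read through the index instead of being copied into a
        # per-expert buffer; results are scattered back to the token's slot.
        outputs[token_idx] += gate * experts[expert](tokens[token_idx])
    return outputs


toy_experts = [lambda x: 2 * x, lambda x: x + 1]
result = scattered_moe_forward(
    tokens=[1.0, 2.0, 3.0],
    expert_ids=[[0], [1], [0, 1]],
    gate_weights=[[1.0], [1.0], [0.5, 0.5]],
    experts=toy_experts,
)
```

In this toy run the third token is routed to both experts and its output is the gate-weighted sum of their results.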

In deep reinforcement learning, a pruned network is a good network

Feb 19, 2024

Johan Obando-Ceron, Aaron Courville, Pablo Samuel Castro

Abstract: Recent work has shown that deep reinforcement learning agents have difficulty in effectively using their network parameters. We leverage prior insights into the advantages of sparse training techniques and demonstrate that gradual magnitude pruning enables agents to maximize parameter effectiveness. This results in networks that yield dramatic performance improvements over traditional networks and exhibit a type of "scaling law", using only a small fraction of the full network parameters.
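Gradual magnitude pruning can be sketched as a sparsity schedule plus a magnitude-based mask. The abstract does not state the schedule, so the cubic ramp below (a commonly used polynomial schedule) is an assumption, as are the function names `target_sparsity` and `magnitude_mask` and the flat list of weights.

```python
def target_sparsity(step, start_step, end_step, final_sparsity, initial_sparsity=0.0):
    """Sparsity to enforce at `step`, ramping cubically (assumed schedule)."""
    if step <= start_step:
        return initial_sparsity
    if step >= end_step:
        return final_sparsity
    progress = (step - start_step) / (end_step - start_step)
    # Prunes quickly at first and more slowly as the target is approached.
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3


def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that zeroes the smallest-magnitude weights."""
    n_prune = int(round(sparsity * len(weights)))
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(by_magnitude[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]


# Halfway through a 0 -> 90% ramp, then a 50% mask on four toy weights.
halfway = target_sparsity(step=50, start_step=0, end_step=100, final_sparsity=0.9)
mask = magnitude_mask([0.5, -0.1, 2.0, 0.05], sparsity=0.5)
```

During training, the mask would be recomputed as the target sparsity rises and applied to the weights after each update.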

V-STaR: Training Verifiers for Self-Taught Reasoners

Feb 09, 2024

Arian Hosseini, Xingdi Yuan, Nikolay Malkin, Aaron Courville, Alessandro Sordoni, Rishabh Agarwal

Abstract: Common self-improvement approaches for large language models (LLMs), such as STaR (Zelikman et al., 2022), iteratively fine-tune LLMs on self-generated solutions to improve their problem-solving ability. However, these approaches discard the large amounts of incorrect solutions generated during this process, potentially neglecting valuable information in such solutions. To address this shortcoming, we propose V-STaR that utilizes both the correct and incorrect solutions generated during the self-improvement process to train a verifier using DPO that judges correctness of model-generated solutions. This verifier is used at inference time to select one solution among many candidate solutions. Running V-STaR for multiple iterations results in progressively better reasoners and verifiers, delivering a 4% to 17% test accuracy improvement over existing self-improvement and verification approaches on common code generation and math reasoning benchmarks with LLaMA2 models.
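The two uses of generated solutions described above can be sketched in a few lines. This is only an illustration under stated assumptions: pairing every correct solution with every incorrect one as (preferred, dispreferred) data is an assumed way of preparing DPO training pairs, and `build_preference_pairs`, `select_best`, and the toy scoring function are hypothetical names, not the authors' code.

```python
from itertools import product


def build_preference_pairs(solutions):
    """Pair each correct solution with each incorrect one for one problem.

    solutions: list of (solution_text, is_correct) tuples.
    Returns (preferred, dispreferred) pairs (assumed data layout).
    """
    correct = [text for text, ok in solutions if ok]
    incorrect = [text for text, ok in solutions if not ok]
    return list(product(correct, incorrect))


def select_best(candidates, verifier_score):
    """At inference time, keep the candidate the verifier scores highest."""
    return max(candidates, key=verifier_score)


pairs = build_preference_pairs([("x = 2", True), ("x = 3", False), ("x = 5", False)])

# Stand-in verifier for the example: a lookup table of made-up scores.
toy_scores = {"x = 2": 0.9, "x = 3": 0.2, "x = 5": 0.1}
best = select_best(list(toy_scores), toy_scores.get)
```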

Language Model Alignment with Elastic Reset

Dec 06, 2023

Michael Noukhovitch, Samuel Lavoie, Florian Strub, Aaron Courville

Abstract: Finetuning language models with reinforcement learning (RL), e.g. from human feedback (HF), is a prominent method for alignment. But optimizing against a reward model can improve on reward while degrading performance in other areas, a phenomenon known as reward hacking, alignment tax, or language drift. First, we argue that commonly-used test metrics are insufficient and instead measure how different algorithms trade off between reward and drift. The standard method modifies the reward with a Kullback-Leibler (KL) penalty between the online and initial model. We propose Elastic Reset, a new algorithm that achieves higher reward with less drift without explicitly modifying the training objective. We periodically reset the online model to an exponential moving average (EMA) of itself, then reset the EMA model to the initial model. Through the use of an EMA, our model recovers quickly after resets and achieves higher reward with less drift in the same number of steps. We demonstrate that fine-tuning language models with Elastic Reset leads to state-of-the-art performance on a small-scale pivot-translation benchmark, outperforms all baselines in a medium-scale RLHF-like IMDB mock sentiment task and leads to a more performant and more aligned technical QA chatbot with LLaMA-7B. Code available at github.com/mnoukhov/elastic-reset.

* Published at NeurIPS 2023
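The reset rule in the abstract (online model reset to its EMA, then EMA reset to the initial model) can be sketched on plain lists of parameters. This is a minimal illustration, not the code from the linked repository: `elastic_reset_training`, `ema_update`, the `train_step` callback, and the values of `decay` and `reset_every` are assumptions.

```python
def ema_update(ema, online, decay):
    """Move the EMA parameters a fraction (1 - decay) toward the online ones."""
    return [decay * e + (1.0 - decay) * o for e, o in zip(ema, online)]


def elastic_reset_training(init_params, train_step, num_steps, reset_every, decay=0.5):
    """Run `train_step` repeatedly with periodic resets (hypothetical helper).

    train_step: callable mapping current parameters to updated parameters,
                standing in for one RL update against the reward model.
    """
    online = list(init_params)
    ema = list(init_params)
    for step in range(1, num_steps + 1):
        online = train_step(online)
        ema = ema_update(ema, online, decay)
        if step % reset_every == 0:
            online = list(ema)        # online model <- EMA of itself
            ema = list(init_params)   # EMA model <- initial model
    return online, ema


# Toy run: each "update" adds 1.0 to the single parameter.
online, ema = elastic_reset_training(
    init_params=[0.0],
    train_step=lambda params: [p + 1.0 for p in params],
    num_steps=2,
    reset_every=2,
)
```

After the reset at step 2, the online parameter sits between the initial value and the un-reset value, which is the drift-limiting effect the abstract describes.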

Learning and Controlling Silicon Dopant Transitions in Graphene using Scanning Transmission Electron Microscopy

Nov 21, 2023

Max Schwarzer, Jesse Farebrother, Joshua Greaves, Ekin Dogus Cubuk, Rishabh Agarwal, Aaron Courville, Marc G. Bellemare, Sergei Kalinin, Igor Mordatch, Pablo Samuel Castro (+1 more)

Abstract: We introduce a machine learning approach to determine the transition dynamics of silicon atoms on a single layer of carbon atoms, when stimulated by the electron beam of a scanning transmission electron microscope (STEM). Our method is data-centric, leveraging data collected on a STEM. The data samples are processed and filtered to produce symbolic representations, which we use to train a neural network to predict transition probabilities. These learned transition dynamics are then leveraged to guide a single silicon atom throughout the lattice to pre-determined target destinations. We present empirical analyses that demonstrate the efficacy and generality of our approach.

Improving Compositional Generalization Using Iterated Learning and Simplicial Embeddings

Oct 28, 2023

Yi Ren, Samuel Lavoie, Mikhail Galkin, Danica J. Sutherland, Aaron Courville

Abstract: Compositional generalization, the ability of an agent to generalize to unseen combinations of latent factors, is easy for humans but hard for deep neural networks. A line of research in cognitive science has hypothesized a process, "iterated learning," to help explain how human language developed this ability; the theory rests on simultaneous pressures towards compressibility (when an ignorant agent learns from an informed one) and expressivity (when it uses the representation for downstream tasks). Inspired by this process, we propose to improve the compositional generalization of deep networks by using iterated learning on models with simplicial embeddings, which can approximately discretize representations. This approach is further motivated by an analysis of compositionality based on Kolmogorov complexity. We show that this combination of changes improves compositional generalization over other approaches, demonstrating these improvements both on vision tasks with well-understood latent factors and on real molecular graph prediction tasks where the latent structure is unknown.

Group Robust Classification Without Any Group Information

Oct 28, 2023

Christos Tsirigotis, Joao Monteiro, Pau Rodriguez, David Vazquez, Aaron Courville

Abstract: Empirical risk minimization (ERM) is sensitive to spurious correlations in the training data, which poses a significant risk when deploying systems trained under this paradigm in high-stakes applications. While the existing literature focuses on maximizing group-balanced or worst-group accuracy, estimating these accuracies is hindered by costly bias annotations. This study contends that current bias-unsupervised approaches to group robustness continue to rely on group information to achieve optimal performance. Firstly, these methods implicitly assume that all group combinations are represented during training. To illustrate this, we introduce a systematic generalization task on the MPI3D dataset and discover that current algorithms fail to improve the ERM baseline when combinations of observed attribute values are missing. Secondly, bias labels are still crucial for effective model selection, restricting the practicality of these methods in real-world scenarios. To address these limitations, we propose a revised methodology for training and validating debiased models in an entirely bias-unsupervised manner. We achieve this by employing pretrained self-supervised models to reliably extract bias information, which enables the integration of a logit adjustment training loss with our validation criterion. Our empirical analysis on synthetic and real-world tasks provides evidence that our approach overcomes the identified challenges and consistently enhances robust accuracy, attaining performance that is competitive with or better than that of state-of-the-art methods, which, conversely, rely on bias labels for validation.

* Accepted at the 37th Conference on Neural Information Processing Systems (NeurIPS 2023). Code is available at https://github.com/tsirif/uLA

Sparse Universal Transformer

Oct 11, 2023

Shawn Tan, Yikang Shen, Zhenfang Chen, Aaron Courville, Chuang Gan

Abstract: The Universal Transformer (UT) is a variant of the Transformer that shares parameters across its layers. Empirical evidence shows that UTs have better compositional generalization than Vanilla Transformers (VTs) in formal language tasks. The parameter-sharing also affords it better parameter efficiency than VTs. Despite its many advantages, scaling UT parameters is much more compute and memory intensive than scaling up a VT. This paper proposes the Sparse Universal Transformer (SUT), which leverages Sparse Mixture of Experts (SMoE) and a new stick-breaking-based dynamic halting mechanism to reduce UT's computational complexity while retaining its parameter efficiency and generalization ability. Experiments show that SUT achieves the same performance as strong baseline models while using only half the computation and parameters on WMT'14, and strong generalization results on formal language tasks (Logical inference and CFQ). The new halting mechanism also enables around 50% reduction in computation during inference with very little performance decrease on formal language tasks.
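A generic stick-breaking construction gives a feel for how per-layer halting scores can be turned into a distribution over "halt at layer t". The abstract does not spell out the paper's parameterization, so the sketch below is the textbook construction only; `stick_breaking_halting` and the example scores are assumptions.

```python
def stick_breaking_halting(halt_scores):
    """Turn per-layer halting scores in [0, 1] into halting probabilities.

    The probability of halting at layer t is halt_scores[t] times the
    probability of not having halted at any earlier layer. Returns the
    per-layer probabilities and the leftover mass (never halted).
    """
    remaining = 1.0  # portion of the "stick" not yet broken off
    probs = []
    for score in halt_scores:
        probs.append(remaining * score)
        remaining *= 1.0 - score
    return probs, remaining


probs, leftover = stick_breaking_halting([0.5, 0.5])
```

Because the per-layer probabilities and the leftover mass always sum to one, layers after most of the mass has been spent contribute little, which is what makes skipping them at inference time plausible.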

Diffusion Generative Flow Samplers: Improving learning signals through partial trajectory optimization

Oct 04, 2023

Dinghuai Zhang, Ricky Tian Qi Chen, Cheng-Hao Liu, Aaron Courville, Yoshua Bengio

Abstract: We tackle the problem of sampling from intractable high-dimensional density functions, a fundamental task that often appears in machine learning and statistics. We extend recent sampling-based approaches that leverage controlled stochastic processes to model approximate samples from these target densities. The main drawback of these approaches is that the training objective requires full trajectories to compute, resulting in sluggish credit assignment due to the use of entire trajectories and a learning signal present only at the terminal time. In this work, we present Diffusion Generative Flow Samplers (DGFS), a sampling-based framework where the learning process can be tractably broken down into short partial trajectory segments, via parameterizing an additional "flow function". Our method takes inspiration from the theory developed for generative flow networks (GFlowNets), allowing us to make use of intermediate learning signals and benefit from off-policy exploration capabilities. Through a variety of challenging experiments, we demonstrate that DGFS results in more accurate estimates of the normalization constant than closely-related prior methods.

Meta-Value Learning: a General Framework for Learning with Learning Awareness

Jul 17, 2023

Tim Cooijmans, Milad Aghajohari, Aaron Courville

Abstract: Gradient-based learning in multi-agent systems is difficult because the gradient derives from a first-order model which does not account for the interaction between agents' learning processes. LOLA (arXiv:1709.04326) accounts for this by differentiating through one step of optimization. We extend the ideas of LOLA and develop a fully-general value-based approach to optimization. At the core is a function we call the meta-value, which at each point in joint-policy space gives for each agent a discounted sum of its objective over future optimization steps. We argue that the gradient of the meta-value gives a more reliable improvement direction than the gradient of the original objective, because the meta-value derives from empirical observations of the effects of optimization. We show how the meta-value can be approximated by training a neural network to minimize TD error along optimization trajectories in which agents follow the gradient of the meta-value. We analyze the behavior of our method on the Logistic Game and on the Iterated Prisoner's Dilemma.

* Submitted to NeurIPS 2023
