Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Lucas Caccia

MILA

Multi-Head Adapter Routing for Data-Efficient Fine-Tuning

Nov 07, 2022

Lucas Caccia, Edoardo Ponti, Lucas Liu, Matheus Pereira, Nicolas Le Roux, Alessandro Sordoni

Abstract:Parameter-efficient fine-tuning (PEFT) methods can adapt large language models to downstream tasks by training a small amount of newly added parameters. In multi-task settings, PEFT adapters typically train on each task independently, inhibiting transfer across tasks, or on the concatenation of all tasks, which can lead to negative interference. To address this, Polytropon (Ponti et al.) jointly learns an inventory of PEFT adapters and a routing function to share variable-size sets of adapters across tasks. Subsequently, adapters can be re-combined and fine-tuned on novel tasks even with limited data. In this paper, we investigate to what extent the ability to control which adapters are active for each task leads to sample-efficient generalization. Thus, we propose less expressive variants where we perform weighted averaging of the adapters before few-shot adaptation (Poly-mu) instead of learning a routing function. Moreover, we introduce more expressive variants where finer-grained task-adapter allocation is learned through a multi-head routing function (Poly-S). We test these variants on three separate benchmarks for multi-task learning. We find that Poly-S achieves gains on all three (up to 5.3 points on average) over strong baselines, while incurring a negligible additional cost in parameter count. In particular, we find that instruction tuning, where models are fully fine-tuned on natural language instructions for each task, is inferior to modular methods such as Polytropon and our proposed variants.

* Preprint

Via

Access Paper or Ask Questions

New Insights on Reducing Abrupt Representation Change in Online Continual Learning

Mar 08, 2022

Lucas Caccia, Rahaf Aljundi, Nader Asadi, Tinne Tuytelaars, Joelle Pineau, Eugene Belilovsky

Figure 1 for New Insights on Reducing Abrupt Representation Change in Online Continual Learning

Figure 2 for New Insights on Reducing Abrupt Representation Change in Online Continual Learning

Figure 3 for New Insights on Reducing Abrupt Representation Change in Online Continual Learning

Figure 4 for New Insights on Reducing Abrupt Representation Change in Online Continual Learning

Abstract:In the online continual learning paradigm, agents must learn from a changing distribution while respecting memory and compute constraints. Experience Replay (ER), where a small subset of past data is stored and replayed alongside new data, has emerged as a simple and effective learning strategy. In this work, we focus on the change in representations of observed data that arises when previously unobserved classes appear in the incoming data stream, and new classes must be distinguished from previous ones. We shed new light on this question by showing that applying ER causes the newly added classes' representations to overlap significantly with the previous classes, leading to highly disruptive parameter updates. Based on this empirical analysis, we propose a new method which mitigates this issue by shielding the learned representations from drastic adaptation to accommodate new classes. We show that using an asymmetric update rule pushes new classes to adapt to the older ones (rather than the reverse), which is more effective especially at task boundaries, where much of the forgetting typically occurs. Empirical results show significant gains over strong baselines on standard continual learning benchmarks

* Accepted at ICLR 2022. Code Available at https://github.com/pclucas14/AML

Via

Access Paper or Ask Questions

On Anytime Learning at Macroscale

Jun 17, 2021

Lucas Caccia, Jing Xu, Myle Ott, Marc'Aurelio Ranzato, Ludovic Denoyer

Figure 1 for On Anytime Learning at Macroscale

Figure 2 for On Anytime Learning at Macroscale

Figure 3 for On Anytime Learning at Macroscale

Figure 4 for On Anytime Learning at Macroscale

Abstract:Classical machine learning frameworks assume access to a possibly large dataset in order to train a predictive model. In many practical applications however, data does not arrive all at once, but in batches over time. This creates a natural trade-off between accuracy of a model and time to obtain such a model. A greedy predictor could produce non-trivial predictions by immediately training on batches as soon as these become available but, it may also make sub-optimal use of future data. On the other hand, a tardy predictor could wait for a long time to aggregate several batches into a larger dataset, but ultimately deliver a much better performance. In this work, we consider such a streaming learning setting, which we dub {\em anytime learning at macroscale} (ALMA). It is an instance of anytime learning applied not at the level of a single chunk of data, but at the level of the entire sequence of large batches. We first formalize this learning setting, we then introduce metrics to assess how well learners perform on the given task for a given memory and compute budget, and finally we test several baseline approaches on standard benchmarks repurposed for anytime learning at macroscale. The general finding is that bigger models always generalize better. In particular, it is important to grow model capacity over time if the initial model is relatively small. Moreover, updating the model at an intermediate rate strikes the best trade off between accuracy and time to obtain a useful predictor.

Via

Access Paper or Ask Questions

SPeCiaL: Self-Supervised Pretraining for Continual Learning

Jun 16, 2021

Lucas Caccia, Joelle Pineau

Figure 1 for SPeCiaL: Self-Supervised Pretraining for Continual Learning

Figure 2 for SPeCiaL: Self-Supervised Pretraining for Continual Learning

Figure 3 for SPeCiaL: Self-Supervised Pretraining for Continual Learning

Figure 4 for SPeCiaL: Self-Supervised Pretraining for Continual Learning

Abstract:This paper presents SPeCiaL: a method for unsupervised pretraining of representations tailored for continual learning. Our approach devises a meta-learning objective that differentiates through a sequential learning process. Specifically, we train a linear model over the representations to match different augmented views of the same image together, each view presented sequentially. The linear model is then evaluated on both its ability to classify images it just saw, and also on images from previous iterations. This gives rise to representations that favor quick knowledge retention with minimal forgetting. We evaluate SPeCiaL in the Continual Few-Shot Learning setting, and show that it can match or outperform other supervised pretraining approaches.

Via

Access Paper or Ask Questions

Decoupled Greedy Learning of CNNs for Synchronous and Asynchronous Distributed Learning

Jun 11, 2021

Eugene Belilovsky, Louis Leconte, Lucas Caccia, Michael Eickenberg, Edouard Oyallon

Figure 1 for Decoupled Greedy Learning of CNNs for Synchronous and Asynchronous Distributed Learning

Figure 2 for Decoupled Greedy Learning of CNNs for Synchronous and Asynchronous Distributed Learning

Figure 3 for Decoupled Greedy Learning of CNNs for Synchronous and Asynchronous Distributed Learning

Figure 4 for Decoupled Greedy Learning of CNNs for Synchronous and Asynchronous Distributed Learning

Abstract:A commonly cited inefficiency of neural network training using back-propagation is the update locking problem: each layer must wait for the signal to propagate through the full network before updating. Several alternatives that can alleviate this issue have been proposed. In this context, we consider a simple alternative based on minimal feedback, which we call Decoupled Greedy Learning (DGL). It is based on a classic greedy relaxation of the joint training objective, recently shown to be effective in the context of Convolutional Neural Networks (CNNs) on large-scale image classification. We consider an optimization of this objective that permits us to decouple the layer training, allowing for layers or modules in networks to be trained with a potentially linear parallelization. With the use of a replay buffer we show that this approach can be extended to asynchronous settings, where modules can operate and continue to update with possibly large communication delays. To address bandwidth and memory issues we propose an approach based on online vector quantization. This allows to drastically reduce the communication bandwidth between modules and required memory for replay buffers. We show theoretically and empirically that this approach converges and compare it to the sequential solvers. We demonstrate the effectiveness of DGL against alternative approaches on the CIFAR-10 dataset and on the large-scale ImageNet dataset.

* arXiv admin note: substantial text overlap with arXiv:1901.08164

Via

Access Paper or Ask Questions

Reducing Representation Drift in Online Continual Learning

Apr 11, 2021

Lucas Caccia, Rahaf Aljundi, Tinne Tuytelaars, Joelle Pineau, Eugene Belilovsky

Figure 1 for Reducing Representation Drift in Online Continual Learning

Figure 2 for Reducing Representation Drift in Online Continual Learning

Figure 3 for Reducing Representation Drift in Online Continual Learning

Figure 4 for Reducing Representation Drift in Online Continual Learning

Abstract:We study the online continual learning paradigm, where agents must learn from a changing distribution with constrained memory and compute. Previous work often tackle catastrophic forgetting by overcoming changes in the space of model parameters. In this work we instead focus on the change in representations of previously observed data due to the introduction of previously unobserved class samples in the incoming data stream. We highlight the issues that arise in the practical setting where new classes must be distinguished between all previous classes. Starting from a popular approach, experience replay, we consider a metric learning based loss function, the triplet loss, which allows us to more explicitly constrain the behavior of representations. We hypothesize and empirically confirm that the selection of negatives used in the triplet loss plays a major role in the representation change, or drift, of previously observed data and can be greatly reduced by appropriate negative selection. Motivated by this we further introduce a simple adjustment to the standard cross entropy loss used in prior experience replay that achieves similar effect. Our approach greatly improves the performance of experience replay and obtains state-of-the-art on several existing benchmarks in online continual learning, while remaining efficient in both memory and compute.

Via

Access Paper or Ask Questions

Online Fast Adaptation and Knowledge Accumulation: a New Approach to Continual Learning

Mar 12, 2020

Massimo Caccia, Pau Rodriguez, Oleksiy Ostapenko, Fabrice Normandin, Min Lin, Lucas Caccia, Issam Laradji, Irina Rish, Alexande Lacoste, David Vazquez(+1 more)

Figure 1 for Online Fast Adaptation and Knowledge Accumulation: a New Approach to Continual Learning

Figure 2 for Online Fast Adaptation and Knowledge Accumulation: a New Approach to Continual Learning

Figure 3 for Online Fast Adaptation and Knowledge Accumulation: a New Approach to Continual Learning

Figure 4 for Online Fast Adaptation and Knowledge Accumulation: a New Approach to Continual Learning

Abstract:Learning from non-stationary data remains a great challenge for machine learning. Continual learning addresses this problem in scenarios where the learning agent faces a stream of changing tasks. In these scenarios, the agent is expected to retain its highest performance on previous tasks without revisiting them while adapting well to the new tasks. Two new recent continual-learning scenarios have been proposed. In meta-continual learning, the model is pre-trained to minimize catastrophic forgetting when trained on a sequence of tasks. In continual-meta learning, the goal is faster remembering, i.e., focusing on how quickly the agent recovers performance rather than measuring the agent's performance without any adaptation. Both scenarios have the potential to propel the field forward. Yet in their original formulations, they each have limitations. As a remedy, we propose a more general scenario where an agent must quickly solve (new) out-of-distribution tasks, while also requiring fast remembering. We show that current continual learning, meta learning, meta-continual learning, and continual-meta learning techniques fail in this new scenario. Accordingly, we propose a strong baseline: Continual-MAML, an online extension of the popular MAML algorithm. In our empirical experiments, we show that our method is better suited to the new scenario than the methodologies mentioned above, as well as standard continual learning and meta learning approaches.

Via

Access Paper or Ask Questions

Online Learned Continual Compression with Stacked Quantization Module

Nov 19, 2019

Lucas Caccia, Eugene Belilovsky, Massimo Caccia, Joelle Pineau

Figure 1 for Online Learned Continual Compression with Stacked Quantization Module

Figure 2 for Online Learned Continual Compression with Stacked Quantization Module

Figure 3 for Online Learned Continual Compression with Stacked Quantization Module

Figure 4 for Online Learned Continual Compression with Stacked Quantization Module

Abstract:We introduce and study the problem of Online Continual Compression, where one attempts to learn to compress and store a representative dataset from a non i.i.d data stream, while only observing each sample once. This problem is highly relevant for downstream online continual learning tasks, as well as standard learning methods under resource constrained data collection. To address this we propose a new architecture which Stacks Quantization Modules (SQM), consisting of a series of discrete autoencoders, each equipped with their own memory. Every added module is trained to reconstruct the latent space of the previous module using fewer bits, allowing the learned representation to become more compact as training progresses. This modularity has several advantages: 1) moderate compressions are quickly available early in training, which is crucial for remembering the early tasks, 2) as more data needs to be stored, earlier data becomes more compressed, freeing memory, 3) unlike previous methods, our approach does not require pretraining, even on challenging datasets. We show several potential applications of this method. We first replace the episodic memory used in Experience Replay with SQM, leading to significant gains on standard continual learning benchmarks using a fixed memory budget. We then apply our method to online compression of larger images like those from Imagenet, and show that it is also effective with other modalities, such as LiDAR data.

Via

Access Paper or Ask Questions

Online Continual Learning with Maximally Interfered Retrieval

Aug 23, 2019

Rahaf Aljundi, Lucas Caccia, Eugene Belilovsky, Massimo Caccia, Min Lin, Laurent Charlin, Tinne Tuytelaars

Figure 1 for Online Continual Learning with Maximally Interfered Retrieval

Figure 2 for Online Continual Learning with Maximally Interfered Retrieval

Figure 3 for Online Continual Learning with Maximally Interfered Retrieval

Figure 4 for Online Continual Learning with Maximally Interfered Retrieval

Abstract:Continual learning, the setting where a learning agent is faced with a never ending stream of data, continues to be a great challenge for modern machine learning systems. In particular the online or "single-pass through the data" setting has gained attention recently as a natural setting that is difficult to tackle. Methods based on replay, either generative or from a stored memory, have been shown to be effective approaches for continual learning, matching or exceeding the state of the art in a number of standard benchmarks. These approaches typically rely on randomly selecting samples from the replay memory or from a generative model, which is suboptimal. In this work we consider a controlled sampling of memories for replay. We retrieve the samples which are most interfered, i.e. whose prediction will be most negatively impacted by the foreseen parameters update. We show a formulation for this sampling criterion in both the generative replay and the experience replay setting, producing consistent gains in performance and greatly reduced forgetting.

Via

Access Paper or Ask Questions

Recurrent Value Functions

May 23, 2019

Pierre Thodoroff, Nishanth Anand, Lucas Caccia, Doina Precup, Joelle Pineau

Abstract:Despite recent successes in Reinforcement Learning, value-based methods often suffer from high variance hindering performance. In this paper, we illustrate this in a continuous control setting where state of the art methods perform poorly whenever sensor noise is introduced. To overcome this issue, we introduce Recurrent Value Functions (RVFs) as an alternative to estimate the value function of a state. We propose to estimate the value function of the current state using the value function of past states visited along the trajectory. Due to the nature of their formulation, RVFs have a natural way of learning an emphasis function that selectively emphasizes important states. First, we establish RVF's asymptotic convergence properties in tabular settings. We then demonstrate their robustness on a partially observable domain and continuous control tasks. Finally, we provide a qualitative interpretation of the learned emphasis function.

Via

Access Paper or Ask Questions