Martin Wistuba

Hyperband-based Bayesian Optimization for Black-box Prompt Selection

Dec 10, 2024

Lennart Schneider, Martin Wistuba, Aaron Klein, Jacek Golebiowski, Giovanni Zappella, Felice Antonio Merra

Abstract: Optimal prompt selection is crucial for maximizing large language model (LLM) performance on downstream tasks. As the most powerful models are proprietary and can only be invoked via an API, users often manually refine prompts in a black-box setting by adjusting instructions and few-shot examples until they achieve good performance as measured on a validation set. Recent methods addressing static black-box prompt selection face significant limitations: they often fail to leverage the inherent structure of prompts, treating instructions and few-shot exemplars as a single block of text. Moreover, they often lack query-efficiency by evaluating prompts on all validation instances, or risk sub-optimal selection of a prompt by using random subsets of validation instances. We introduce HbBoPs, a novel Hyperband-based Bayesian optimization method for black-box prompt selection addressing these key limitations. Our approach combines a structural-aware deep kernel Gaussian Process to model prompt performance with Hyperband as a multi-fidelity scheduler to select the number of validation instances for prompt evaluations. The structural-aware modeling approach utilizes separate embeddings for instructions and few-shot exemplars, enhancing the surrogate model's ability to capture prompt performance and predict which prompt to evaluate next in a sample-efficient manner. Together with Hyperband as a multi-fidelity scheduler, we further enable query-efficiency by adaptively allocating resources across different fidelity levels, keeping the total number of validation instances prompts are evaluated on low. Extensive evaluation across ten benchmarks and three LLMs demonstrates that HbBoPs outperforms state-of-the-art methods.
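
The multi-fidelity idea in this abstract can be illustrated with a single successive-halving bracket, the building block that Hyperband repeats with different starting budgets. This is a minimal sketch and not the HbBoPs implementation: it omits the Gaussian Process surrogate entirely, and the names `successive_halving`, `score_fn`, and the toy `quality` table are assumptions made for illustration. The fidelity is the number of validation instances a prompt is scored on.

```python
def successive_halving(prompts, score_fn, n_instances, min_budget=2, eta=2):
    """Keep the top 1/eta prompts each round while growing the budget by eta.

    score_fn(prompt, i) is a hypothetical callback returning the score of
    `prompt` on validation instance `i` (for example, one LLM API call).
    """
    cache = {}  # (prompt, instance index) -> score, so each pair is queried once

    def mean_score(prompt, budget):
        for i in range(budget):
            if (prompt, i) not in cache:
                cache[(prompt, i)] = score_fn(prompt, i)
        return sum(cache[(prompt, i)] for i in range(budget)) / budget

    survivors = list(prompts)
    budget = min(min_budget, n_instances)
    while True:
        ranked = sorted(survivors, key=lambda p: mean_score(p, budget), reverse=True)
        if len(ranked) == 1 or budget == n_instances:
            return ranked[0], len(cache)
        survivors = ranked[: max(1, len(ranked) // eta)]
        budget = min(budget * eta, n_instances)


# Toy, noise-free setup: each prompt has a fixed quality. Real scores are noisy.
quality = {f"prompt_{k}": k / 10 for k in range(8)}
best, queries = successive_halving(
    list(quality), lambda p, i: quality[p], n_instances=16
)
full_cost = len(quality) * 16  # cost of scoring every prompt on every instance
print(best, queries, full_cost)
```

In this toy run the bracket spends 40 queries, compared with 128 for scoring all eight prompts on all 16 instances. With noisy scores, low-budget rounds can discard good prompts, which is one reason Hyperband hedges across several brackets.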

Choice of PEFT Technique in Continual Learning: Prompt Tuning is Not All You Need

Jun 05, 2024

Martin Wistuba, Prabhu Teja Sivaprasad, Lukas Balles, Giovanni Zappella

Abstract: Recent Continual Learning (CL) methods have combined pretrained Transformers with prompt tuning, a parameter-efficient fine-tuning (PEFT) technique. We argue that the choice of prompt tuning in prior works was an undefended and unablated decision, which has been uncritically adopted by subsequent research, but warrants further research to understand its implications. In this paper, we conduct this research and find that the choice of prompt tuning as a PEFT method hurts the overall performance of the CL system. To illustrate this, we replace prompt tuning with LoRA in two state-of-the-art continual learning methods: Learning to Prompt and S-Prompts. These variants consistently achieve higher accuracy across a wide range of domain-incremental and class-incremental benchmarks, while being competitive in inference speed. Our work highlights a crucial argument: unexamined choices can hinder progress in the field, and rigorous ablations, such as the PEFT method, are required to drive meaningful adoption of CL techniques in real-world applications.

Continual Learning with Low Rank Adaptation

Nov 29, 2023

Martin Wistuba, Prabhu Teja Sivaprasad, Lukas Balles, Giovanni Zappella

Abstract: Recent work using pretrained transformers has shown impressive performance when fine-tuned with data from the downstream problem of interest. However, they struggle to retain that performance when the data characteristics change. In this paper, we focus on continual learning, where a pre-trained transformer is updated to perform well on new data, while retaining its performance on data it was previously trained on. Earlier works have tackled this primarily through methods inspired by prompt tuning. We question this choice, and investigate the applicability of Low Rank Adaptation (LoRA) to continual learning. On a range of domain-incremental learning benchmarks, our LoRA-based solution, CoLoR, yields state-of-the-art performance, while still being as parameter efficient as the prompt-tuning-based methods.

* Accepted at Workshop on Distribution Shifts (DistShift), NeurIPS 2023
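
For readers unfamiliar with LoRA, the sketch below shows the generic low-rank update that the abstract refers to: a frozen weight matrix `W` plus a trainable product `B @ A` scaled by `alpha / rank`. It illustrates plain LoRA only and says nothing about how CoLoR handles multiple tasks; the dimensions and variable names are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 64, 64, 4, 8.0  # illustrative sizes, not from the paper

W = rng.normal(size=(d_out, d_in))             # frozen pretrained weight
A = rng.normal(scale=0.01, size=(rank, d_in))  # trainable, small random init
B = np.zeros((d_out, rank))                    # trainable, zero init: update starts at 0


def lora_forward(x, W, A, B, alpha, rank):
    """Compute x @ (W + (alpha / rank) * B @ A).T without forming the full update."""
    return x @ W.T + (alpha / rank) * (x @ A.T) @ B.T


x = rng.normal(size=(2, d_in))

# With B = 0 the adapted layer reproduces the pretrained layer exactly.
y0 = lora_forward(x, W, A, B, alpha, rank)

# After training, B is non-zero; the update can be merged into a single matrix.
B_trained = rng.normal(size=(d_out, rank))  # stand-in for learned values
y1 = lora_forward(x, W, A, B_trained, alpha, rank)
merged = W + (alpha / rank) * B_trained @ A

full_params = W.size           # parameters touched by full fine-tuning
lora_params = A.size + B.size  # parameters trained by LoRA
print(full_params, lora_params)
```

Here LoRA trains 512 parameters instead of 4096 for this one layer, which is the sense in which the approach is parameter efficient.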

Renate: A Library for Real-World Continual Learning

Apr 24, 2023

Martin Wistuba, Martin Ferianc, Lukas Balles, Cedric Archambeau, Giovanni Zappella

Abstract: Continual learning enables the incremental training of machine learning models on non-stationary data streams. While academic interest in the topic is high, there is little indication of the use of state-of-the-art continual learning algorithms in practical machine learning deployment. This paper presents Renate, a continual learning library designed to build real-world updating pipelines for PyTorch models. We discuss requirements for the use of continual learning algorithms in practice, from which we derive design principles for Renate. We give a high-level description of the library components and interfaces. Finally, we showcase the strengths of the library by presenting experimental results. Renate may be found at https://github.com/awslabs/renate.

* Paper accepted at the CLVision workshop at CVPR 2023

Variational Boosted Soft Trees

Feb 22, 2023

Tristan Cinquin, Tammo Rukat, Philipp Schmidt, Martin Wistuba, Artur Bekasov

Abstract: Gradient boosting machines (GBMs) based on decision trees consistently demonstrate state-of-the-art results on regression and classification tasks with tabular data, often outperforming deep neural networks. However, these models do not provide well-calibrated predictive uncertainties, which prevents their use for decision making in high-risk applications. The Bayesian treatment is known to improve predictive uncertainty calibration, but previously proposed Bayesian GBM methods are either computationally expensive, or resort to crude approximations. Variational inference is often used to implement Bayesian neural networks, but is difficult to apply to GBMs, because the decision trees used as weak learners are non-differentiable. In this paper, we propose to implement Bayesian GBMs using variational inference with soft decision trees, a fully differentiable alternative to standard decision trees introduced by Irsoy et al. Our experiments demonstrate that variational soft trees and variational soft GBMs provide useful uncertainty estimates, while retaining good predictive performance. The proposed models show higher test likelihoods when compared to the state-of-the-art Bayesian GBMs in 7/10 tabular regression datasets and improved out-of-distribution detection in 5/10 datasets.
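
To make the notion of a differentiable tree concrete, the sketch below evaluates a soft decision tree in which every internal node blends its two children with a sigmoid gate rather than routing the input down one branch. It covers only the deterministic forward pass; the variational treatment of the weights described in the abstract is not shown, and the dictionary layout (`w`, `b`, `left`, `right`, `value`) is a hypothetical structure chosen for brevity.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def soft_tree_predict(x, node):
    """Recursively blend child predictions with a sigmoid gate instead of a hard split."""
    if "value" in node:  # leaf: constant prediction
        return node["value"]
    gate = sigmoid(x @ node["w"] + node["b"])  # soft probability of going left
    left = soft_tree_predict(x, node["left"])
    right = soft_tree_predict(x, node["right"])
    return gate * left + (1.0 - gate) * right


# A depth-1 tree that gates on the first feature.
tree = {
    "w": np.array([1.0, 0.0]),
    "b": 0.0,
    "left": {"value": 1.0},
    "right": {"value": -1.0},
}

pred_boundary = soft_tree_predict(np.array([0.0, 0.0]), tree)  # gate is exactly 0.5
pred_far_left = soft_tree_predict(np.array([10.0, 0.0]), tree)  # gate is close to 1
print(pred_boundary, pred_far_left)
```

Because the output is a smooth function of `w` and `b`, gradients can flow through the gates, which is what makes gradient-based inference applicable to such trees.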

Deep Power Laws for Hyperparameter Optimization

Feb 01, 2023

Arlind Kadra, Maciej Janowski, Martin Wistuba, Josif Grabocka

Abstract: Hyperparameter optimization is an important subfield of machine learning that focuses on tuning the hyperparameters of a chosen algorithm to achieve peak performance. Recently, there has been a stream of methods that tackle the issue of hyperparameter optimization; however, most of these methods do not exploit the scaling law property of learning curves. In this work, we propose Deep Power Laws (DPL), an ensemble of neural network models conditioned to yield predictions that follow a power-law scaling pattern. Our method dynamically decides which configurations to pause and train incrementally by making use of gray-box evaluations. We compare our method against 7 state-of-the-art competitors on 3 benchmarks related to tabular, image, and NLP datasets covering 57 diverse tasks. Our method achieves the best results across all benchmarks by obtaining the best any-time results compared to all competitors.
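
The scaling-law property mentioned above can be illustrated by fitting a power-law curve to the first few points of a learning curve and extrapolating to a larger budget. This is only a sketch of the curve shape: the parameterization `alpha + beta * budget ** (-gamma)` is an assumption, and the grid search with least squares used here is a simple stand-in for the neural network ensemble that DPL uses.

```python
import numpy as np


def power_law(budget, alpha, beta, gamma):
    """Validation loss modeled as alpha + beta * budget^(-gamma)."""
    return alpha + beta * np.asarray(budget, dtype=float) ** (-gamma)


def fit_power_law(budgets, values, gammas=np.linspace(0.1, 2.0, 20)):
    """Grid over gamma; for each gamma, alpha and beta follow from linear least squares."""
    budgets = np.asarray(budgets, dtype=float)
    values = np.asarray(values, dtype=float)
    best = None
    for gamma in gammas:
        X = np.column_stack([np.ones_like(budgets), budgets ** (-gamma)])
        coef, *_ = np.linalg.lstsq(X, values, rcond=None)
        sse = float(np.sum((X @ coef - values) ** 2))
        if best is None or sse < best[0]:
            best = (sse, coef[0], coef[1], gamma)
    return best[1:]


# Synthetic, noise-free learning curve observed for the first 10 epochs.
observed_budgets = np.arange(1, 11)
observed_losses = power_law(observed_budgets, alpha=0.1, beta=0.5, gamma=0.7)

alpha_hat, beta_hat, gamma_hat = fit_power_law(observed_budgets, observed_losses)
predicted = power_law(50, alpha_hat, beta_hat, gamma_hat)  # extrapolate to epoch 50
actual = power_law(50, 0.1, 0.5, 0.7)
print(predicted, actual)
```

An extrapolation of this kind is what lets a gray-box optimizer decide, from a partial curve, whether a configuration is worth training further.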

Continual Learning with Transformers for Image Classification

Jun 28, 2022

Beyza Ermis, Giovanni Zappella, Martin Wistuba, Aditya Rawal, Cedric Archambeau

Abstract: In many real-world scenarios, data to train machine learning models become available over time. However, neural network models struggle to continually learn new concepts without forgetting what has been learnt in the past. This phenomenon is known as catastrophic forgetting and it is often difficult to prevent due to practical constraints, such as the amount of data that can be stored or the limited computational resources that can be used. Moreover, training large neural networks, such as Transformers, from scratch is very costly and requires a vast amount of training data, which might not be available in the application domain of interest. A recent trend indicates that dynamic architectures based on an expansion of the parameters can reduce catastrophic forgetting efficiently in continual learning, but these architectures need complex tuning to balance the growing number of parameters and barely share any information across tasks. As a result, they struggle to scale to a large number of tasks without significant overhead. In this paper, we validate in the computer vision domain a recent solution called Adaptive Distillation of Adapters (ADA), which was developed to perform continual learning using pre-trained Transformers and Adapters on text classification tasks. We empirically demonstrate on different classification tasks that this method maintains a good predictive performance without retraining the model or increasing the number of model parameters over time. Besides, it is significantly faster at inference time compared to the state-of-the-art methods.

* Appeared in CVPR CLVision workshop. arXiv admin note: substantial text overlap with arXiv:2203.04640

Memory Efficient Continual Learning for Neural Text Classification

Mar 09, 2022

Beyza Ermis, Giovanni Zappella, Martin Wistuba, Cedric Archambeau

Abstract: Learning text classifiers based on pre-trained language models has become the standard practice in natural language processing applications. Unfortunately, training large neural language models, such as transformers, from scratch is very costly and requires a vast amount of training data, which might not be available in the application domain of interest. Moreover, in many real-world scenarios, classes are uncovered as more data is seen, calling for class-incremental modelling approaches. In this work we devise a method to perform text classification using pre-trained models on a sequence of classification tasks. We formalize the problem as a continual learning problem where the algorithm learns new tasks without performance degradation on the previous ones and without re-training the model from scratch. We empirically demonstrate that our method requires significantly fewer model parameters compared to other state-of-the-art methods and that it is significantly faster at inference time. The tight control over the number of model parameters, and hence the memory, not only improves efficiency; it also makes it possible to use the algorithm in real-world applications where deploying a solution with constantly increasing memory consumption is unrealistic. Our method suffers little forgetting and retains a predictive performance on par with state-of-the-art but less memory-efficient methods.

Dynamic and Efficient Gray-Box Hyperparameter Optimization for Deep Learning

Feb 20, 2022

Martin Wistuba, Arlind Kadra, Josif Grabocka

Abstract: Gray-box hyperparameter optimization techniques have recently emerged as a promising direction for tuning Deep Learning methods. In this work, we introduce DyHPO, a method that learns to dynamically decide which configuration to try next, and for what budget. Our technique is a modification to the classical Bayesian optimization for a gray-box setup. Concretely, we propose a new surrogate for Gaussian Processes that embeds the learning curve dynamics and a new acquisition function that incorporates multi-budget information. We demonstrate the significant superiority of DyHPO over state-of-the-art hyperparameter optimization baselines through large-scale experiments comprising 50 datasets (Tabular, Image, NLP) and diverse neural networks (MLP, CNN/NAS, RNN).

HPO-B: A Large-Scale Reproducible Benchmark for Black-Box HPO based on OpenML

Jun 11, 2021

Sebastian Pineda Arango, Hadi S. Jomaa, Martin Wistuba, Josif Grabocka

Abstract: Hyperparameter optimization (HPO) is a core problem for the machine learning community and remains largely unsolved due to the significant computational resources required to evaluate hyperparameter configurations. As a result, a series of recent related works have focused on the direction of transfer learning for quickly fine-tuning hyperparameters on a dataset. Unfortunately, the community does not have a common large-scale benchmark for comparing HPO algorithms. Instead, the de facto practice consists of empirical protocols on arbitrary small-scale meta-datasets that vary inconsistently across publications, making reproducibility a challenge. To resolve this major bottleneck and enable a fair and fast comparison of black-box HPO methods on a level playing field, we propose HPO-B, a new large-scale benchmark in the form of a collection of meta-datasets. Our benchmark is assembled and preprocessed from the OpenML repository and consists of 176 search spaces (algorithms) evaluated sparsely on 196 datasets with a total of 6.4 million hyperparameter evaluations. To ensure reproducibility on our benchmark, we detail explicit experimental protocols, splits, and evaluation measures for comparing methods for both non-transfer and transfer learning HPO.
