Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Sepp Hochreiter

One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation

Oct 09, 2024

Fabian Paischer, Lukas Hauzenberger, Thomas Schmied, Benedikt Alkin, Marc Peter Deisenroth, Sepp Hochreiter

Figure 1 for One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation

Figure 2 for One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation

Figure 3 for One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation

Figure 4 for One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation

Abstract:Foundation models (FMs) are pre-trained on large-scale datasets and then fine-tuned on a downstream task for a specific application. The most successful and most commonly used fine-tuning method is to update the pre-trained weights via a low-rank adaptation (LoRA). LoRA introduces new weight matrices that are usually initialized at random with a uniform rank distribution across model weights. Recent works focus on weight-driven initialization or learning of adaptive ranks during training. Both approaches have only been investigated in isolation, resulting in slow convergence or a uniform rank distribution, in turn leading to sub-optimal performance. We propose to enhance LoRA by initializing the new weights in a data-driven manner by computing singular value decomposition on minibatches of activation vectors. Then, we initialize the LoRA matrices with the obtained right-singular vectors and re-distribute ranks among all weight matrices to explain the maximal amount of variance and continue the standard LoRA fine-tuning procedure. This results in our new method Explained Variance Adaptation (EVA). We apply EVA to a variety of fine-tuning tasks ranging from language generation and understanding to image classification and reinforcement learning. EVA exhibits faster convergence than competitors and attains the highest average score across a multitude of tasks per domain.

* 10 pages + references and appendix, code available at https://github.com/ml-jku/EVA

Via

Access Paper or Ask Questions

Contrastive Abstraction for Reinforcement Learning

Oct 01, 2024

Vihang Patil, Markus Hofmarcher, Elisabeth Rumetshofer, Sepp Hochreiter

Figure 1 for Contrastive Abstraction for Reinforcement Learning

Figure 2 for Contrastive Abstraction for Reinforcement Learning

Figure 3 for Contrastive Abstraction for Reinforcement Learning

Figure 4 for Contrastive Abstraction for Reinforcement Learning

Abstract:Learning agents with reinforcement learning is difficult when dealing with long trajectories that involve a large number of states. To address these learning problems effectively, the number of states can be reduced by abstract representations that cluster states. In principle, deep reinforcement learning can find abstract states, but end-to-end learning is unstable. We propose contrastive abstraction learning to find abstract states, where we assume that successive states in a trajectory belong to the same abstract state. Such abstract states may be basic locations, achieved subgoals, inventory, or health conditions. Contrastive abstraction learning first constructs clusters of state representations by contrastive learning and then applies modern Hopfield networks to determine the abstract states. The first phase of contrastive abstraction learning is self-supervised learning, where contrastive learning forces states with sequential proximity to have similar representations. The second phase uses modern Hopfield networks to map similar state representations to the same fixed point, i.e.\ to an abstract state. The level of abstraction can be adjusted by determining the number of fixed points of the modern Hopfield network. Furthermore, \textit{contrastive abstraction learning} does not require rewards and facilitates efficient reinforcement learning for a wide range of downstream tasks. Our experiments demonstrate the effectiveness of contrastive abstraction learning for reinforcement learning.

Via

Access Paper or Ask Questions

Simplified priors for Object-Centric Learning

Oct 01, 2024

Vihang Patil, Andreas Radler, Daniel Klotz, Sepp Hochreiter

Figure 1 for Simplified priors for Object-Centric Learning

Figure 2 for Simplified priors for Object-Centric Learning

Figure 3 for Simplified priors for Object-Centric Learning

Figure 4 for Simplified priors for Object-Centric Learning

Abstract:Humans excel at abstracting data and constructing \emph{reusable} concepts, a capability lacking in current continual learning systems. The field of object-centric learning addresses this by developing abstract representations, or slots, from data without human supervision. Different methods have been proposed to tackle this task for images, whereas most are overly complex, non-differentiable, or poorly scalable. In this paper, we introduce a conceptually simple, fully-differentiable, non-iterative, and scalable method called SAMP Simplified Slot Attention with Max Pool Priors). It is implementable using only Convolution and MaxPool layers and an Attention layer. Our method encodes the input image with a Convolutional Neural Network and then uses a branch of alternating Convolution and MaxPool layers to create specialized sub-networks and extract primitive slots. These primitive slots are then used as queries for a Simplified Slot Attention over the encoded image. Despite its simplicity, our method is competitive or outperforms previous methods on standard benchmarks.

Via

Access Paper or Ask Questions

Comparison Visual Instruction Tuning

Jun 13, 2024

Wei Lin, Muhammad Jehanzeb Mirza, Sivan Doveh, Rogerio Feris, Raja Giryes, Sepp Hochreiter, Leonid Karlinsky

Figure 1 for Comparison Visual Instruction Tuning

Figure 2 for Comparison Visual Instruction Tuning

Figure 3 for Comparison Visual Instruction Tuning

Figure 4 for Comparison Visual Instruction Tuning

Abstract:Comparing two images in terms of Commonalities and Differences (CaD) is a fundamental human capability that forms the basis of advanced visual reasoning and interpretation. It is essential for the generation of detailed and contextually relevant descriptions, performing comparative analysis, novelty detection, and making informed decisions based on visual data. However, surprisingly, little attention has been given to these fundamental concepts in the best current mimic of human visual intelligence - Large Multimodal Models (LMMs). We develop and contribute a new two-phase approach CaD-VI for collecting synthetic visual instructions, together with an instruction-following dataset CaD-Inst containing 349K image pairs with CaD instructions collected using CaD-VI. Our approach significantly improves the CaD spotting capabilities in LMMs, advancing the SOTA on a diverse set of related tasks by up to 17.5%. It is also complementary to existing difference-only instruction datasets, allowing automatic targeted refinement of those resources increasing their effectiveness for CaD tuning by up to 10%. Additionally, we propose an evaluation benchmark with 7.5K open-ended QAs to assess the CaD understanding abilities of LMMs.

* Project page: https://wlin-at.github.io/cad_vi ; Huggingface dataset repo: https://huggingface.co/datasets/wlin21at/CaD-Inst

Via

Access Paper or Ask Questions

Semantically Diverse Language Generation for Uncertainty Estimation in Language Models

Jun 06, 2024

Lukas Aichberger, Kajetan Schweighofer, Mykyta Ielanskyi, Sepp Hochreiter

Figure 1 for Semantically Diverse Language Generation for Uncertainty Estimation in Language Models

Figure 2 for Semantically Diverse Language Generation for Uncertainty Estimation in Language Models

Figure 3 for Semantically Diverse Language Generation for Uncertainty Estimation in Language Models

Figure 4 for Semantically Diverse Language Generation for Uncertainty Estimation in Language Models

Abstract:Large language models (LLMs) can suffer from hallucinations when generating text. These hallucinations impede various applications in society and industry by making LLMs untrustworthy. Current LLMs generate text in an autoregressive fashion by predicting and appending text tokens. When an LLM is uncertain about the semantic meaning of the next tokens to generate, it is likely to start hallucinating. Thus, it has been suggested that hallucinations stem from predictive uncertainty. We introduce Semantically Diverse Language Generation (SDLG) to quantify predictive uncertainty in LLMs. SDLG steers the LLM to generate semantically diverse yet likely alternatives for an initially generated text. This approach provides a precise measure of aleatoric semantic uncertainty, detecting whether the initial text is likely to be hallucinated. Experiments on question-answering tasks demonstrate that SDLG consistently outperforms existing methods while being the most computationally efficient, setting a new standard for uncertainty estimation in LLMs.

Via

Access Paper or Ask Questions

Vision-LSTM: xLSTM as Generic Vision Backbone

Jun 06, 2024

Benedikt Alkin, Maximilian Beck, Korbinian Pöppel, Sepp Hochreiter, Johannes Brandstetter

Abstract:Transformers are widely used as generic backbones in computer vision, despite initially introduced for natural language processing. Recently, the Long Short-Term Memory (LSTM) has been extended to a scalable and performant architecture - the xLSTM - which overcomes long-standing LSTM limitations via exponential gating and parallelizable matrix memory structure. In this report, we introduce Vision-LSTM (ViL), an adaption of the xLSTM building blocks to computer vision. ViL comprises a stack of xLSTM blocks where odd blocks process the sequence of patch tokens from top to bottom while even blocks go from bottom to top. Experiments show that ViL holds promise to be further deployed as new generic backbone for computer vision architectures.

Via

Access Paper or Ask Questions

A Diffusion Model Framework for Unsupervised Neural Combinatorial Optimization

Jun 03, 2024

Sebastian Sanokowski, Sepp Hochreiter, Sebastian Lehner

Figure 1 for A Diffusion Model Framework for Unsupervised Neural Combinatorial Optimization

Figure 2 for A Diffusion Model Framework for Unsupervised Neural Combinatorial Optimization

Figure 3 for A Diffusion Model Framework for Unsupervised Neural Combinatorial Optimization

Figure 4 for A Diffusion Model Framework for Unsupervised Neural Combinatorial Optimization

Abstract:Learning to sample from intractable distributions over discrete sets without relying on corresponding training data is a central problem in a wide range of fields, including Combinatorial Optimization. Currently, popular deep learning-based approaches rely primarily on generative models that yield exact sample likelihoods. This work introduces a method that lifts this restriction and opens the possibility to employ highly expressive latent variable models like diffusion models. Our approach is conceptually based on a loss that upper bounds the reverse Kullback-Leibler divergence and evades the requirement of exact sample likelihoods. We experimentally validate our approach in data-free Combinatorial Optimization and demonstrate that our method achieves a new state-of-the-art on a wide range of benchmark problems.

* Accepted at ICML 2024

Via

Access Paper or Ask Questions

Large Language Models Can Self-Improve At Web Agent Tasks

May 30, 2024

Ajay Patel, Markus Hofmarcher, Claudiu Leoveanu-Condrei, Marius-Constantin Dinu, Chris Callison-Burch, Sepp Hochreiter

Figure 1 for Large Language Models Can Self-Improve At Web Agent Tasks

Figure 2 for Large Language Models Can Self-Improve At Web Agent Tasks

Figure 3 for Large Language Models Can Self-Improve At Web Agent Tasks

Figure 4 for Large Language Models Can Self-Improve At Web Agent Tasks

Abstract:Training models to act as agents that can effectively navigate and perform actions in a complex environment, such as a web browser, has typically been challenging due to lack of training data. Large language models (LLMs) have recently demonstrated some capability to navigate novel environments as agents in a zero-shot or few-shot fashion, purely guided by natural language instructions as prompts. Recent research has also demonstrated LLMs have the capability to exceed their base performance through self-improvement, i.e. fine-tuning on data generated by the model itself. In this work, we explore the extent to which LLMs can self-improve their performance as agents in long-horizon tasks in a complex environment using the WebArena benchmark. In WebArena, an agent must autonomously navigate and perform actions on web pages to achieve a specified objective. We explore fine-tuning on three distinct synthetic training data mixtures and achieve a 31\% improvement in task completion rate over the base model on the WebArena benchmark through a self-improvement procedure. We additionally contribute novel evaluation metrics for assessing the performance, robustness, capabilities, and quality of trajectories of our fine-tuned agent models to a greater degree than simple, aggregate-level benchmark scores currently used to measure self-improvement.

Via

Access Paper or Ask Questions

Energy-based Hopfield Boosting for Out-of-Distribution Detection

May 14, 2024

Claus Hofmann, Simon Schmid, Bernhard Lehner, Daniel Klotz, Sepp Hochreiter

Figure 1 for Energy-based Hopfield Boosting for Out-of-Distribution Detection

Figure 2 for Energy-based Hopfield Boosting for Out-of-Distribution Detection

Figure 3 for Energy-based Hopfield Boosting for Out-of-Distribution Detection

Figure 4 for Energy-based Hopfield Boosting for Out-of-Distribution Detection

Abstract:Out-of-distribution (OOD) detection is critical when deploying machine learning models in the real world. Outlier exposure methods, which incorporate auxiliary outlier data in the training process, can drastically improve OOD detection performance compared to approaches without advanced training strategies. We introduce Hopfield Boosting, a boosting approach, which leverages modern Hopfield energy (MHE) to sharpen the decision boundary between the in-distribution and OOD data. Hopfield Boosting encourages the model to concentrate on hard-to-distinguish auxiliary outlier examples that lie close to the decision boundary between in-distribution and auxiliary outlier data. Our method achieves a new state-of-the-art in OOD detection with outlier exposure, improving the FPR95 metric from 2.28 to 0.92 on CIFAR-10 and from 11.76 to 7.94 on CIFAR-100.

Via

Access Paper or Ask Questions

xLSTM: Extended Long Short-Term Memory

May 07, 2024

Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, Sepp Hochreiter

Figure 1 for xLSTM: Extended Long Short-Term Memory

Figure 2 for xLSTM: Extended Long Short-Term Memory

Figure 3 for xLSTM: Extended Long Short-Term Memory

Figure 4 for xLSTM: Extended Long Short-Term Memory

Abstract:In the 1990s, the constant error carousel and gating were introduced as the central ideas of the Long Short-Term Memory (LSTM). Since then, LSTMs have stood the test of time and contributed to numerous deep learning success stories, in particular they constituted the first Large Language Models (LLMs). However, the advent of the Transformer technology with parallelizable self-attention at its core marked the dawn of a new era, outpacing LSTMs at scale. We now raise a simple question: How far do we get in language modeling when scaling LSTMs to billions of parameters, leveraging the latest techniques from modern LLMs, but mitigating known limitations of LSTMs? Firstly, we introduce exponential gating with appropriate normalization and stabilization techniques. Secondly, we modify the LSTM memory structure, obtaining: (i) sLSTM with a scalar memory, a scalar update, and new memory mixing, (ii) mLSTM that is fully parallelizable with a matrix memory and a covariance update rule. Integrating these LSTM extensions into residual block backbones yields xLSTM blocks that are then residually stacked into xLSTM architectures. Exponential gating and modified memory structures boost xLSTM capabilities to perform favorably when compared to state-of-the-art Transformers and State Space Models, both in performance and scaling.

Via

Access Paper or Ask Questions