Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yannick Metz

Where Do Models Find Happiness? Emotion Vectors in Open-Source LLMs

Jun 25, 2026

Sinie van der Ben, Raphaël Baur, Yannick Metz, Mennatallah El-Assady

Abstract:Recent work identified emotion vectors in Claude Sonnet 4.5, which are internal representations that encode emotion concepts, causally influence behavior, and exhibit geometry mirroring human psychological structure. We test the generality of these findings in two open-weight models, Apertus-8B-Instruct-2509 and Gemma-4-E4B-it, extracting emotion contrast vectors across all layers, using two model-generated corpora. We recover valence geometry for both models, with peak PC1--valence correlations of $r = 0.76$ and $r = 0.83$, approaching the $r = 0.81$ reported for Claude.Beyond replication, we observe notable differences in how valence representations emerge across model depth. In Gemma-4-E4B-it, valence is strongly encoded in early layers but collapses towards later layers, whereas Apertus-8B-Instruct-2509 exhibits the opposite pattern, with valence representations absent in early layers, but emerging at mid depths. Arousal encoding, in contrast, is sensitive to the extraction corpus: both models show stronger PC2--arousal alignment with Gemma-generated stories ($r$ up to $0.45$) than Apertus-generated ones ($r \leq 0.21$), suggesting arousal-relevant cues are unevenly distributed across generated corpora. We open-source our experiment code and dataset for reproducible investigation of emotion representations across language model architectures.

Via

Access Paper or Ask Questions

MAVRL: Learning Reward Functions from Multiple Feedback Types with Amortized Variational Inference

Feb 16, 2026

Raphaël Baur, Yannick Metz, Maria Gkoulta, Mennatallah El-Assady, Giorgia Ramponi, Thomas Kleine Buening

Abstract:Reward learning typically relies on a single feedback type or combines multiple feedback types using manually weighted loss terms. Currently, it remains unclear how to jointly learn reward functions from heterogeneous feedback types such as demonstrations, comparisons, ratings, and stops that provide qualitatively different signals. We address this challenge by formulating reward learning from multiple feedback types as Bayesian inference over a shared latent reward function, where each feedback type contributes information through an explicit likelihood. We introduce a scalable amortized variational inference approach that learns a shared reward encoder and feedback-specific likelihood decoders and is trained by optimizing a single evidence lower bound. Our approach avoids reducing feedback to a common intermediate representation and eliminates the need for manual loss balancing. Across discrete and continuous-control benchmarks, we show that jointly inferred reward posteriors outperform single-type baselines, exploit complementary information across feedback types, and yield policies that are more robust to environment perturbations. The inferred reward uncertainty further provides interpretable signals for analyzing model confidence and consistency across feedback types.

* 25 pages, 7 figures

Via

Access Paper or Ask Questions

ResponseRank: Data-Efficient Reward Modeling through Preference Strength Learning

Dec 31, 2025

Timo Kaufmann, Yannick Metz, Daniel Keim, Eyke Hüllermeier

Abstract:Binary choices, as often used for reinforcement learning from human feedback (RLHF), convey only the direction of a preference. A person may choose apples over oranges and bananas over grapes, but which preference is stronger? Strength is crucial for decision-making under uncertainty and generalization of preference models, but hard to measure reliably. Metadata such as response times and inter-annotator agreement can serve as proxies for strength, but are often noisy and confounded. We propose ResponseRank to address the challenge of learning from noisy strength signals. Our method uses relative differences in proxy signals to rank responses to pairwise comparisons by their inferred preference strength. To control for systemic variation, we compare signals only locally within carefully constructed strata. This enables robust learning of utility differences consistent with strength-derived rankings while making minimal assumptions about the strength signal. Our contributions are threefold: (1) ResponseRank, a novel method that robustly learns preference strength by leveraging locally valid relative strength signals; (2) empirical evidence of improved sample efficiency and robustness across diverse tasks: synthetic preference learning (with simulated response times), language modeling (with annotator agreement), and RL control tasks (with simulated episode returns); and (3) the Pearson Distance Correlation (PDC), a novel metric that isolates cardinal utility learning from ordinal accuracy.

* NeurIPS 2025

Via

Access Paper or Ask Questions

Evaluating Autoencoders for Parametric and Invertible Multidimensional Projections

Apr 23, 2025

Frederik L. Dennig, Nina Geyer, Daniela Blumberg, Yannick Metz, Daniel A. Keim

Figure 1 for Evaluating Autoencoders for Parametric and Invertible Multidimensional Projections

Figure 2 for Evaluating Autoencoders for Parametric and Invertible Multidimensional Projections

Figure 3 for Evaluating Autoencoders for Parametric and Invertible Multidimensional Projections

Figure 4 for Evaluating Autoencoders for Parametric and Invertible Multidimensional Projections

Abstract:Recently, neural networks have gained attention for creating parametric and invertible multidimensional data projections. Parametric projections allow for embedding previously unseen data without recomputing the projection as a whole, while invertible projections enable the generation of new data points. However, these properties have never been explored simultaneously for arbitrary projection methods. We evaluate three autoencoder (AE) architectures for creating parametric and invertible projections. Based on a given projection, we train AEs to learn a mapping into 2D space and an inverse mapping into the original space. We perform a quantitative and qualitative comparison on four datasets of varying dimensionality and pattern complexity using t-SNE. Our results indicate that AEs with a customized loss function can create smoother parametric and inverse projections than feed-forward neural networks while giving users control over the strength of the smoothing effect.

* 12 pages, 7 figures, 2 tables, LaTeX; to appear at the 16th International EuroVis Workshop on Visual Analytics (EuroVA'25)

Via

Access Paper or Ask Questions

Reward Learning from Multiple Feedback Types

Feb 28, 2025

Yannick Metz, András Geiszl, Raphaël Baur, Mennatallah El-Assady

Abstract:Learning rewards from preference feedback has become an important tool in the alignment of agentic models. Preference-based feedback, often implemented as a binary comparison between multiple completions, is an established method to acquire large-scale human feedback. However, human feedback in other contexts is often much more diverse. Such diverse feedback can better support the goals of a human annotator, and the simultaneous use of multiple sources might be mutually informative for the learning process or carry type-dependent biases for the reward learning process. Despite these potential benefits, learning from different feedback types has yet to be explored extensively. In this paper, we bridge this gap by enabling experimentation and evaluating multi-type feedback in a broad set of environments. We present a process to generate high-quality simulated feedback of six different types. Then, we implement reward models and downstream RL training for all six feedback types. Based on the simulated feedback, we investigate the use of types of feedback across ten RL environments and compare them to pure preference-based baselines. We show empirically that diverse types of feedback can be utilized and lead to strong reward modeling performance. This work is the first strong indicator of the potential of multi-type feedback for RLHF.

* Published as a conference paper at ICLR 2025

Via

Access Paper or Ask Questions

Leveraging Color Channel Independence for Improved Unsupervised Object Detection

Dec 19, 2024

Bastian Jäckl, Yannick Metz, Udo Schlegel, Daniel A. Keim, Maximilian T. Fischer

Figure 1 for Leveraging Color Channel Independence for Improved Unsupervised Object Detection

Figure 2 for Leveraging Color Channel Independence for Improved Unsupervised Object Detection

Figure 3 for Leveraging Color Channel Independence for Improved Unsupervised Object Detection

Figure 4 for Leveraging Color Channel Independence for Improved Unsupervised Object Detection

Abstract:Object-centric architectures can learn to extract distinct object representations from visual scenes, enabling downstream applications on the object level. Similarly to autoencoder-based image models, object-centric approaches have been trained on the unsupervised reconstruction loss of images encoded by RGB color spaces. In our work, we challenge the common assumption that RGB images are the optimal color space for unsupervised learning in computer vision. We discuss conceptually and empirically that other color spaces, such as HSV, bear essential characteristics for object-centric representation learning, like robustness to lighting conditions. We further show that models improve when requiring them to predict additional color channels. Specifically, we propose to transform the predicted targets to the RGB-S space, which extends RGB with HSV's saturation component and leads to markedly better reconstruction and disentanglement for five common evaluation datasets. The use of composite color spaces can be implemented with basically no computational overhead, is agnostic of the models' architecture, and is universally applicable across a wide range of visual computing tasks and training types. The findings of our approach encourage additional investigations in computer vision tasks beyond object-centric learning.

* 38 pages incl. references, 16 figures

Via

Access Paper or Ask Questions

Mapping out the Space of Human Feedback for Reinforcement Learning: A Conceptual Framework

Nov 18, 2024

Yannick Metz, David Lindner, Raphaël Baur, Mennatallah El-Assady

Figure 1 for Mapping out the Space of Human Feedback for Reinforcement Learning: A Conceptual Framework

Figure 2 for Mapping out the Space of Human Feedback for Reinforcement Learning: A Conceptual Framework

Figure 3 for Mapping out the Space of Human Feedback for Reinforcement Learning: A Conceptual Framework

Figure 4 for Mapping out the Space of Human Feedback for Reinforcement Learning: A Conceptual Framework

Abstract:Reinforcement Learning from Human feedback (RLHF) has become a powerful tool to fine-tune or train agentic machine learning models. Similar to how humans interact in social contexts, we can use many types of feedback to communicate our preferences, intentions, and knowledge to an RL agent. However, applications of human feedback in RL are often limited in scope and disregard human factors. In this work, we bridge the gap between machine learning and human-computer interaction efforts by developing a shared understanding of human feedback in interactive learning scenarios. We first introduce a taxonomy of feedback types for reward-based learning from human feedback based on nine key dimensions. Our taxonomy allows for unifying human-centered, interface-centered, and model-centered aspects. In addition, we identify seven quality metrics of human feedback influencing both the human ability to express feedback and the agent's ability to learn from the feedback. Based on the feedback taxonomy and quality criteria, we derive requirements and design choices for systems learning from human feedback. We relate these requirements and design choices to existing work in interactive machine learning. In the process, we identify gaps in existing work and future research opportunities. We call for interdisciplinary collaboration to harness the full potential of reinforcement learning with data-driven co-adaptive modeling and varied interaction mechanics.

Via

Access Paper or Ask Questions

RLHF-Blender: A Configurable Interactive Interface for Learning from Diverse Human Feedback

Aug 08, 2023

Yannick Metz, David Lindner, Raphaël Baur, Daniel Keim, Mennatallah El-Assady

Figure 1 for RLHF-Blender: A Configurable Interactive Interface for Learning from Diverse Human Feedback

Figure 2 for RLHF-Blender: A Configurable Interactive Interface for Learning from Diverse Human Feedback

Abstract:To use reinforcement learning from human feedback (RLHF) in practical applications, it is crucial to learn reward models from diverse sources of human feedback and to consider human factors involved in providing feedback of different types. However, the systematic study of learning from diverse types of feedback is held back by limited standardized tooling available to researchers. To bridge this gap, we propose RLHF-Blender, a configurable, interactive interface for learning from human feedback. RLHF-Blender provides a modular experimentation framework and implementation that enables researchers to systematically investigate the properties and qualities of human feedback for reward learning. The system facilitates the exploration of various feedback types, including demonstrations, rankings, comparisons, and natural language instructions, as well as studies considering the impact of human factors on their effectiveness. We discuss a set of concrete research opportunities enabled by RLHF-Blender. More information is available at https://rlhfblender.info/.

* ICML2023 Interactive Learning from Implicit Human Feedback Workshop
* 14 pages, 3 figures

Via

Access Paper or Ask Questions

How to Enable Uncertainty Estimation in Proximal Policy Optimization

Oct 07, 2022

Eugene Bykovets, Yannick Metz, Mennatallah El-Assady, Daniel A. Keim, Joachim M. Buhmann

Figure 1 for How to Enable Uncertainty Estimation in Proximal Policy Optimization

Figure 2 for How to Enable Uncertainty Estimation in Proximal Policy Optimization

Figure 3 for How to Enable Uncertainty Estimation in Proximal Policy Optimization

Figure 4 for How to Enable Uncertainty Estimation in Proximal Policy Optimization

Abstract:While deep reinforcement learning (RL) agents have showcased strong results across many domains, a major concern is their inherent opaqueness and the safety of such systems in real-world use cases. To overcome these issues, we need agents that can quantify their uncertainty and detect out-of-distribution (OOD) states. Existing uncertainty estimation techniques, like Monte-Carlo Dropout or Deep Ensembles, have not seen widespread adoption in on-policy deep RL. We posit that this is due to two reasons: concepts like uncertainty and OOD states are not well defined compared to supervised learning, especially for on-policy RL methods. Secondly, available implementations and comparative studies for uncertainty estimation methods in RL have been limited. To overcome the first gap, we propose definitions of uncertainty and OOD for Actor-Critic RL algorithms, namely, proximal policy optimization (PPO), and present possible applicable measures. In particular, we discuss the concepts of value and policy uncertainty. The second point is addressed by implementing different uncertainty estimation methods and comparing them across a number of environments. The OOD detection performance is evaluated via a custom evaluation benchmark of in-distribution (ID) and OOD states for various RL environments. We identify a trade-off between reward and OOD detection performance. To overcome this, we formulate a Pareto optimization problem in which we simultaneously optimize for reward and OOD detection performance. We show experimentally that the recently proposed method of Masksembles strikes a favourable balance among the survey methods, enabling high-quality uncertainty estimation and OOD detection while matching the performance of original RL agents.

Via

Access Paper or Ask Questions

BARReL: Bottleneck Attention for Adversarial Robustness in Vision-Based Reinforcement Learning

Aug 22, 2022

Eugene Bykovets, Yannick Metz, Mennatallah El-Assady, Daniel A. Keim, Joachim M. Buhmann

Figure 1 for BARReL: Bottleneck Attention for Adversarial Robustness in Vision-Based Reinforcement Learning

Figure 2 for BARReL: Bottleneck Attention for Adversarial Robustness in Vision-Based Reinforcement Learning

Figure 3 for BARReL: Bottleneck Attention for Adversarial Robustness in Vision-Based Reinforcement Learning

Figure 4 for BARReL: Bottleneck Attention for Adversarial Robustness in Vision-Based Reinforcement Learning

Abstract:Robustness to adversarial perturbations has been explored in many areas of computer vision. This robustness is particularly relevant in vision-based reinforcement learning, as the actions of autonomous agents might be safety-critic or impactful in the real world. We investigate the susceptibility of vision-based reinforcement learning agents to gradient-based adversarial attacks and evaluate a potential defense. We observe that Bottleneck Attention Modules (BAM) included in CNN architectures can act as potential tools to increase robustness against adversarial attacks. We show how learned attention maps can be used to recover activations of a convolutional layer by restricting the spatial activations to salient regions. Across a number of RL environments, BAM-enhanced architectures show increased robustness during inference. Finally, we discuss potential future research directions.

* 5 pages, 2 figures, 3 tables

Via

Access Paper or Ask Questions