Anand

GNN-MoE: Context-Aware Patch Routing using GNNs for Parameter-Efficient Domain Generalization

Nov 06, 2025

Mahmoud Soliman, Omar Abdelaziz, Ahmed Radwan, Anand, Mohamed Shehata

Abstract: Domain generalization (DG) seeks robust Vision Transformer (ViT) performance on unseen domains. Efficiently adapting pretrained ViTs for DG is challenging; standard fine-tuning is costly and can impair generalization. We propose GNN-MoE, enhancing Parameter-Efficient Fine-Tuning (PEFT) for DG with a Mixture-of-Experts (MoE) framework using efficient Kronecker adapters. Instead of token-based routing, a novel Graph Neural Network (GNN) router (GCN, GAT, SAGE) operates on inter-patch graphs to dynamically assign patches to specialized experts. This context-aware GNN routing leverages inter-patch relationships for better adaptation to domain shifts. GNN-MoE achieves state-of-the-art or competitive DG benchmark performance with high parameter efficiency, highlighting the utility of graph-based contextual routing for robust, lightweight DG.

* 6 pages, 3 figures
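
The routing idea described in the abstract above (a GNN operating on an inter-patch graph to assign patches to experts) can be illustrated with a minimal sketch. This is not the authors' implementation: the 4-neighbour grid graph, the single GCN-style layer, the top-1 assignment, and all function names (`grid_adjacency`, `normalize`, `gcn_route`) are assumptions made for illustration only.

```python
import math

def grid_adjacency(h, w):
    """Assumed graph: 4-neighbour adjacency with self-loops on an h x w patch grid."""
    n = h * w
    adj = [[0.0] * n for _ in range(n)]
    for r in range(h):
        for c in range(w):
            i = r * w + c
            adj[i][i] = 1.0  # self-loop so a patch keeps its own features
            for dr, dc in ((1, 0), (0, 1)):
                rr, cc = r + dr, c + dc
                if rr < h and cc < w:
                    j = rr * w + cc
                    adj[i][j] = adj[j][i] = 1.0
    return adj

def normalize(adj):
    """Symmetric normalization D^{-1/2} A D^{-1/2}, as used in GCN layers."""
    deg = [sum(row) for row in adj]
    n = len(adj)
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Plain matrix product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def gcn_route(patches, adj_norm, weight):
    """One GCN-style layer: aggregate neighbour features, project to expert
    logits, and pick a top-1 expert per patch (top-1 is an assumption)."""
    logits = matmul(matmul(adj_norm, patches), weight)
    probs = [softmax(row) for row in logits]
    experts = [max(range(len(p)), key=p.__getitem__) for p in probs]
    return probs, experts

# Toy usage: a 2 x 2 patch grid, 2-dim patch features, 2 hypothetical experts.
patches = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]
probs, experts = gcn_route(patches, normalize(grid_adjacency(2, 2)), weight)
print(experts)
```

Because each patch's routing logits depend on its neighbours' features as well as its own, two patches with identical content can be sent to different experts when their surroundings differ, which is the sense in which graph-based routing is context-aware compared with per-token routing.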

Mitigating Attention Sinks and Massive Activations in Audio-Visual Speech Recognition with LLMs

Oct 26, 2025

Anand, Umberto Cappellazzo, Stavros Petridis, Maja Pantic

Abstract: Large language models (LLMs) have recently advanced auditory speech recognition (ASR), visual speech recognition (VSR), and audio-visual speech recognition (AVSR). However, understanding of their internal dynamics under fine-tuning remains limited. In natural language processing, recent work has revealed attention sinks, tokens that attract disproportionately high attention, and associated massive activations in which some features of sink tokens exhibit huge activation in LLMs. In this work, we are the first to study these phenomena in multimodal speech recognition. Through a detailed analysis of audio-visual LLMs, we identify attention sinks and massive activations not only at the BOS token but also at intermediate low-semantic tokens across ASR, VSR, and AVSR. We show that massive activations originate in the MLP layers and correspond to fixed feature indices across all sink tokens. We further show that intermediate sink tokens exhibit high cosine similarity to the BOS token, thereby amplifying attention and activation. Building on these insights, we introduce a simple decorrelation loss that reduces cosine similarity between BOS and other tokens, effectively mitigating intermediate sinks and massive activations. Furthermore, our method improves word error rate (WER) under high audio-visual feature downsampling while remaining stable at lower downsampling rates.

* The code is available at https://github.com/umbertocappellazzo/Llama-AVSR
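
The decorrelation loss mentioned in the abstract above penalizes cosine similarity between the BOS token and the other tokens. The abstract does not give its exact form, so the following is only a sketch under stated assumptions: the penalty is taken to be the mean squared cosine similarity between the BOS hidden state and every other token's hidden state, and the names `cosine` and `decorrelation_loss` are hypothetical, not taken from the linked repository.

```python
import math

def cosine(u, v, eps=1e-12):
    """Cosine similarity between two vectors; eps guards against zero norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)

def decorrelation_loss(hidden):
    """Assumed form: mean squared cosine similarity between the BOS hidden
    state (hidden[0]) and each remaining token's hidden state."""
    bos, rest = hidden[0], hidden[1:]
    if not rest:
        return 0.0
    return sum(cosine(bos, h) ** 2 for h in rest) / len(rest)

# Toy usage: one token orthogonal to BOS, one token aligned with BOS.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
print(decorrelation_loss(hidden))
```

In a training setup, a term like this would presumably be added to the recognition objective with a small weight, so that tokens drifting toward the BOS direction are pushed away from it; the weighting and the layer at which hidden states are taken are not specified in the abstract.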
