Abstract: Vision Transformers (ViTs) have become a dominant architecture in computer vision, yet their prediction process remains difficult to interpret because information is propagated through complex interactions across layers and attention heads. Existing attention-based explanation methods provide an intuitive way to trace information flow. However, they rely mainly on raw attention weights, which do not explicitly reflect the final decision and often lead to explanations with limited class discriminability. In contrast, gradient-based localization methods are more effective at highlighting class-specific evidence, but they do not fully exploit the hierarchical attention propagation mechanism of transformers. To address these limitations, we propose Decision-Aware Attention Propagation (DAP), an attribution method that injects decision-relevant priors into transformer attention propagation. By estimating token importance through gradient-based localization and integrating it into layer-wise attention rollout, the method captures both the structural flow of attention and the evidence most relevant to the final prediction. Consequently, DAP produces attribution maps that are more class-sensitive, compact, and faithful than those generated by conventional attention-based methods. Extensive experiments across Vision Transformer variants of different model scales show that DAP consistently outperforms existing baselines in both quantitative metrics and qualitative visualizations, indicating that decision-aware propagation is an effective direction for improving ViT interpretability.
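The abstract describes DAP only at a high level, so the following is a minimal, hypothetical sketch of how a gradient-derived token prior could be injected into layer-wise attention rollout. The function name `dap_rollout`, the hook-captured attention/gradient tensors, and the Grad-CAM-style positive gradient-times-attention prior are all assumptions for illustration, not the authors' released implementation. The per-layer attention maps and their gradients would be collected with forward/backward hooks on each block's attention softmax while back-propagating the target class logit.

```python
import torch

def dap_rollout(attentions, attn_grads, eps=1e-8):
    """Illustrative sketch of decision-aware attention propagation.

    attentions, attn_grads: lists with one (heads, tokens, tokens) tensor per
    layer for a single image, with gradients taken w.r.t. the target class
    logit. Assumes the [CLS] token sits at index 0.
    """
    num_tokens = attentions[0].size(-1)
    rollout = torch.eye(num_tokens)
    for attn, grad in zip(attentions, attn_grads):
        fused = attn.mean(dim=0)  # head-averaged attention, as in plain rollout
        # Decision-relevant token prior: positive gradient x attention,
        # pooled over heads and query positions (a Grad-CAM-style estimate).
        prior = (grad * attn).clamp(min=0).mean(dim=(0, 1))
        prior = prior / (prior.sum() + eps)
        # Inject the prior by reweighting attention toward decision-relevant keys.
        fused = fused * prior[None, :]
        fused = fused + torch.eye(num_tokens)  # keep the residual path
        fused = fused / fused.sum(dim=-1, keepdim=True)
        rollout = fused @ rollout
    # Relevance of each patch token to the [CLS] query.
    return rollout[0, 1:]
```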




Abstract: The Vision Transformer (ViT) has made significant advances in computer vision, using self-attention to achieve state-of-the-art performance across tasks such as image classification, object detection, and segmentation. Its architectural flexibility and strong representational capacity have made it a preferred choice among researchers and practitioners. However, the intricate multi-head attention mechanism of ViT poses significant challenges to interpretability, as the underlying prediction process remains opaque. A critical limitation stems from an observation commonly made about transformer architectures: not all attention heads are equally meaningful. Existing interpretability methods such as attention rollout overlook the relative importance of individual heads, treating every head as equally informative. To address these challenges, we introduce Gradient-Driven Multi-Head Attention Rollout (GMAR), a novel method that quantifies the importance of each attention head using gradient-based scores. These scores are normalized to derive a weighted aggregate attention score, effectively capturing the relative contributions of individual heads. GMAR clarifies the role of each head in the prediction process, enabling more precise interpretability at the head level. Experimental results demonstrate that GMAR consistently outperforms traditional attention rollout techniques. This work provides a practical contribution to transformer-based architectures, establishing a robust framework for enhancing the interpretability of Vision Transformer models.
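Since the abstract does not fix implementation details, here is a minimal sketch of the gradient-weighted head aggregation it describes, assuming PyTorch and per-layer attention maps plus their gradients captured with hooks during a backward pass on the target class logit. The name `gmar_rollout` and the mean-absolute-gradient head score are illustrative choices, one plausible reading of "gradient-based scores" rather than the paper's exact formulation.

```python
import torch

def gmar_rollout(attentions, attn_grads, residual=True):
    """Gradient-weighted head fusion followed by attention rollout.

    attentions, attn_grads: lists with one (heads, tokens, tokens) tensor per
    layer, captured for a single image with forward/backward hooks on the
    attention softmax. Assumes the [CLS] token sits at index 0.
    """
    num_tokens = attentions[0].size(-1)
    rollout = torch.eye(num_tokens)
    for attn, grad in zip(attentions, attn_grads):
        # Head importance: mean absolute gradient per head, normalized to sum to 1.
        head_w = grad.abs().mean(dim=(1, 2))
        head_w = head_w / head_w.sum()
        # Weighted aggregate attention in place of the plain head average.
        fused = (head_w[:, None, None] * attn).sum(dim=0)
        if residual:
            # Account for skip connections, as in standard attention rollout.
            fused = 0.5 * fused + 0.5 * torch.eye(num_tokens)
        fused = fused / fused.sum(dim=-1, keepdim=True)
        rollout = fused @ rollout
    # Relevance of each patch token to the [CLS] query.
    return rollout[0, 1:]
```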