Krispin Wandel

ViT-Up: Faithful Feature Upsampling for Vision Transformers

Jun 12, 2026

Krispin Wandel, Jingchuan Wang, Hesheng Wang

Abstract: Vision Transformers (ViTs) have become a dominant architecture for visual representation learning, providing exceptionally strong and broadly reusable backbone features. However, ViTs are commonly operated on relatively small patch-token grids due to the quadratic cost of global self-attention, which creates a persistent bottleneck for dense prediction tasks such as semantic segmentation and depth estimation. This has motivated the development of task-agnostic feature upsamplers. While recent state-of-the-art methods produce visually sharp dense representations, their reliance on shallow image encoders for guided upsampling can introduce feature leakage, fragmentation, and blur. We introduce ViT-Up, an implicit feature upsampling framework that replaces external image guidance with layer-wise query construction from intermediate ViT hidden states. This enables feature prediction at arbitrary continuous image coordinates while preserving alignment with the backbone feature space. Experiments demonstrate that ViT-Up consistently outperforms state-of-the-art image-guided upsamplers across dense prediction and semantic correspondence. On DINOv3-S+, ViT-Up improves over prior methods by up to +2.07 mIoU on Cityscapes and +4.17 PCK@0.10 on SPair-71k. With the larger DINOv3-B backbone, these gains increase to +3.36 mIoU and +8.09 PCK@0.10, demonstrating that ViT-Up scales favorably with backbone capacity.
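For readers unfamiliar with the PCK@0.10 figures quoted above, the following is a minimal sketch of how a percentage-of-correct-keypoints score is commonly computed. It is not taken from the paper or its repository: the function name `pck`, its parameters, and the toy keypoints are hypothetical, and the threshold convention (alpha times the longer side of the object bounding box) is an assumption about the usual SPair-71k protocol.

```python
import math


def pck(pred, gt, bbox_hw, alpha=0.1):
    """Fraction of predicted keypoints within alpha * max(h, w) of ground truth.

    pred, gt : lists of (x, y) keypoints in pixels, in matching order.
    bbox_hw  : (height, width) of the object bounding box (assumed convention).
    alpha    : tolerance as a fraction of the longer bounding-box side.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must contain the same number of keypoints")
    if not pred:
        return 0.0
    threshold = alpha * max(bbox_hw)
    correct = sum(
        1
        for (px, py), (gx, gy) in zip(pred, gt)
        if math.hypot(px - gx, py - gy) <= threshold
    )
    return correct / len(pred)


# Toy example (hypothetical values): two of three keypoints fall within
# the 10-pixel tolerance implied by a 100 x 80 bounding box.
pred = [(10.0, 10.0), (50.0, 52.0), (90.0, 40.0)]
gt = [(12.0, 11.0), (50.0, 50.0), (60.0, 40.0)]
score = pck(pred, gt, bbox_hw=(100, 80), alpha=0.1)
print(f"PCK@0.10 = {score:.3f}")
```

Under this reading, a gain of +4.17 PCK@0.10 would mean roughly four more keypoints out of every hundred landing within the tolerance radius.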

* Code is available at: https://github.com/krispinwandel/vit-up

SemAlign3D: Semantic Correspondence between RGB-Images through Aligning 3D Object-Class Representations

Mar 28, 2025

Krispin Wandel, Hesheng Wang

Abstract: Semantic correspondence has made tremendous progress through recent advances in large vision models (LVMs). While these LVMs have been shown to reliably capture local semantics, the same cannot currently be said for capturing global geometric relationships between semantic object regions. This problem leads to unreliable performance for semantic correspondence between images with extreme view variation. In this work, we aim to leverage monocular depth estimates to capture these geometric relationships for more robust and data-efficient semantic correspondence. First, we introduce a simple but effective method to build 3D object-class representations from monocular depth estimates and LVM features using a sparsely annotated image correspondence dataset. Second, we formulate an alignment energy that can be minimized using gradient descent to obtain an alignment between the 3D object-class representation and the object-class instance in the input RGB image. Our method achieves state-of-the-art matching accuracy in multiple categories on the challenging SPair-71k dataset, increasing the PCK@0.1 score by more than 10 points on three categories and overall by 3.3 points, from 85.6% to 88.9%. Additional resources and code are available at https://dub.sh/semalign3d.

* Accepted to CVPR 2025. Poster: https://cvpr.thecvf.com/virtual/2025/poster/32799
