Get our free extension to see links to code for papers anywhere online!

Chrome logo  Add to Chrome

Firefox logo Add to Firefox

HRViT: Multi-Scale High-Resolution Vision Transformer

Nov 01, 2021
Jiaqi Gu, Hyoukjun Kwon, Dilin Wang, Wei Ye, Meng Li, Yu-Hsin Chen, Liangzhen Lai, Vikas Chandra, David Z. Pan

Share this with someone who'll enjoy it:

Vision transformers (ViTs) have attracted much attention for their superior performance on computer vision tasks. To address their limitations of single-scale low-resolution representations, prior work adapts ViTs to high-resolution dense prediction tasks with hierarchical architectures to generate pyramid features. However, multi-scale representation learning is still under-explored on ViTs, given their classification-like sequential topology. To enhance ViTs with more capability to learn semantically-rich and spatially-precise multi-scale representations, in this work, we present an efficient integration of high-resolution multi-branch architectures with vision transformers, dubbed HRViT, pushing the Pareto front of dense prediction tasks to a new level. We explore heterogeneous branch design, reduce the redundancy in linear layers, and augment the model nonlinearity to balance the model performance and hardware efficiency. The proposed HRViT achieves 50.20% mIoU on ADE20K and 83.16% mIoU on Cityscapes for semantic segmentation tasks, surpassing state-of-the-art MiT and CSWin with an average of +1.78 mIoU improvement, 28% parameter reduction, and 21% FLOPs reduction, demonstrating the potential of HRViT as strong vision backbones.

* 8 pages 

   Access Paper Source

Share this with someone who'll enjoy it: