Shicheng Yin

DART: Differentiable Dynamic Adaptive Region Tokenizer for Vision Transformer and Mamba

Jun 12, 2025

Shicheng Yin, Kaixuan Yin, Yang Liu, Weixing Chen, Liang Lin

Abstract: Recently, non-convolutional models such as the Vision Transformer (ViT) and Vision Mamba (Vim) have achieved remarkable performance in computer vision tasks. However, their reliance on fixed-size patches often results in excessive encoding of background regions and omission of critical local details, especially when informative objects are sparsely distributed. To address this, we introduce a fully differentiable Dynamic Adaptive Region Tokenizer (DART), which adaptively partitions images into content-dependent patches of varying sizes. DART combines learnable region scores with piecewise differentiable quantile operations to allocate denser tokens to information-rich areas. Despite introducing only approximately 1 million (1M) additional parameters, DART improves accuracy by 2.1% on DeiT (ImageNet-1K). Unlike methods that uniformly increase token density to capture fine-grained details, DART offers a more efficient alternative, achieving 45% FLOPs reduction with superior performance. Extensive experiments on DeiT, Vim, and VideoMamba confirm that DART consistently enhances accuracy while incurring minimal or even reduced computational overhead. Code is available at https://github.com/HCPLab-SYSU/DART.
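
The idea of allocating denser tokens to high-scoring regions through quantiles can be illustrated with a one-dimensional toy example. The sketch below is not the DART implementation: the function name `adaptive_boundaries`, the fixed score list, and the restriction to a single axis are all assumptions made for illustration. It places region boundaries at equal quantiles of the cumulative score, using linear interpolation inside each cell, so each boundary is a piecewise-linear function of the scores.

```python
from bisect import bisect_right


def adaptive_boundaries(scores, num_tokens):
    """Split the axis [0, len(scores)] into num_tokens regions of equal score mass.

    Hypothetical 1-D illustration: `scores` stands in for learned region
    scores (assumed strictly positive), one per unit-width cell.
    """
    if num_tokens < 1 or not scores or min(scores) <= 0:
        raise ValueError("need num_tokens >= 1 and strictly positive scores")

    # Cumulative score mass at each cell edge 0..n.
    cdf = [0.0]
    for s in scores:
        cdf.append(cdf[-1] + s)
    total = cdf[-1]

    boundaries = [0.0]
    for k in range(1, num_tokens):
        target = total * k / num_tokens            # k-th quantile of the mass
        i = min(bisect_right(cdf, target) - 1, len(scores) - 1)
        # Linear interpolation inside cell i (piecewise-linear in the scores).
        boundaries.append(i + (target - cdf[i]) / scores[i])
    boundaries.append(float(len(scores)))
    return boundaries


# Two high-scoring cells in the middle each receive their own narrow region,
# while the low-scoring background is covered by two wide regions.
print(adaptive_boundaries([1, 1, 1, 1, 4, 4, 1, 1, 1, 1], 4))
```

With uniform scores the boundaries fall back to an even split, which corresponds to the fixed-size patching that the abstract contrasts against.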

VisionGRU: A Linear-Complexity RNN Model for Efficient Image Analysis

Dec 24, 2024

Shicheng Yin, Kaixuan Yin, Weixing Chen, Enbo Huang, Yang Liu

Abstract: Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) are two dominant models for image analysis. While CNNs excel at extracting multi-scale features and ViTs effectively capture global dependencies, both suffer from high computational costs, particularly when processing high-resolution images. Recently, state-space models (SSMs) and recurrent neural networks (RNNs) have attracted attention due to their efficiency. However, their performance in image classification tasks remains limited. To address these challenges, this paper introduces VisionGRU, a novel RNN-based architecture designed for efficient image classification. VisionGRU leverages a simplified Gated Recurrent Unit (minGRU) to process large-scale image features with linear complexity. It divides images into smaller patches and progressively reduces the sequence length while increasing the channel depth, thus facilitating multi-scale feature extraction. A hierarchical 2DGRU module with bidirectional scanning captures both local and global contexts, improving long-range dependency modeling, particularly for tasks like semantic segmentation. Experimental results on the ImageNet and ADE20K datasets demonstrate that VisionGRU outperforms ViTs while significantly reducing memory usage and computational costs, especially for high-resolution images. These findings underscore the potential of RNN-based approaches for developing efficient and scalable computer vision solutions. Code will be available at https://github.com/YangLiu9208/VisionGRU.
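
To make the linear-complexity claim concrete, here is a minimal sketch of a minGRU-style recurrence over a sequence of patch features. It is not the VisionGRU code: the function name `min_gru_scan`, the omission of bias terms, and the plain-list weight matrices are assumptions for illustration. It follows the simplified GRU form in which the gate and candidate depend only on the current input, so one pass over the sequence costs time linear in its length.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def min_gru_scan(xs, w_z, w_h, h0):
    """Run a minGRU-style recurrence over the input sequence xs.

    Assumed update rule (biases omitted for brevity):
        z_t = sigmoid(W_z x_t)
        c_t = W_h x_t
        h_t = (1 - z_t) * h_{t-1} + z_t * c_t
    """
    h = list(h0)
    outputs = []
    for x in xs:                                   # one step per token: O(sequence length)
        z = [sigmoid(a) for a in matvec(w_z, x)]   # update gate, input-only
        c = matvec(w_h, x)                         # candidate state, input-only
        h = [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, c)]
        outputs.append(h)
    return outputs


# With a zero gate weight the gate is always 0.5, so the state moves
# halfway toward the candidate at every step.
print(min_gru_scan([[2.0], [2.0]], w_z=[[0.0]], w_h=[[1.0]], h0=[0.0]))
```

Because neither the gate nor the candidate reads the previous hidden state, the recurrence can also be evaluated with a parallel scan; the sequential loop above is kept only for readability.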
