Yunjie Tian

Spatial Transform Decoupling for Oriented Object Detection

Aug 21, 2023
Hongtian Yu, Yunjie Tian, Qixiang Ye, Yunfan Liu

Vision Transformers (ViTs) have achieved remarkable success in computer vision tasks. However, their potential in rotation-sensitive scenarios has not been fully explored, and this limitation may be inherently attributed to the lack of spatial invariance in the data-forwarding process. In this study, we present a novel approach, termed Spatial Transform Decoupling (STD), providing a simple-yet-effective solution for oriented object detection with ViTs. Built upon stacked ViT blocks, STD utilizes separate network branches to predict the position, size, and angle of bounding boxes, effectively harnessing the spatial transform potential of ViTs in a divide-and-conquer fashion. Moreover, by aggregating cascaded activation masks (CAMs) computed upon the regressed parameters, STD gradually enhances features within regions of interest (RoIs), which complements the self-attention mechanism. Without bells and whistles, STD achieves state-of-the-art performance on the benchmark datasets including DOTA-v1.0 (82.24% mAP) and HRSC2016 (98.55% mAP), which demonstrates the effectiveness of the proposed method. Source code is available at https://github.com/yuhongtian17/Spatial-Transform-Decoupling.
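
As an illustration of the divide-and-conquer head described above, the following PyTorch-style sketch regresses box position, size, and angle with separate branches on top of pooled ViT features. It is a minimal approximation for exposition; module and variable names are hypothetical, and it omits the cascaded activation masks.

```python
# Minimal sketch of a decoupled oriented-box head in the spirit of STD:
# separate branches regress (x, y), (w, h), and the angle from shared features.
# Illustrative approximation only, not the authors' implementation.
import torch
import torch.nn as nn

class DecoupledOBBHead(nn.Module):
    def __init__(self, in_dim: int = 768, hidden: int = 256):
        super().__init__()
        def branch(out_dim):
            return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(inplace=True),
                                 nn.Linear(hidden, out_dim))
        self.xy_branch = branch(2)      # box center (x, y)
        self.wh_branch = branch(2)      # box size (w, h)
        self.angle_branch = branch(1)   # box orientation

    def forward(self, roi_feats: torch.Tensor) -> torch.Tensor:
        # roi_feats: (num_rois, in_dim) pooled ViT features per RoI
        xy = self.xy_branch(roi_feats)
        wh = self.wh_branch(roi_feats)
        theta = self.angle_branch(roi_feats)
        return torch.cat([xy, wh, theta], dim=-1)  # (num_rois, 5)

rois = torch.randn(8, 768)
print(DecoupledOBBHead()(rois).shape)  # torch.Size([8, 5])
```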

Integrally Pre-Trained Transformer Pyramid Networks

Nov 23, 2022
Yunjie Tian, Lingxi Xie, Zhaozhi Wang, Longhui Wei, Xiaopeng Zhang, Jianbin Jiao, Yaowei Wang, Qi Tian, Qixiang Ye

In this paper, we present an integral pre-training framework based on masked image modeling (MIM). We advocate for pre-training the backbone and neck jointly so that the transfer gap between MIM and downstream recognition tasks is minimal. We make two technical contributions. First, we unify the reconstruction and recognition necks by inserting a feature pyramid into the pre-training stage. Second, we complement MIM with masked feature modeling (MFM), which offers multi-stage supervision to the feature pyramid. The pre-trained models, termed integrally pre-trained transformer pyramid networks (iTPNs), serve as powerful foundation models for visual recognition. In particular, the base/large-level iTPN achieves an 86.2%/87.8% top-1 accuracy on ImageNet-1K, a 53.2%/55.6% box AP on COCO object detection with a 1x training schedule using Mask R-CNN, and a 54.7%/57.7% mIoU on ADE20K semantic segmentation using UPerHead -- all these results set new records. Our work inspires the community to work on unifying upstream pre-training and downstream fine-tuning tasks. Code and the pre-trained models will be released at https://github.com/sunsmarterjie/iTPN.

* 13 pages, 5 figures, 13 tables 
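
The multi-stage supervision idea can be sketched as a masked-feature-modeling loss applied at every pyramid level. The snippet below is only a rough illustration under assumed tensor shapes and an assumed frozen teacher; it is not the released iTPN code.

```python
# Rough sketch of multi-stage masked feature modeling (MFM) supervision:
# each pyramid level is regressed toward target features (e.g., from a frozen
# or momentum teacher). Names and the teacher choice are illustrative only.
import torch
import torch.nn.functional as F

def mfm_loss(pyramid_feats, teacher_feats, mask):
    """pyramid_feats / teacher_feats: lists of (B, N, C) tensors, one per stage;
    mask: (B, N) bool, True for masked tokens supervised by MFM."""
    loss = 0.0
    for f, t in zip(pyramid_feats, teacher_feats):
        per_token = F.mse_loss(f, t.detach(), reduction="none").mean(-1)  # (B, N)
        loss = loss + (per_token * mask).sum() / mask.sum().clamp(min=1)
    return loss / len(pyramid_feats)

B, N = 2, 196
feats = [torch.randn(B, N, 256, requires_grad=True) for _ in range(3)]
targets = [torch.randn(B, N, 256) for _ in range(3)]
mask = torch.rand(B, N) < 0.75
print(mfm_loss(feats, targets, mask))
```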

HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling

May 30, 2022
Xiaosong Zhang, Yunjie Tian, Wei Huang, Qixiang Ye, Qi Dai, Lingxi Xie, Qi Tian

Recently, masked image modeling (MIM) has offered a new methodology for self-supervised pre-training of vision transformers. A key idea behind efficient implementations is to discard the masked image patches (or tokens) throughout the target network (encoder), which requires the encoder to be a plain vision transformer (e.g., ViT), even though hierarchical vision transformers (e.g., Swin Transformer) have potentially better properties for formulating vision inputs. In this paper, we offer a new design of hierarchical vision transformers, named HiViT (short for Hierarchical ViT), that enjoys both high efficiency and good performance in MIM. The key is to remove the unnecessary "local inter-unit operations", deriving structurally simple hierarchical vision transformers in which mask units can be serialized just as in plain vision transformers. For this purpose, we start with Swin Transformer and (i) set the masking unit size to the token size of the main stage, (ii) switch off inter-unit self-attention before the main stage, and (iii) eliminate all operations after the main stage. Empirical studies demonstrate the advantageous performance of HiViT in fully-supervised, self-supervised, and transfer learning. In particular, when running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a 1.9x speed-up over Swin-B, and the performance gain generalizes to the downstream tasks of detection and segmentation. Code will be made publicly available.
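
The efficiency argument, that once inter-unit operations are removed the masked tokens can simply be dropped, can be illustrated with an MAE-style token-selection step. This is a conceptual sketch, not the HiViT implementation.

```python
# Once every stage acts on tokens independently (or within a mask unit),
# visible tokens can be serialized and the masked ones discarded up front,
# exactly as MAE does with a plain ViT. Conceptual example only.
import torch

def keep_visible(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """tokens: (B, N, C). Returns only the visible tokens, (B, N_keep, C)."""
    B, N, C = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)
    keep_idx = noise.argsort(dim=1)[:, :n_keep]   # random subset per sample
    return tokens.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, C))

x = torch.randn(2, 196, 512)
print(keep_visible(x).shape)  # torch.Size([2, 49, 512])
```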

Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers

Mar 27, 2022
Yunjie Tian, Lingxi Xie, Jiemin Fang, Mengnan Shi, Junran Peng, Xiaopeng Zhang, Jianbin Jiao, Qi Tian, Qixiang Ye

The past year has witnessed the rapid development of masked image modeling (MIM). MIM is mostly built upon vision transformers and suggests that self-supervised visual representations can be learned by masking parts of the input image while requiring the target model to recover the missing contents. MIM has demonstrated promising results on downstream tasks, yet we are interested in whether there exist other effective ways to 'learn by recovering missing contents'. In this paper, we investigate this topic by designing five other learning objectives that follow the same procedure as MIM but degrade the input image in different ways. With extensive experiments, we manage to summarize a few design principles for token-based pre-training of vision transformers. In particular, the best practice is obtained by keeping the original image style and enriching spatial masking with spatial misalignment -- this design achieves superior performance over MIM on a series of downstream recognition tasks without extra computational cost. The code is available at https://github.com/sunsmarterjie/beyond_masking.

* 20 pages, 5 figures, 3 tables 
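
One plausible reading of "enriching spatial masking with spatial misalignment" is sketched below on a patch grid; the paper's actual degradation operators may differ, so treat this purely as a hypothetical illustration.

```python
# Hypothetical combination of spatial masking with spatial misalignment on a
# patch grid: zero out some patches, randomly permute a disjoint subset.
import torch

def mask_and_misalign(patches: torch.Tensor, mask_ratio=0.5, shuffle_ratio=0.25):
    """patches: (B, N, C). Returns a degraded copy of the patch sequence."""
    B, N, C = patches.shape
    out = patches.clone()
    for b in range(B):
        perm = torch.randperm(N)
        n_mask = int(N * mask_ratio)
        n_shuf = int(N * shuffle_ratio)
        masked = perm[:n_mask]
        shuffled = perm[n_mask:n_mask + n_shuf]
        out[b, masked] = 0.0                                          # spatial masking
        out[b, shuffled] = out[b, shuffled[torch.randperm(n_shuf)]]   # misalignment
    return out

print(mask_and_misalign(torch.randn(2, 196, 768)).shape)
```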

Exploring Complicated Search Spaces with Interleaving-Free Sampling

Dec 05, 2021
Yunjie Tian, Lingxi Xie, Jiemin Fang, Jianbin Jiao, Qixiang Ye, Qi Tian

Existing neural architecture search algorithms mostly work on search spaces with short-distance connections. We argue that such designs, though safe and stable, obstruct the search algorithms from exploring more complicated scenarios. In this paper, we build the search algorithm upon a complicated search space with long-distance connections, and show that existing weight-sharing search algorithms mostly fail due to the existence of interleaved connections. Based on this observation, we present a simple yet effective algorithm, named IF-NAS, which performs a periodic sampling strategy to construct different sub-networks during the search procedure, preventing interleaved connections from emerging in any of them. In the proposed search space, IF-NAS outperforms both random sampling and previous weight-sharing search algorithms by a significant margin. IF-NAS also generalizes to the micro, cell-based spaces, which are much easier. Our research emphasizes the importance of macro structure, and we look forward to further efforts along this direction.

* 9 pages, 8 figures, 6 tables 
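
A simplified version of interleaving-free periodic sampling is sketched below: edges are partitioned into groups and each search step activates a single group, so interleaved connections never co-occur in a sampled sub-network. The grouping rule here is illustrative, not the paper's exact scheme.

```python
# Simplified periodic (interleaving-free) sampling over super-network edges.
def periodic_subnetwork(edges, num_groups, step):
    """edges: list of (src, dst) pairs; returns the edges active at this step."""
    group = step % num_groups
    return [e for i, e in enumerate(edges) if i % num_groups == group]

edges = [(0, 2), (0, 3), (1, 3), (1, 4), (2, 4), (3, 5)]
for step in range(3):
    print(step, periodic_subnetwork(edges, num_groups=3, step=step))
```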

Semantic-Aware Generation for Self-Supervised Visual Representation Learning

Nov 25, 2021
Yunjie Tian, Lingxi Xie, Xiaopeng Zhang, Jiemin Fang, Haohang Xu, Wei Huang, Jianbin Jiao, Qi Tian, Qixiang Ye

In this paper, we propose a self-supervised visual representation learning approach which involves both generative and discriminative proxies, where we focus on the former part by requiring the target network to recover the original image based on the mid-level features. Different from prior work that mostly focuses on pixel-level similarity between the original and generated images, we advocate for Semantic-aware Generation (SaGe) to facilitate richer semantics rather than details to be preserved in the generated image. The core idea of implementing SaGe is to use an evaluator, a deep network that is pre-trained without labels, for extracting semantic-aware features. SaGe complements the target network with view-specific features and thus alleviates the semantic degradation brought by intensive data augmentations. We execute SaGe on ImageNet-1K and evaluate the pre-trained models on five downstream tasks including nearest neighbor test, linear classification, and fine-scaled image recognition, demonstrating its ability to learn stronger visual representations.

* 13 pages, 5 figures, 11 tables 
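
The semantic-aware objective can be sketched as a feature-space loss computed with a label-free evaluator network instead of a pixel-wise loss. The evaluator and the cosine loss below are placeholders chosen for illustration, not the paper's exact formulation.

```python
# Sketch of a semantic-aware generation objective: compare evaluator features
# of the generated and original images rather than raw pixels.
import torch
import torch.nn.functional as F

def sage_style_loss(generated, original, evaluator):
    with torch.no_grad():
        target = evaluator(original)   # semantic features of the original image
    pred = evaluator(generated)        # gradients flow back into the generator
    return 1.0 - F.cosine_similarity(pred.flatten(1), target.flatten(1)).mean()

# Placeholder evaluator standing in for a pre-trained, label-free network.
evaluator = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3, 2, 1), torch.nn.ReLU(),
                                torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten())
gen = torch.randn(2, 3, 64, 64, requires_grad=True)
orig = torch.randn(2, 3, 64, 64)
print(sage_style_loss(gen, orig, evaluator))
```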

GraFormer: Graph Convolution Transformer for 3D Pose Estimation

Sep 17, 2021
Weixi Zhao, Yunjie Tian, Qixiang Ye, Jianbin Jiao, Weiqiang Wang

Exploiting relations among 2D joints plays a crucial role yet remains under-developed in 2D-to-3D pose estimation. To alleviate this issue, we propose GraFormer, a novel transformer architecture combined with graph convolution for 3D pose estimation. The proposed GraFormer comprises two repeatedly stacked core modules, GraAttention and the ChebGConv block. GraAttention enables all 2D joints to interact in a global receptive field without weakening the graph structure information of the joints, which introduces vital features for later modules. Unlike vanilla graph convolutions that only model the apparent relationships of joints, the ChebGConv block enables 2D joints to interact through high-order neighborhoods, which formulates their hidden implicit relations. We empirically show the superiority of GraFormer through extensive experiments on popular benchmarks. Specifically, GraFormer outperforms the state of the art on the Human3.6M dataset while using only 18% of the parameters. The code is available at https://github.com/Graformer/GraFormer.

* 9 pages, 6 figures 
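
For readers unfamiliar with Chebyshev graph convolutions, which the ChebGConv block builds upon, a minimal implementation is sketched below; it is a generic ChebConv layer, not the GraFormer repository code.

```python
# Minimal Chebyshev graph convolution: features are propagated with Chebyshev
# polynomials T_k of the rescaled graph Laplacian, so a K-order layer mixes
# information over up to K-hop joint neighborhoods.
import torch
import torch.nn as nn

class ChebConv(nn.Module):
    def __init__(self, in_dim, out_dim, K=3):
        super().__init__()
        self.weights = nn.Parameter(torch.randn(K, in_dim, out_dim) * 0.02)

    def forward(self, x, L_scaled):
        # x: (B, N, in_dim); L_scaled: (N, N) rescaled Laplacian (2L/lmax - I)
        Tx_prev, Tx = x, L_scaled @ x                 # T_0(L~)x and T_1(L~)x
        out = Tx_prev @ self.weights[0] + Tx @ self.weights[1]
        for k in range(2, self.weights.shape[0]):
            Tx_prev, Tx = Tx, 2 * (L_scaled @ Tx) - Tx_prev   # Chebyshev recurrence
            out = out + Tx @ self.weights[k]
        return out

x = torch.randn(2, 17, 64)            # 17 joints, 64-dim features
L = torch.eye(17)                     # placeholder Laplacian for the skeleton graph
print(ChebConv(64, 128)(x, L).shape)  # torch.Size([2, 17, 128])
```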

Adaptive Linear Span Network for Object Skeleton Detection

Nov 08, 2020
Chang Liu, Yunjie Tian, Jianbin Jiao, Qixiang Ye

Conventional networks for object skeleton detection are usually hand-crafted. Although effective, they require intensive prior knowledge to configure representative features for objects at different scale granularities. In this paper, we propose the adaptive linear span network (AdaLSN), driven by neural architecture search (NAS), to automatically configure and integrate scale-aware features for object skeleton detection. AdaLSN is formulated with the theory of linear span, which provides one of the earliest explanations for multi-scale deep feature fusion. AdaLSN is materialized by defining a mixed unit-pyramid search space, which goes beyond many existing search spaces that use only unit-level or pyramid-level features. Within the mixed space, we apply genetic architecture search to jointly optimize unit-level operations and pyramid-level connections for adaptive feature space expansion. AdaLSN substantiates its versatility by achieving a significantly better accuracy-latency trade-off than the state of the art. It also demonstrates general applicability to image-to-mask tasks such as edge detection and road extraction. Code is available at https://github.com/sunsmarterjie/SDL-Skeleton.

* 13 pages, 9 figures 
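
The genetic search over the mixed unit-pyramid space can be caricatured with a toy loop over a flat encoding of unit operations and pyramid connections. The encoding, mutation rate, and fitness below are placeholders only.

```python
# Toy genetic-search loop over a mixed encoding: op ids for units plus binary
# pyramid connections. Fitness stands in for validation accuracy.
import random

def random_arch(num_units=4, num_connections=6):
    ops = [random.randint(0, 3) for _ in range(num_units)]          # op id per unit
    links = [random.randint(0, 1) for _ in range(num_connections)]  # pyramid edges
    return ops + links

def mutate(arch, p=0.2):
    return [random.randint(0, 3) if i < 4 and random.random() < p
            else (1 - g if i >= 4 and random.random() < p else g)
            for i, g in enumerate(arch)]

def fitness(arch):   # placeholder objective
    return sum(arch) + random.random()

population = [random_arch() for _ in range(8)]
for _ in range(10):
    population.sort(key=fitness, reverse=True)
    parents = population[:4]
    population = parents + [mutate(random.choice(parents)) for _ in range(4)]
print(max(population, key=fitness))
```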

Discretization-Aware Architecture Search

Jul 07, 2020
Yunjie Tian, Chang Liu, Lingxi Xie, Jianbin Jiao, Qixiang Ye

The search cost of neural architecture search (NAS) has been largely reduced by weight-sharing methods. These methods optimize a super-network with all possible edges and operations, and determine the optimal sub-network by discretization, i.e., pruning off weak candidates. The discretization process, performed on either operations or edges, incurs significant inaccuracy, and thus the quality of the final architecture is not guaranteed. This paper presents discretization-aware architecture search (DA2S), whose core idea is to add a loss term that pushes the super-network toward the configuration of the desired topology, so that the accuracy loss brought by discretization is largely alleviated. Experiments on standard image classification benchmarks demonstrate the superiority of our approach, in particular under imbalanced target network configurations that were not studied before.

* 14 pages, 7 figures 
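
One common way to instantiate a loss that pushes the super-network toward the desired topology is an entropy penalty on the architecture weights, sketched below; this is an illustrative stand-in, not necessarily the exact term used in DA2S.

```python
# Entropy penalty on the softmax over candidate operations: low entropy means
# near one-hot architecture weights, so pruning at discretization loses little.
import torch
import torch.nn.functional as F

def discretization_loss(arch_logits: torch.Tensor) -> torch.Tensor:
    """arch_logits: (num_edges, num_ops) architecture parameters."""
    probs = F.softmax(arch_logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(-1)   # per-edge entropy
    return entropy.mean()

alpha = torch.randn(14, 8, requires_grad=True)             # e.g., a DARTS-like space
total_loss = torch.tensor(1.0) + 0.1 * discretization_loss(alpha)
print(total_loss)
```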