Jianyuan Guo

Gold-YOLO: Efficient Object Detector via Gather-and-Distribute Mechanism

Sep 21, 2023
Chengcheng Wang, Wei He, Ying Nie, Jianyuan Guo, Chuanjian Liu, Kai Han, Yunhe Wang

In recent years, YOLO-series models have emerged as the leading approaches to real-time object detection. Many studies have pushed the baseline higher by modifying the architecture, augmenting data, and designing new losses. However, we find that previous models still suffer from an information fusion problem, even though the Feature Pyramid Network (FPN) and Path Aggregation Network (PANet) have alleviated it. Therefore, this study proposes an advanced Gather-and-Distribute (GD) mechanism, realized with convolution and self-attention operations. The resulting model, named Gold-YOLO, boosts multi-scale feature fusion and achieves an ideal balance between latency and accuracy across all model scales. Additionally, we implement MAE-style pretraining in the YOLO series for the first time, allowing YOLO-series models to benefit from unsupervised pretraining. Gold-YOLO-N attains an outstanding 39.9% AP on the COCO val2017 dataset at 1030 FPS on a T4 GPU, outperforming the previous SOTA model YOLOv6-3.0-N with similar FPS by +2.4%. The PyTorch code is available at https://github.com/huawei-noah/Efficient-Computing/tree/master/Detection/Gold-YOLO, and the MindSpore code is available at https://gitee.com/mindspore/models/tree/master/research/cv/Gold_YOLO.
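
The abstract describes gathering features from several pyramid levels, fusing them globally, and distributing the fused information back to each scale. Below is a minimal sketch of such a gather-fuse-distribute step (convolution-based branch only; the paper also uses a self-attention branch). Module names, channel widths, and the residual injection are illustrative assumptions, not the official Gold-YOLO implementation.

```python
# Minimal sketch of a gather-and-distribute style fusion step (hypothetical module).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatherDistribute(nn.Module):
    """Gather multi-scale features, fuse them globally, distribute the result back."""
    def __init__(self, channels=(64, 128, 256), fused_dim=128):
        super().__init__()
        # 1x1 convs project every scale to a common width before fusion.
        self.align = nn.ModuleList(nn.Conv2d(c, fused_dim, 1) for c in channels)
        self.fuse = nn.Conv2d(fused_dim * len(channels), fused_dim, 3, padding=1)
        # Per-scale 1x1 convs inject the fused feature back ("distribute").
        self.inject = nn.ModuleList(nn.Conv2d(fused_dim, c, 1) for c in channels)

    def forward(self, feats):
        target_size = feats[1].shape[-2:]          # fuse at the middle resolution
        gathered = [F.interpolate(a(f), size=target_size, mode="bilinear", align_corners=False)
                    for a, f in zip(self.align, feats)]
        fused = self.fuse(torch.cat(gathered, dim=1))      # global information
        out = []
        for f, inj in zip(feats, self.inject):
            g = F.interpolate(fused, size=f.shape[-2:], mode="bilinear", align_corners=False)
            out.append(f + inj(g))                          # distribute via residual injection
        return out

feats = [torch.randn(1, 64, 80, 80), torch.randn(1, 128, 40, 40), torch.randn(1, 256, 20, 20)]
fused_feats = GatherDistribute()(feats)
```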

ParameterNet: Parameters Are All You Need for Large-scale Visual Pretraining of Mobile Networks

Jun 26, 2023
Kai Han, Yunhe Wang, Jianyuan Guo, Enhua Wu

Large-scale visual pretraining has significantly improved the performance of large vision models. However, we observe the \emph{low FLOPs pitfall}: existing low-FLOPs models cannot benefit from large-scale pretraining. In this paper, we propose a general design principle, named ParameterNet, of adding more parameters while maintaining low FLOPs for large-scale visual pretraining. For instance, dynamic convolutions are used to equip the networks with more parameters while only slightly increasing the FLOPs. The proposed ParameterNet scheme enables low-FLOPs networks to benefit from large-scale visual pretraining. Experiments on the large-scale ImageNet-22K dataset show the superiority of our ParameterNet scheme. For example, ParameterNet-600M achieves higher accuracy than the widely used Swin Transformer (81.6\% \emph{vs.} 80.9\%) with much lower FLOPs (0.6G \emph{vs.} 4.5G). The code will be released soon (MindSpore: https://gitee.com/mindspore/models, PyTorch: https://github.com/huawei-noah/Efficient-AI-Backbones).
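
As a concrete reading of "more parameters at roughly the same FLOPs", here is a simplified dynamic convolution: several expert kernels are mixed per sample by a cheap routing head, so the parameter count scales with the number of experts while the convolution cost stays close to that of a single kernel. The layer below is an illustrative sketch, not the paper's exact module.

```python
# Simplified dynamic convolution: more parameters, nearly unchanged FLOPs (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, num_experts=4):
        super().__init__()
        self.weight = nn.Parameter(
            torch.randn(num_experts, out_ch, in_ch, kernel_size, kernel_size) * 0.02)
        self.router = nn.Linear(in_ch, num_experts)   # per-sample mixing weights
        self.padding = kernel_size // 2

    def forward(self, x):
        b, c, _, _ = x.shape
        # Route: softmax over experts from globally pooled features (cheap).
        alpha = F.softmax(self.router(x.mean(dim=(2, 3))), dim=-1)       # (B, E)
        # Mix expert kernels per sample, then run a grouped conv so each
        # sample is convolved with its own mixed kernel.
        w = torch.einsum("be,eoikl->boikl", alpha, self.weight)          # (B, O, I, K, K)
        w = w.reshape(-1, *self.weight.shape[2:])                        # (B*O, I, K, K)
        x = x.reshape(1, b * c, *x.shape[2:])
        out = F.conv2d(x, w, padding=self.padding, groups=b)
        return out.reshape(b, -1, *out.shape[2:])

y = DynamicConv2d(16, 32)(torch.randn(2, 16, 56, 56))    # -> (2, 32, 56, 56)
```

The routing head adds only a tiny amount of compute per image, so parameters grow roughly E-fold while FLOPs stay close to a single convolution, which is the trade-off the abstract targets.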

VanillaKD: Revisit the Power of Vanilla Knowledge Distillation from Small Scale to Large Scale

May 25, 2023
Zhiwei Hao, Jianyuan Guo, Kai Han, Han Hu, Chang Xu, Yunhe Wang

The tremendous success of large models trained on extensive datasets demonstrates that scale is a key ingredient in achieving superior results. It is therefore imperative to reflect on the rationality of designing knowledge distillation (KD) approaches for limited-capacity architectures solely on small-scale datasets. In this paper, we identify the \emph{small data pitfall} present in previous KD methods, which results in underestimating the power of the vanilla KD framework on large-scale datasets such as ImageNet-1K. Specifically, we show that employing stronger data augmentation techniques and using larger datasets can directly reduce the gap between vanilla KD and other meticulously designed KD variants. This highlights the necessity of designing and evaluating KD approaches in practical scenarios, casting off the limitations of small-scale datasets. Our investigation of vanilla KD and its variants in more complex schemes, including stronger training strategies and different model capacities, demonstrates that vanilla KD is elegantly simple yet astonishingly effective in large-scale scenarios. Without bells and whistles, we obtain state-of-the-art ResNet-50, ViT-S, and ConvNeXtV2-T models for ImageNet, which achieve 83.1\%, 84.3\%, and 85.0\% top-1 accuracy, respectively. PyTorch code and checkpoints can be found at https://github.com/Hao840/vanillaKD.
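
For reference, the vanilla KD objective revisited here is the standard Hinton-style combination of a temperature-scaled KL term with cross-entropy on hard labels. The weighting and temperature below are placeholders, not the paper's training recipe.

```python
# Minimal vanilla knowledge distillation objective (standard Hinton-style KD).
import torch
import torch.nn.functional as F

def vanilla_kd_loss(student_logits, teacher_logits, targets, T=1.0, alpha=0.5):
    # Soft targets: KL between temperature-scaled teacher and student distributions.
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)
    # Hard targets: ordinary cross-entropy on ground-truth labels.
    ce = F.cross_entropy(student_logits, targets)
    return alpha * kd + (1 - alpha) * ce

loss = vanilla_kd_loss(torch.randn(8, 1000), torch.randn(8, 1000),
                       torch.randint(0, 1000, (8,)))
```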

VanillaNet: the Power of Minimalism in Deep Learning

May 23, 2023
Hanting Chen, Yunhe Wang, Jianyuan Guo, Dacheng Tao

At the heart of foundation models is the philosophy of "more is different", exemplified by the astonishing success in computer vision and natural language processing. However, the challenges of optimization and the inherent complexity of transformer models call for a paradigm shift towards simplicity. In this study, we introduce VanillaNet, a neural network architecture that embraces elegance in design. By avoiding high depth, shortcuts, and intricate operations like self-attention, VanillaNet is refreshingly concise yet remarkably powerful. Each layer is carefully crafted to be compact and straightforward, with nonlinear activation functions pruned after training to restore the original architecture. VanillaNet overcomes the challenges of inherent complexity, making it ideal for resource-constrained environments. Its easy-to-understand and highly simplified architecture opens new possibilities for efficient deployment. Extensive experimentation demonstrates that VanillaNet delivers performance on par with renowned deep neural networks and vision transformers, showcasing the power of minimalism in deep learning. This visionary journey of VanillaNet has significant potential to redefine the landscape and challenge the status quo of foundation models, setting a new path for elegant and effective model design. Pre-trained models and code are available at https://github.com/huawei-noah/VanillaNet and https://gitee.com/mindspore/models/tree/master/research/cv/vanillanet.
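
The phrase "nonlinear activation functions pruned after training to restore the original architecture" suggests a train-deep, deploy-shallow scheme: once the activation between two linear layers is annealed to identity, the pair can be folded into a single layer. The sketch below illustrates that folding for 1x1 convolutions only; it is a simplified reading, not VanillaNet's exact merging procedure.

```python
# If the activation between two convolutions becomes the identity, the pair
# collapses into one convolution at inference time (shown here for 1x1 convs).
import torch
import torch.nn as nn

def merge_1x1_convs(conv1: nn.Conv2d, conv2: nn.Conv2d) -> nn.Conv2d:
    """Fold conv2(conv1(x)) into one 1x1 conv, valid once no nonlinearity sits between them."""
    w1, b1 = conv1.weight.squeeze(-1).squeeze(-1), conv1.bias    # (C1, C0), (C1,)
    w2, b2 = conv2.weight.squeeze(-1).squeeze(-1), conv2.bias    # (C2, C1), (C2,)
    merged = nn.Conv2d(conv1.in_channels, conv2.out_channels, 1)
    with torch.no_grad():
        merged.weight.copy_((w2 @ w1).view_as(merged.weight))    # W2 * W1
        merged.bias.copy_(w2 @ b1 + b2)                          # W2 * b1 + b2
    return merged

c1, c2 = nn.Conv2d(8, 16, 1), nn.Conv2d(16, 8, 1)
x = torch.randn(2, 8, 14, 14)
assert torch.allclose(merge_1x1_convs(c1, c2)(x), c2(c1(x)), atol=1e-5)
```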

Masked Image Modeling with Local Multi-Scale Reconstruction

Mar 09, 2023
Haoqing Wang, Yehui Tang, Yunhe Wang, Jianyuan Guo, Zhi-Hong Deng, Kai Han

Masked Image Modeling (MIM) achieves outstanding success in self-supervised representation learning. Unfortunately, MIM models typically have a huge computational burden and a slow learning process, which is an inevitable obstacle to their industrial application. Although the lower layers play a key role in MIM, existing MIM models conduct the reconstruction task only at the top layer of the encoder. The lower layers are not explicitly guided, and the interaction among their patches is only used for calculating new activations. Considering that the reconstruction task requires non-trivial inter-patch interactions to reason about target signals, we apply it to multiple local layers, including both lower and upper ones. Further, since different layers are expected to learn information at different scales, we design local multi-scale reconstruction, where the lower and upper layers reconstruct fine-scale and coarse-scale supervision signals, respectively. This design not only accelerates the representation learning process by explicitly guiding multiple layers, but also facilitates multi-scale semantic understanding of the input. Extensive experiments show that, with a significantly lower pre-training burden, our model achieves comparable or better performance on classification, detection, and segmentation tasks than existing MIM models.
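
A rough sketch of the local multi-scale reconstruction idea: attach lightweight heads to several encoder layers and let lower heads regress fine-scale patch targets while deeper heads regress coarser, pooled ones. The shapes, the pixel-regression target, and the pooling schedule below are assumptions for illustration, not the paper's exact architecture.

```python
# Reconstruction heads at multiple encoder depths with fine-to-coarse targets (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLevelMIMHeads(nn.Module):
    def __init__(self, dim=384, patch=16, scales=(1, 2, 4)):
        super().__init__()
        # One linear head per supervised layer; scale s pools the target patch by s,
        # so deeper heads see coarser supervision.
        self.heads = nn.ModuleList(nn.Linear(dim, 3 * (patch // s) ** 2) for s in scales)
        self.patch, self.scales = patch, scales

    def loss(self, feats, image, mask):
        # feats: list of (B, N, dim) token features from the supervised encoder layers
        # image: (B, 3, H, W); mask: (B, N) boolean, True = masked patch
        total = 0.0
        for f, head, s in zip(feats, self.heads, self.scales):
            tgt = F.avg_pool2d(image, s)                       # coarser target per scale
            p = self.patch // s
            tgt = tgt.unfold(2, p, p).unfold(3, p, p)          # (B, 3, h, w, p, p)
            tgt = tgt.permute(0, 2, 3, 1, 4, 5).flatten(1, 2).flatten(2)   # (B, N, 3*p*p)
            total = total + F.mse_loss(head(f)[mask], tgt[mask])
        return total / len(self.heads)
```

Here the scale-1 head (attached to a lower layer) regresses full-resolution patch pixels, while the scale-4 head (attached to an upper layer) regresses 4x-pooled targets, matching the fine-to-coarse supervision described above.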

* CVPR 2023 

FastMIM: Expediting Masked Image Modeling Pre-training for Vision

Dec 13, 2022
Jianyuan Guo, Kai Han, Han Wu, Yehui Tang, Yunhe Wang, Chang Xu

The combination of transformers and the masked image modeling (MIM) pre-training framework has shown great potential in various vision tasks. However, the pre-training computational budget is too heavy and prevents MIM from becoming a practical training paradigm. This paper presents FastMIM, a simple and generic framework for expediting masked image modeling with the following two steps: (i) pre-training vision backbones with low-resolution input images; and (ii) reconstructing Histograms of Oriented Gradients (HOG) features instead of the original RGB values of the input images. In addition, we propose FastMIM-P, which progressively enlarges the input resolution during the pre-training stage to further enhance the transfer results of high-capacity models. We point out that: (i) a wide range of input resolutions in the pre-training phase can lead to similar performance in the fine-tuning phase and downstream tasks such as detection and segmentation; (ii) the shallow layers of the encoder are more important during pre-training, and discarding the last several layers can speed up the training stage with no harm to fine-tuning performance; (iii) the decoder should match the size of the selected network; and (iv) HOG is more stable than RGB values under resolution changes. Equipped with FastMIM, all kinds of vision backbones can be pre-trained in an efficient way. For example, we can achieve 83.8%/84.1% top-1 accuracy on ImageNet-1K with ViT-B/Swin-B as backbones. Compared to previous relevant approaches, we achieve comparable or better top-1 accuracy while accelerating the training procedure by $\sim$5$\times$. Code can be found at https://github.com/ggjy/FastMIM.pytorch.
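
The two ingredients named in the abstract, low-resolution inputs and HOG targets, can be sketched as follows. skimage's hog is used as a stand-in for the paper's HOG extractor, and the resolutions and cell sizes are illustrative choices, not FastMIM's exact settings.

```python
# Low-resolution encoder input plus per-patch HOG regression targets (illustrative).
import torch
import torch.nn.functional as F
from skimage.feature import hog

def hog_targets(images, patch=16):
    """images: (B, 3, H, W) in [0, 1] -> (B, N, 9) per-patch HOG descriptors."""
    feats = []
    for img in images:
        gray = img.mean(0).numpy()                 # grayscale view, one HOG cell per patch
        h = hog(gray, orientations=9, pixels_per_cell=(patch, patch),
                cells_per_block=(1, 1), feature_vector=False)     # (h, w, 1, 1, 9)
        feats.append(torch.from_numpy(h).reshape(-1, 9).float())
    return torch.stack(feats)

images = torch.rand(2, 3, 224, 224)
low_res = F.interpolate(images, size=128, mode="bilinear", align_corners=False)  # cheaper input
targets = hog_targets(low_res)                     # (2, 64, 9) regression targets for masked patches
```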

* Tech report 

GhostNetV2: Enhance Cheap Operation with Long-Range Attention

Nov 23, 2022
Yehui Tang, Kai Han, Jianyuan Guo, Chang Xu, Chao Xu, Yunhe Wang

Light-weight convolutional neural networks (CNNs) are specially designed for mobile applications with faster inference speed. The convolutional operation can only capture local information within a window region, which prevents performance from being further improved. Introducing self-attention into convolution can capture global information well, but it largely encumbers the actual speed. In this paper, we propose a hardware-friendly attention mechanism (dubbed DFC attention) and then present a new GhostNetV2 architecture for mobile applications. The proposed DFC attention is constructed from fully-connected layers, which can not only execute fast on common hardware but also capture dependencies between long-range pixels. We further revisit the expressiveness bottleneck in the previous GhostNet and propose to enhance the expanded features produced by cheap operations with DFC attention, so that a GhostNetV2 block can aggregate local and long-range information simultaneously. Extensive experiments demonstrate the superiority of GhostNetV2 over existing architectures. For example, it achieves 75.3% top-1 accuracy on ImageNet with 167M FLOPs, significantly surpassing GhostNetV1 (74.5%) at a similar computational cost. The source code will be available at https://github.com/huawei-noah/Efficient-AI-Backbones/tree/master/ghostnetv2_pytorch and https://gitee.com/mindspore/models/tree/master/research/cv/ghostnetv2.
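
One way to picture the decoupled fully-connected (DFC) attention described above is as axis-wise aggregation computed on a downsampled map and used to gate the main features. The sketch below approximates this with depthwise 1xK and Kx1 convolutions; it is an illustrative approximation, not the official GhostNetV2 block.

```python
# Simplified DFC-style attention branch: axis-wise aggregation gating the main features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DFCAttention(nn.Module):
    def __init__(self, channels, kernel=5):
        super().__init__()
        self.reduce = nn.Conv2d(channels, channels, 1)
        # Depthwise 1xK and Kx1 convolutions mix information along each axis cheaply.
        self.horizontal = nn.Conv2d(channels, channels, (1, kernel),
                                    padding=(0, kernel // 2), groups=channels)
        self.vertical = nn.Conv2d(channels, channels, (kernel, 1),
                                  padding=(kernel // 2, 0), groups=channels)

    def forward(self, x):
        # Compute the attention map at half resolution to keep the branch cheap.
        a = F.avg_pool2d(x, 2)
        a = self.vertical(self.horizontal(self.reduce(a)))
        a = torch.sigmoid(F.interpolate(a, size=x.shape[-2:], mode="nearest"))
        return x * a          # long-range gate applied to the cheap main features

y = DFCAttention(32)(torch.randn(1, 32, 56, 56))
```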

* This paper is accepted by NeurIPS 2022 (Spotlight) 

Hierarchical Relational Learning for Few-Shot Knowledge Graph Completion

Sep 16, 2022
Han Wu, Jie Yin, Bala Rajaratnam, Jianyuan Guo

Knowledge graphs (KGs) are known for their large scale and knowledge inference ability, but are also notorious for their incompleteness. Due to the long-tail distribution of relations in KGs, few-shot KG completion has been proposed as a solution to alleviate incompleteness and expand the coverage of KGs. It aims to make predictions for triplets involving novel relations when only a few training triplets are provided as reference. Previous methods have mostly focused on designing local neighbor aggregators to learn entity-level information and/or imposing a sequential dependency assumption at the triplet level to learn meta relation information. However, valuable pairwise triplet-level interactions and context-level relational information have been largely overlooked when learning meta representations of few-shot relations. In this paper, we propose a hierarchical relational learning method (HiRe) for few-shot KG completion. By jointly capturing three levels of relational information (entity-level, triplet-level, and context-level), HiRe can effectively learn and refine the meta representation of few-shot relations and consequently generalize well to new, unseen relations. Extensive experiments on two benchmark datasets validate the superiority of HiRe over other state-of-the-art methods.
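
Purely as an illustration of combining the three levels named above (entity, triplet, context) into a meta relation representation, a toy encoder might look like the following. Every module, shape, and pooling choice here is an assumption for illustration and does not reflect HiRe's actual architecture.

```python
# Toy three-level encoder producing a meta relation representation (illustrative only).
import torch
import torch.nn as nn

class HierarchicalRelationEncoder(nn.Module):
    def __init__(self, dim=100):
        super().__init__()
        self.entity_agg = nn.Linear(dim, dim)        # entity-level: aggregate neighbor embeddings
        self.triplet_att = nn.MultiheadAttention(2 * dim, num_heads=2, batch_first=True)
        self.context_proj = nn.Linear(dim, 2 * dim)  # context-level relational signal
        self.out = nn.Linear(2 * dim, 2 * dim)

    def forward(self, head_neighbors, tail_neighbors, context):
        # head/tail_neighbors: (K, M, dim) neighbor embeddings for K support triplets
        h = torch.relu(self.entity_agg(head_neighbors)).mean(dim=1)    # (K, dim)
        t = torch.relu(self.entity_agg(tail_neighbors)).mean(dim=1)    # (K, dim)
        pairs = torch.cat([h, t], dim=-1).unsqueeze(0)                 # (1, K, 2*dim)
        # Triplet-level: pairwise interactions among the K support triplets.
        inter, _ = self.triplet_att(pairs, pairs, pairs)
        # Context-level signal modulates the pooled meta representation.
        meta = inter.mean(dim=1) + self.context_proj(context).unsqueeze(0)
        return self.out(meta)                                          # (1, 2*dim)

rel = HierarchicalRelationEncoder()(torch.randn(3, 5, 100), torch.randn(3, 5, 100), torch.randn(100))
```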

* 10 pages, 5 figures 