Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Qiming Zhang

Frank

Revolutionizing Agrifood Systems with Artificial Intelligence: A Survey

May 03, 2023

Tao Chen, Liang Lv, Di Wang, Jing Zhang, Yue Yang, Zeyang Zhao, Chen Wang, Xiaowei Guo, Hao Chen, Qingye Wang(+5 more)

Figure 1 for Revolutionizing Agrifood Systems with Artificial Intelligence: A Survey

Figure 2 for Revolutionizing Agrifood Systems with Artificial Intelligence: A Survey

Figure 3 for Revolutionizing Agrifood Systems with Artificial Intelligence: A Survey

Figure 4 for Revolutionizing Agrifood Systems with Artificial Intelligence: A Survey

Abstract:With the world population rapidly increasing, transforming our agrifood systems to be more productive, efficient, safe, and sustainable is crucial to mitigate potential food shortages. Recently, artificial intelligence (AI) techniques such as deep learning (DL) have demonstrated their strong abilities in various areas, including language, vision, remote sensing (RS), and agrifood systems applications. However, the overall impact of AI on agrifood systems remains unclear. In this paper, we thoroughly review how AI techniques can transform agrifood systems and contribute to the modern agrifood industry. Firstly, we summarize the data acquisition methods in agrifood systems, including acquisition, storage, and processing techniques. Secondly, we present a progress review of AI methods in agrifood systems, specifically in agriculture, animal husbandry, and fishery, covering topics such as agrifood classification, growth monitoring, yield prediction, and quality assessment. Furthermore, we highlight potential challenges and promising research opportunities for transforming modern agrifood systems with AI. We hope this survey could offer an overall picture to newcomers in the field and serve as a starting point for their further research.

* Submitted to ACM

Via

Access Paper or Ask Questions

BEVSimDet: Simulated Multi-modal Distillation in Bird's-Eye View for Multi-view 3D Object Detection

Apr 15, 2023

Haimei Zhao, Qiming Zhang, Shanshan Zhao, Jing Zhang, Dacheng Tao

Figure 1 for BEVSimDet: Simulated Multi-modal Distillation in Bird's-Eye View for Multi-view 3D Object Detection

Figure 2 for BEVSimDet: Simulated Multi-modal Distillation in Bird's-Eye View for Multi-view 3D Object Detection

Figure 3 for BEVSimDet: Simulated Multi-modal Distillation in Bird's-Eye View for Multi-view 3D Object Detection

Figure 4 for BEVSimDet: Simulated Multi-modal Distillation in Bird's-Eye View for Multi-view 3D Object Detection

Abstract:Multi-view camera-based 3D object detection has gained popularity due to its low cost. But accurately inferring 3D geometry solely from camera data remains challenging, which impacts model performance. One promising approach to address this issue is to distill precise 3D geometry knowledge from LiDAR data. However, transferring knowledge between different sensor modalities is hindered by the significant modality gap. In this paper, we approach this challenge from the perspective of both architecture design and knowledge distillation and present a new simulated multi-modal 3D object detection method named BEVSimDet. We first introduce a novel framework that includes a LiDAR and camera fusion-based teacher and a simulated multi-modal student, where the student simulates multi-modal features with image-only input. To facilitate effective distillation, we propose a simulated multi-modal distillation scheme that supports intra-modal, cross-modal, and multi-modal distillation simultaneously, in Bird's-eye-view (BEV) space. By combining them together, BEVSimDet can learn better feature representations for 3D object detection while enjoying cost-effective camera-only deployment. Experimental results on the challenging nuScenes benchmark demonstrate the effectiveness and superiority of BEVSimDet over recent representative methods. The source code will be released at \href{https://github.com/ViTAE-Transformer/BEVSimDet}{BEVSimDet}.

* 15 pages; add link

Via

Access Paper or Ask Questions

Vision Transformer with Quadrangle Attention

Mar 27, 2023

Qiming Zhang, Jing Zhang, Yufei Xu, Dacheng Tao

Figure 1 for Vision Transformer with Quadrangle Attention

Figure 2 for Vision Transformer with Quadrangle Attention

Figure 3 for Vision Transformer with Quadrangle Attention

Figure 4 for Vision Transformer with Quadrangle Attention

Abstract:Window-based attention has become a popular choice in vision transformers due to its superior performance, lower computational complexity, and less memory footprint. However, the design of hand-crafted windows, which is data-agnostic, constrains the flexibility of transformers to adapt to objects of varying sizes, shapes, and orientations. To address this issue, we propose a novel quadrangle attention (QA) method that extends the window-based attention to a general quadrangle formulation. Our method employs an end-to-end learnable quadrangle regression module that predicts a transformation matrix to transform default windows into target quadrangles for token sampling and attention calculation, enabling the network to model various targets with different shapes and orientations and capture rich context information. We integrate QA into plain and hierarchical vision transformers to create a new architecture named QFormer, which offers minor code modifications and negligible extra computational cost. Extensive experiments on public benchmarks demonstrate that QFormer outperforms existing representative vision transformers on various vision tasks, including classification, object detection, semantic segmentation, and pose estimation. The code will be made publicly available at \href{https://github.com/ViTAE-Transformer/QFormer}{QFormer}.

* 15 pages, the extension of the ECCV 2022 paper (VSA: Learning Varied-Size Window Attention in Vision Transformers)

Via

Access Paper or Ask Questions

ViTPose+: Vision Transformer Foundation Model for Generic Body Pose Estimation

Dec 07, 2022

Yufei Xu, Jing Zhang, Qiming Zhang, Dacheng Tao

Abstract:In this paper, we show the surprisingly good properties of plain vision transformers for body pose estimation from various aspects, namely simplicity in model structure, scalability in model size, flexibility in training paradigm, and transferability of knowledge between models, through a simple baseline model dubbed ViTPose. Specifically, ViTPose employs the plain and non-hierarchical vision transformer as an encoder to encode features and a lightweight decoder to decode body keypoints in either a top-down or a bottom-up manner. It can be scaled up from about 20M to 1B parameters by taking advantage of the scalable model capacity and high parallelism of the vision transformer, setting a new Pareto front for throughput and performance. Besides, ViTPose is very flexible regarding the attention type, input resolution, and pre-training and fine-tuning strategy. Based on the flexibility, a novel ViTPose+ model is proposed to deal with heterogeneous body keypoint categories in different types of body pose estimation tasks via knowledge factorization, i.e., adopting task-agnostic and task-specific feed-forward networks in the transformer. We also empirically demonstrate that the knowledge of large ViTPose models can be easily transferred to small ones via a simple knowledge token. Experimental results show that our ViTPose model outperforms representative methods on the challenging MS COCO Human Keypoint Detection benchmark at both top-down and bottom-up settings. Furthermore, our ViTPose+ model achieves state-of-the-art performance simultaneously on a series of body pose estimation tasks, including MS COCO, AI Challenger, OCHuman, MPII for human keypoint detection, COCO-Wholebody for whole-body keypoint detection, as well as AP-10K and APT-36K for animal keypoint detection, without sacrificing inference speed.

* Extension of ViTPose paper

Via

Access Paper or Ask Questions

1st Workshop on Maritime Computer Vision 2023: Challenge Results

Nov 28, 2022

Benjamin Kiefer, Matej Kristan, Janez Perš, Lojze Žust, Fabio Poiesi, Fabio Augusto de Alcantara Andrade, Alexandre Bernardino, Matthew Dawkins, Jenni Raitoharju, Yitong Quan(+63 more)

Figure 1 for 1st Workshop on Maritime Computer Vision 2023: Challenge Results

Figure 2 for 1st Workshop on Maritime Computer Vision 2023: Challenge Results

Figure 3 for 1st Workshop on Maritime Computer Vision 2023: Challenge Results

Figure 4 for 1st Workshop on Maritime Computer Vision 2023: Challenge Results

Abstract:The 1$^{\text{st}}$ Workshop on Maritime Computer Vision (MaCVi) 2023 focused on maritime computer vision for Unmanned Aerial Vehicles (UAV) and Unmanned Surface Vehicle (USV), and organized several subchallenges in this domain: (i) UAV-based Maritime Object Detection, (ii) UAV-based Maritime Object Tracking, (iii) USV-based Maritime Obstacle Segmentation and (iv) USV-based Maritime Obstacle Detection. The subchallenges were based on the SeaDronesSee and MODS benchmarks. This report summarizes the main findings of the individual subchallenges and introduces a new benchmark, called SeaDronesSee Object Detection v2, which extends the previous benchmark by including more classes and footage. We provide statistical and qualitative analyses, and assess trends in the best-performing methodologies of over 130 submissions. The methods are summarized in the appendix. The datasets, evaluation code and the leaderboard are publicly available at https://seadronessee.cs.uni-tuebingen.de/macvi.

* MaCVi 2023 was part of WACV 2023. This report (38 pages) discusses the competition as part of MaCVi

Via

Access Paper or Ask Questions

Rethinking Hierarchies in Pre-trained Plain Vision Transformer

Nov 08, 2022

Yufei Xu, Jing Zhang, Qiming Zhang, Dacheng Tao

Abstract:Self-supervised pre-training vision transformer (ViT) via masked image modeling (MIM) has been proven very effective. However, customized algorithms should be carefully designed for the hierarchical ViTs, e.g., GreenMIM, instead of using the vanilla and simple MAE for the plain ViT. More importantly, since these hierarchical ViTs cannot reuse the off-the-shelf pre-trained weights of the plain ViTs, the requirement of pre-training them leads to a massive amount of computational cost, thereby incurring both algorithmic and computational complexity. In this paper, we address this problem by proposing a novel idea of disentangling the hierarchical architecture design from the self-supervised pre-training. We transform the plain ViT into a hierarchical one with minimal changes. Technically, we change the stride of linear embedding layer from 16 to 4 and add convolution (or simple average) pooling layers between the transformer blocks, thereby reducing the feature size from 1/4 to 1/32 sequentially. Despite its simplicity, it outperforms the plain ViT baseline in classification, detection, and segmentation tasks on ImageNet, MS COCO, Cityscapes, and ADE20K benchmarks, respectively. We hope this preliminary study could draw more attention from the community on developing effective (hierarchical) ViTs while avoiding the pre-training cost by leveraging the off-the-shelf checkpoints. The code and models will be released at https://github.com/ViTAE-Transformer/HPViT.

* Tech report, work in progress

Via

Access Paper or Ask Questions

Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model

Aug 10, 2022

Di Wang, Qiming Zhang, Yufei Xu, Jing Zhang, Bo Du, Dacheng Tao, Liangpei Zhang

Figure 1 for Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model

Figure 2 for Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model

Figure 3 for Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model

Figure 4 for Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model

Abstract:Large-scale vision foundation models have made significant progress in visual tasks on natural images, where the vision transformers are the primary choice for their good scalability and representation ability. However, the utilization of large models in the remote sensing (RS) community remains under-explored where existing models are still at small-scale, which limits the performance. In this paper, we resort to plain vision transformers with about 100 million parameters and make the first attempt to propose large vision models customized for RS tasks and explore how such large models perform. Specifically, to handle the large image size and objects of various orientations in RS images, we propose a new rotated varied-size window attention to substitute the original full attention in transformers, which could significantly reduce the computational cost and memory footprint while learn better object representation by extracting rich context from the generated diverse windows. Experiments on detection tasks demonstrate the superiority of our model over all state-of-the-art models, achieving 81.16% mAP on the DOTA-V1.0 dataset. The results of our models on downstream classification and segmentation tasks also demonstrate competitive performance compared with the existing advanced methods. Further experiments show the advantages of our models on computational complexity and few-shot learning.

* The code and models will be released at https://github.com/ViTAE-Transformer/Remote-Sensing-RVSA

Via

Access Paper or Ask Questions

Toward Real-world Single Image Deraining: A New Benchmark and Beyond

Jun 11, 2022

Wei Li, Qiming Zhang, Jing Zhang, Zhen Huang, Xinmei Tian, Dacheng Tao

Figure 1 for Toward Real-world Single Image Deraining: A New Benchmark and Beyond

Figure 2 for Toward Real-world Single Image Deraining: A New Benchmark and Beyond

Figure 3 for Toward Real-world Single Image Deraining: A New Benchmark and Beyond

Figure 4 for Toward Real-world Single Image Deraining: A New Benchmark and Beyond

Abstract:Single image deraining (SID) in real scenarios attracts increasing attention in recent years. Due to the difficulty in obtaining real-world rainy/clean image pairs, previous real datasets suffer from low-resolution images, homogeneous rain streaks, limited background variation, and even misalignment of image pairs, resulting in incomprehensive evaluation of SID methods. To address these issues, we establish a new high-quality dataset named RealRain-1k, consisting of $1,120$ high-resolution paired clean and rainy images with low- and high-density rain streaks, respectively. Images in RealRain-1k are automatically generated from a large number of real-world rainy video clips through a simple yet effective rain density-controllable filtering method, and have good properties of high image resolution, background diversity, rain streaks variety, and strict spatial alignment. RealRain-1k also provides abundant rain streak layers as a byproduct, enabling us to build a large-scale synthetic dataset named SynRain-13k by pasting the rain streak layers on abundant natural images. Based on them and existing datasets, we benchmark more than 10 representative SID methods on three tracks: (1) fully supervised learning on RealRain-1k, (2) domain generalization to real datasets, and (3) syn-to-real transfer learning. The experimental results (1) show the difference of representative methods in image restoration performance and model complexity, (2) validate the significance of the proposed datasets for model generalization, and (3) provide useful insights on the superiority of learning from diverse domains and shed lights on the future research on real-world SID. The datasets will be released at https://github.com/hiker-lw/RealRain-1k

Via

Access Paper or Ask Questions

ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation

Apr 26, 2022

Yufei Xu, Jing Zhang, Qiming Zhang, Dacheng Tao

Figure 1 for ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation

Figure 2 for ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation

Figure 3 for ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation

Figure 4 for ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation

Abstract:Recently, customized vision transformers have been adapted for human pose estimation and have achieved superior performance with elaborate structures. However, it is still unclear whether plain vision transformers can facilitate pose estimation. In this paper, we take the first step toward answering the question by employing a plain and non-hierarchical vision transformer together with simple deconvolution decoders termed ViTPose for human pose estimation. We demonstrate that a plain vision transformer with MAE pretraining can obtain superior performance after finetuning on human pose estimation datasets. ViTPose has good scalability with respect to model size and flexibility regarding input resolution and token number. Moreover, it can be easily pretrained using the unlabeled pose data without the need for large-scale upstream ImageNet data. Our biggest ViTPose model based on the ViTAE-G backbone with 1 billion parameters obtains the best 80.9 mAP on the MS COCO test-dev set, while the ensemble models further set a new state-of-the-art for human pose estimation, i.e., 81.1 mAP. The source code and models will be released at https://github.com/ViTAE-Transformer/ViTPose.

* Tech report. 81.1 mAP on MS COCO Keypoint Detection test-dev set

Via

Access Paper or Ask Questions

VSA: Learning Varied-Size Window Attention in Vision Transformers

Apr 18, 2022

Qiming Zhang, Yufei Xu, Jing Zhang, Dacheng Tao

Figure 1 for VSA: Learning Varied-Size Window Attention in Vision Transformers

Figure 2 for VSA: Learning Varied-Size Window Attention in Vision Transformers

Figure 3 for VSA: Learning Varied-Size Window Attention in Vision Transformers

Figure 4 for VSA: Learning Varied-Size Window Attention in Vision Transformers

Abstract:Attention within windows has been widely explored in vision transformers to balance the performance, computation complexity, and memory footprint. However, current models adopt a hand-crafted fixed-size window design, which restricts their capacity of modeling long-term dependencies and adapting to objects of different sizes. To address this drawback, we propose \textbf{V}aried-\textbf{S}ize Window \textbf{A}ttention (VSA) to learn adaptive window configurations from data. Specifically, based on the tokens within each default window, VSA employs a window regression module to predict the size and location of the target window, i.e., the attention area where the key and value tokens are sampled. By adopting VSA independently for each attention head, it can model long-term dependencies, capture rich context from diverse windows, and promote information exchange among overlapped windows. VSA is an easy-to-implement module that can replace the window attention in state-of-the-art representative models with minor modifications and negligible extra computational cost while improving their performance by a large margin, e.g., 1.1\% for Swin-T on ImageNet classification. In addition, the performance gain increases when using larger images for training and test. Experimental results on more downstream tasks, including object detection, instance segmentation, and semantic segmentation, further demonstrate the superiority of VSA over the vanilla window attention in dealing with objects of different sizes. The code will be released https://github.com/ViTAE-Transformer/ViTAE-VSA.

* 23 pages, 13 tables, and 5 figures

Via

Access Paper or Ask Questions