Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yunhong Wang

YOLC: You Only Look Clusters for Tiny Object Detection in Aerial Images

Apr 09, 2024

Chenguang Liu, Guangshuai Gao, Ziyue Huang, Zhenghui Hu, Qingjie Liu, Yunhong Wang

Figure 1 for YOLC: You Only Look Clusters for Tiny Object Detection in Aerial Images

Figure 2 for YOLC: You Only Look Clusters for Tiny Object Detection in Aerial Images

Figure 3 for YOLC: You Only Look Clusters for Tiny Object Detection in Aerial Images

Figure 4 for YOLC: You Only Look Clusters for Tiny Object Detection in Aerial Images

Abstract:Detecting objects from aerial images poses significant challenges due to the following factors: 1) Aerial images typically have very large sizes, generally with millions or even hundreds of millions of pixels, while computational resources are limited. 2) Small object size leads to insufficient information for effective detection. 3) Non-uniform object distribution leads to computational resource wastage. To address these issues, we propose YOLC (You Only Look Clusters), an efficient and effective framework that builds on an anchor-free object detector, CenterNet. To overcome the challenges posed by large-scale images and non-uniform object distribution, we introduce a Local Scale Module (LSM) that adaptively searches cluster regions for zooming in for accurate detection. Additionally, we modify the regression loss using Gaussian Wasserstein distance (GWD) to obtain high-quality bounding boxes. Deformable convolution and refinement methods are employed in the detection head to enhance the detection of small objects. We perform extensive experiments on two aerial image datasets, including Visdrone2019 and UAVDT, to demonstrate the effectiveness and superiority of our proposed approach.

* accepted to TITS

Via

Access Paper or Ask Questions

Generic Knowledge Boosted Pre-training For Remote Sensing Images

Jan 21, 2024

Ziyue Huang, Mingming Zhang, Yuan Gong, Qingjie Liu, Yunhong Wang

Figure 1 for Generic Knowledge Boosted Pre-training For Remote Sensing Images

Figure 2 for Generic Knowledge Boosted Pre-training For Remote Sensing Images

Figure 3 for Generic Knowledge Boosted Pre-training For Remote Sensing Images

Figure 4 for Generic Knowledge Boosted Pre-training For Remote Sensing Images

Abstract:Deep learning models are essential for scene classification, change detection, land cover segmentation, and other remote sensing image understanding tasks. Most backbones of existing remote sensing deep learning models are typically initialized by pre-trained weights obtained from ImageNet pre-training (IMP). However, domain gaps exist between remote sensing images and natural images (e.g., ImageNet), making deep learning models initialized by pre-trained weights of IMP perform poorly for remote sensing image understanding. Although some pre-training methods are studied in the remote sensing community, current remote sensing pre-training methods face the problem of vague generalization by only using remote sensing images. In this paper, we propose a novel remote sensing pre-training framework, Generic Knowledge Boosted Remote Sensing Pre-training (GeRSP), to learn robust representations from remote sensing and natural images for remote sensing understanding tasks. GeRSP contains two pre-training branches: (1) A self-supervised pre-training branch is adopted to learn domain-related representations from unlabeled remote sensing images. (2) A supervised pre-training branch is integrated into GeRSP for general knowledge learning from labeled natural images. Moreover, GeRSP combines two pre-training branches using a teacher-student architecture to simultaneously learn representations with general and special knowledge, which generates a powerful pre-trained model for deep learning model initialization. Finally, we evaluate GeRSP and other remote sensing pre-training methods on three downstream tasks, i.e., object detection, semantic segmentation, and scene classification. The extensive experimental results consistently demonstrate that GeRSP can effectively learn robust representations in a unified manner, improving the performance of remote sensing downstream tasks.

* 14 pages, 6 figures

Via

Access Paper or Ask Questions

Understanding Heterophily for Graph Neural Networks

Jan 17, 2024

Junfu Wang, Yuanfang Guo, Liang Yang, Yunhong Wang

Abstract:Graphs with heterophily have been regarded as challenging scenarios for Graph Neural Networks (GNNs), where nodes are connected with dissimilar neighbors through various patterns. In this paper, we present theoretical understandings of the impacts of different heterophily patterns for GNNs by incorporating the graph convolution (GC) operations into fully connected networks via the proposed Heterophilous Stochastic Block Models (HSBM), a general random graph model that can accommodate diverse heterophily patterns. Firstly, we show that by applying a GC operation, the separability gains are determined by two factors, i.e., the Euclidean distance of the neighborhood distributions and $\sqrt{\mathbb{E}\left[\operatorname{deg}\right]}$, where $\mathbb{E}\left[\operatorname{deg}\right]$ is the averaged node degree. It reveals that the impact of heterophily on classification needs to be evaluated alongside the averaged node degree. Secondly, we show that the topological noise has a detrimental impact on separability, which is equivalent to degrading $\mathbb{E}\left[\operatorname{deg}\right]$. Finally, when applying multiple GC operations, we show that the separability gains are determined by the normalized distance of the $l$-powered neighborhood distributions. It indicates that the nodes still possess separability as $l$ goes to infinity in a wide range of regimes. Extensive experiments on both synthetic and real-world data verify the effectiveness of our theory.

Via

Access Paper or Ask Questions

Towards Generalizable Referring Image Segmentation via Target Prompt and Visual Coherence

Dec 01, 2023

Yajie Liu, Pu Ge, Haoxiang Ma, Shichao Fan, Qingjie Liu, Di Huang, Yunhong Wang

Figure 1 for Towards Generalizable Referring Image Segmentation via Target Prompt and Visual Coherence

Figure 2 for Towards Generalizable Referring Image Segmentation via Target Prompt and Visual Coherence

Figure 3 for Towards Generalizable Referring Image Segmentation via Target Prompt and Visual Coherence

Figure 4 for Towards Generalizable Referring Image Segmentation via Target Prompt and Visual Coherence

Abstract:Referring image segmentation (RIS) aims to segment objects in an image conditioning on free-from text descriptions. Despite the overwhelming progress, it still remains challenging for current approaches to perform well on cases with various text expressions or with unseen visual entities, limiting its further application. In this paper, we present a novel RIS approach, which substantially improves the generalization ability by addressing the two dilemmas mentioned above. Specially, to deal with unconstrained texts, we propose to boost a given expression with an explicit and crucial prompt, which complements the expression in a unified context, facilitating target capturing in the presence of linguistic style changes. Furthermore, we introduce a multi-modal fusion aggregation module with visual guidance from a powerful pretrained model to leverage spatial relations and pixel coherences to handle the incomplete target masks and false positive irregular clumps which often appear on unseen visual entities. Extensive experiments are conducted in the zero-shot cross-dataset settings and the proposed approach achieves consistent gains compared to the state-of-the-art, e.g., 4.15\%, 5.45\%, and 4.64\% mIoU increase on RefCOCO, RefCOCO+ and ReferIt respectively, demonstrating its effectiveness. Additionally, the results on GraspNet-RIS show that our approach also generalizes well to new scenarios with large domain shifts.

* 7 pages, 4 figures

Via

Access Paper or Ask Questions

DUA-DA: Distillation-based Unbiased Alignment for Domain Adaptive Object Detection

Nov 17, 2023

Yongchao Feng, Shiwei Li, Yingjie Gao, Ziyue Huang, Yanan Zhang, Qingjie Liu, Yunhong Wang

Figure 1 for DUA-DA: Distillation-based Unbiased Alignment for Domain Adaptive Object Detection

Figure 2 for DUA-DA: Distillation-based Unbiased Alignment for Domain Adaptive Object Detection

Figure 3 for DUA-DA: Distillation-based Unbiased Alignment for Domain Adaptive Object Detection

Figure 4 for DUA-DA: Distillation-based Unbiased Alignment for Domain Adaptive Object Detection

Abstract:Though feature-alignment based Domain Adaptive Object Detection (DAOD) have achieved remarkable progress, they ignore the source bias issue, i.e. the aligned features are more favorable towards the source domain, leading to a sub-optimal adaptation. Furthermore, the presence of domain shift between the source and target domains exacerbates the problem of inconsistent classification and localization in general detection pipelines. To overcome these challenges, we propose a novel Distillation-based Unbiased Alignment (DUA) framework for DAOD, which can distill the source features towards a more balanced position via a pre-trained teacher model during the training process, alleviating the problem of source bias effectively. In addition, we design a Target-Relevant Object Localization Network (TROLN), which can mine target-related knowledge to produce two classification-free metrics (IoU and centerness). Accordingly, we implement a Domain-aware Consistency Enhancing (DCE) strategy that utilizes these two metrics to further refine classification confidences, achieving a harmonization between classification and localization in cross-domain scenarios. Extensive experiments have been conducted to manifest the effectiveness of this method, which consistently improves the strong baseline by large margins, outperforming existing alignment-based works.

* 10pages,5 figures

Via

Access Paper or Ask Questions

ActiveDC: Distribution Calibration for Active Finetuning

Nov 15, 2023

Wenshuai Xu, Zhenhui Hu, Yu Lu, Jinzhou Meng, Qingjie Liu, Yunhong Wang

Abstract:The pretraining-finetuning paradigm has gained popularity in various computer vision tasks. In this paradigm, the emergence of active finetuning arises due to the abundance of large-scale data and costly annotation requirements. Active finetuning involves selecting a subset of data from an unlabeled pool for annotation, facilitating subsequent finetuning. However, the use of a limited number of training samples can lead to a biased distribution, potentially resulting in model overfitting. In this paper, we propose a new method called ActiveDC for the active finetuning tasks. Firstly, we select samples for annotation by optimizing the distribution similarity between the subset to be selected and the entire unlabeled pool in continuous space. Secondly, we calibrate the distribution of the selected samples by exploiting implicit category information in the unlabeled pool. The feature visualization provides an intuitive sense of the effectiveness of our approach to distribution calibration. We conducted extensive experiments on three image classification datasets with different sampling ratios. The results indicate that ActiveDC consistently outperforms the baseline performance in all image classification tasks. The improvement is particularly significant when the sampling ratio is low, with performance gains of up to 10%. Our code will be released.

* 10 pages, 5 figures

Via

Access Paper or Ask Questions

Learning Historical Status Prompt for Accurate and Robust Visual Tracking

Nov 03, 2023

Wenrui Cai, Qingjie Liu, Yunhong Wang

Figure 1 for Learning Historical Status Prompt for Accurate and Robust Visual Tracking

Figure 2 for Learning Historical Status Prompt for Accurate and Robust Visual Tracking

Figure 3 for Learning Historical Status Prompt for Accurate and Robust Visual Tracking

Figure 4 for Learning Historical Status Prompt for Accurate and Robust Visual Tracking

Abstract:Most trackers perform template and search region similarity matching to find the most similar object to the template during tracking. However, they struggle to make prediction when the target appearance changes due to the limited historical information introduced by roughly cropping the current search region based on the predicted result of previous frame. In this paper, we identify that the central impediment to improving the performance of existing trackers is the incapacity to integrate abundant and effective historical information. To address this issue, we propose a Historical Information Prompter (HIP) to enhance the provision of historical information. We also build HIPTrack upon HIP module. HIP is a plug-and-play module that make full use of search region features to introduce historical appearance information. It also incorporates historical position information by constructing refined mask of the target. HIP is a lightweight module to generate historical information prompts. By integrating historical information prompts, HIPTrack significantly enhances the tracking performance without the need to retrain the backbone. Experimental results demonstrate that our method outperforms all state-of-the-art approaches on LaSOT, LaSOT ext, GOT10k and NfS. Futhermore, HIP module exhibits strong generality and can be seamlessly integrated into trackers to improve tracking performance. The source code and models will be released for further research.

Via

Access Paper or Ask Questions

Improving Multi-Person Pose Tracking with A Confidence Network

Oct 29, 2023

Zehua Fu, Wenhang Zuo, Zhenghui Hu, Qingjie Liu, Yunhong Wang

Figure 1 for Improving Multi-Person Pose Tracking with A Confidence Network

Figure 2 for Improving Multi-Person Pose Tracking with A Confidence Network

Figure 3 for Improving Multi-Person Pose Tracking with A Confidence Network

Figure 4 for Improving Multi-Person Pose Tracking with A Confidence Network

Abstract:Human pose estimation and tracking are fundamental tasks for understanding human behaviors in videos. Existing top-down framework-based methods usually perform three-stage tasks: human detection, pose estimation and tracking. Although promising results have been achieved, these methods rely heavily on high-performance detectors and may fail to track persons who are occluded or miss-detected. To overcome these problems, in this paper, we develop a novel keypoint confidence network and a tracking pipeline to improve human detection and pose estimation in top-down approaches. Specifically, the keypoint confidence network is designed to determine whether each keypoint is occluded, and it is incorporated into the pose estimation module. In the tracking pipeline, we propose the Bbox-revision module to reduce missing detection and the ID-retrieve module to correct lost trajectories, improving the performance of the detection stage. Experimental results show that our approach is universal in human detection and pose estimation, achieving state-of-the-art performance on both PoseTrack 2017 and 2018 datasets.

* Accepted by IEEE Transactions on Multimedia. 11 pages, 5 figures

Via

Access Paper or Ask Questions

Incremental Object Detection with CLIP

Oct 13, 2023

Yupeng He, Ziyue Huang, Qingjie Liu, Yunhong Wang

Figure 1 for Incremental Object Detection with CLIP

Figure 2 for Incremental Object Detection with CLIP

Figure 3 for Incremental Object Detection with CLIP

Figure 4 for Incremental Object Detection with CLIP

Abstract:In the incremental detection task, unlike the incremental classification task, data ambiguity exists due to the possibility of an image having different labeled bounding boxes in multiple continuous learning stages. This phenomenon often impairs the model's ability to learn new classes. However, the forward compatibility of the model is less considered in existing work, which hinders the model's suitability for incremental learning. To overcome this obstacle, we propose to use a language-visual model such as CLIP to generate text feature embeddings for different class sets, which enhances the feature space globally. We then employ the broad classes to replace the unavailable novel classes in the early learning stage to simulate the actual incremental scenario. Finally, we use the CLIP image encoder to identify potential objects in the proposals, which are classified into the background by the model. We modify the background labels of those proposals to known classes and add the boxes to the training set to alleviate the problem of data ambiguity. We evaluate our approach on various incremental learning settings on the PASCAL VOC 2007 dataset, and our approach outperforms state-of-the-art methods, particularly for the new classes.

* 10 pages, 2 figures

Via

Access Paper or Ask Questions

Context-Enhanced Detector For Building Detection From Remote Sensing Images

Oct 11, 2023

Ziyue Huang, Mingming Zhang, Qingjie Liu, Wei Wang, Zhe Dong, Yunhong Wang

Abstract:The field of building detection from remote sensing images has made significant progress, but faces challenges in achieving high-accuracy detection due to the diversity in building appearances and the complexity of vast scenes. To address these challenges, we propose a novel approach called Context-Enhanced Detector (CEDet). Our approach utilizes a three-stage cascade structure to enhance the extraction of contextual information and improve building detection accuracy. Specifically, we introduce two modules: the Semantic Guided Contextual Mining (SGCM) module, which aggregates multi-scale contexts and incorporates an attention mechanism to capture long-range interactions, and the Instance Context Mining Module (ICMM), which captures instance-level relationship context by constructing a spatial relationship graph and aggregating instance features. Additionally, we introduce a semantic segmentation loss based on pseudo-masks to guide contextual information extraction. Our method achieves state-of-the-art performance on three building detection benchmarks, including CNBuilding-9P, CNBuilding-23P, and SpaceNet.

* 12 pages, 7 figures

Via

Access Paper or Ask Questions