Objects in aerial images usually have arbitrary orientations and are densely located over the ground, making them extremely challenge to be detected. Many recently developed methods attempt to solve these issues by estimating an extra orientation parameter and placing dense anchors, which will result in high model complexity and computational costs. In this paper, we propose an arbitrary-oriented region proposal network (AO-RPN) to generate oriented proposals transformed from horizontal anchors. The AO-RPN is very efficient with only a few amounts of parameters increase than the original RPN. Furthermore, to obtain accurate bounding boxes, we decouple the detection task into multiple subtasks and propose a multi-head network to accomplish them. Each head is specially designed to learn the features optimal for the corresponding task, which allows our network to detect objects accurately. We name it MRDet short for Multi-head Rotated object Detector for convenience. We test the proposed MRDet on two challenging benchmarks, i.e., DOTA and HRSC2016, and compare it with several state-of-the-art methods. Our method achieves very promising results which clearly demonstrate its effectiveness.
LiDAR-based 3D object detection is an important task for autonomous driving and current approaches suffer from sparse and partial point clouds of distant and occluded objects. In this paper, we propose a novel two-stage approach, namely PC-RGNN, dealing with such challenges by two specific solutions. On the one hand, we introduce a point cloud completion module to recover high-quality proposals of dense points and entire views with original structures preserved. On the other hand, a graph neural network module is designed, which comprehensively captures relations among points through a local-global attention mechanism as well as multi-scale graph based context aggregation, substantially strengthening encoded features. Extensive experiments on the KITTI benchmark show that the proposed approach outperforms the previous state-of-the-art baselines by remarkable margins, highlighting its effectiveness.
Hyperspectral image classification (HIC) is an important but challenging task, and a problem that limits the algorithmic development in this field is that the ground truths of hyperspectral images (HSIs) are extremely hard to obtain. Recently a handful of HIC methods are developed based on the graph convolution networks (GCNs), which effectively relieves the scarcity of labeled data for deep learning based HIC methods. To further lift the classification performance, in this work we propose a graph convolution network (GCN) based framework for HSI classification that uses two clustering operations to better exploit multi-hop node correlations and also effectively reduce graph size. In particular, we first cluster the pixels with similar spectral features into a superpixel and build the graph based on the superpixels of the input HSI. Then instead of performing convolution over this superpixel graph, we further partition it into several sub-graphs by pruning the edges with weak weights, so as to strengthen the correlations of nodes with high similarity. This second round of clustering also further reduces the graph size, thus reducing the computation burden of graph convolution. Experimental results on three widely used benchmark datasets well prove the effectiveness of our proposed framework.
Pan-sharpening aims at fusing a low-resolution (LR) multi-spectral (MS) image and a high-resolution (HR) panchromatic (PAN) image acquired by a satellite to generate an HR MS image. Many deep learning based methods have been developed in the past few years. However, since there are no intended HR MS images as references for learning, almost all of the existing methods down-sample the MS and PAN images and regard the original MS images as targets to form a supervised setting for training. These methods may perform well on the down-scaled images, however, they generalize poorly to the full-resolution images. To conquer this problem, we design an unsupervised framework that is able to learn directly from the full-resolution images without any preprocessing. The model is built based on a novel generative multi-adversarial network. We use a two-stream generator to extract the modality-specific features from the PAN and MS images, respectively, and develop a dual-discriminator to preserve the spectral and spatial information of the inputs when performing fusion. Furthermore, a novel loss function is introduced to facilitate training under the unsupervised setting. Experiments and comparisons with other state-of-the-art methods on GaoFen-2 and QuickBird images demonstrate that the proposed method can obtain much better fusion results on the full-resolution images.
Crowd counting, which towards to accurately count the number of the objects in images, has been attracted more and more attention by researchers recently. However, challenges from severely occlusion, large scale variation, complex background interference and non-uniform density distribution, limit the crowd number estimation accuracy. To mitigate above issues, this paper proposes a novel crowd counting approach based on pyramidal scale module (PSM) and global context module (GCM), dubbed PSCNet. Moreover, a reliable supervision manner combined Bayesian and counting loss (BCL) is utilized to learn the density probability and then computes the count exception at each annotation point. Specifically, PSM is used to adaptively capture multi-scale information, which can identify a fine boundary of crowds with different image scales. GCM is devised with low-complexity and lightweight manner, to make the interactive information across the channels of the feature maps more efficient, meanwhile guide the model to select more suitable scales generated from PSM. Furthermore, BL is leveraged to construct a credible density contribution probability supervision manner, which relieves non-uniform density distribution in crowds to a certain extent. Extensive experiments on four crowd counting datasets show the effectiveness and superiority of the proposed model. Additionally, some experiments extended on a remote sensing object counting (RSOC) dataset further validate the generalization ability of the model. Our resource code will be released upon the acceptance of this work.
Graph Neural Networks (GNNs) have achieved tremendous success in graph representation learning. Unfortunately, current GNNs usually rely on loading the entire attributed graph into the network for processing. This implicit assumption may not be satisfied with limited memory resources, especially when the attributed graph is large. In this paper, we propose a Binary Graph Convolutional Network (Bi-GCN), which binarizes both the network parameters and input node features. Besides, the original matrix multiplications are revised to binary operations for accelerations. According to the theoretical analysis, our Bi-GCN can reduce the memory consumption by ~31x for both the network parameters and input data, and accelerate the inference speed by ~53x. Extensive experiments have demonstrated that our Bi-GCN can give a comparable prediction performance compared to the full-precision baselines. Besides, our binarization approach can be easily applied to other GNNs, which has been verified in the experiments.
Coarsely-labeled semantic segmentation annotations are easy to obtain, but therefore bear the risk of losing edge details and introducing background noise. Though they are usually used as a supplement to the finely-labeled ones, in this paper, we attempt to train a model only using these coarse annotations, and improve the model performance with a noise-robust reweighting strategy. Specifically, the proposed confidence indicator makes it possible to design a reweighting strategy that simultaneously mines hard samples and alleviates noisy labels for the coarse annotation. Besides, the optimal reweighting strategy can be automatically derived by our Adversarial Weight Assigning Module (AWAM) with only 53 learnable parameters. Moreover, a rigorous proof of the convergence of AWAM is given. Experiments on standard datasets show that our proposed reweighting strategy can bring consistent performance improvements for both coarse annotations and fine annotations. In particular, built on top of DeeplabV3+, we improve the mIoU on Cityscapes Coarse dataset (coarsely-labeled) and ADE20K (finely-labeled) by 2.21 and 0.91, respectively.
Object counting, whose aim is to estimate the number of objects from a given image, is an important and challenging computation task. Significant efforts have been devoted to addressing this problem and achieved great progress, yet counting the number of ground objects from remote sensing images is barely studied. In this paper, we are interested in counting dense objects from remote sensing images. Compared with object counting in a natural scene, this task is challenging in the following factors: large scale variation, complex cluttered background, and orientation arbitrariness. More importantly, the scarcity of data severely limits the development of research in this field. To address these issues, we first construct a large-scale object counting dataset with remote sensing images, which contains four important geographic objects: buildings, crowded ships in harbors, large-vehicles and small-vehicles in parking lots. We then benchmark the dataset by designing a novel neural network that can generate a density map of an input image. The proposed network consists of three parts namely attention module, scale pyramid module and deformable convolution module to attack the aforementioned challenging factors. Extensive experiments are performed on the proposed dataset and one crowd counting datset, which demonstrate the challenges of the proposed dataset and the superiority and effectiveness of our method compared with state-of-the-art methods.
Co-saliency detection aims to detect common salient objects from a group of relevant images. Some attempts have been made with the Fully Convolutional Network (FCN) framework and achieve satisfactory detection results. However, due to stacking convolution layers and pooling operation, the boundary details tend to be lost. In addition, existing models often utilize the extracted features without discrimination, leading to redundancy in representation since actually not all features are helpful to the final prediction and some even bring distraction. In this paper, we propose a co-attention module embedded FCN framework, called as Co-Attention FCN (CA-FCN). Specifically, the co-attention module is plugged into the high-level convolution layers of FCN, which can assign larger attention weights on the common salient objects and smaller ones on the background and uncommon distractors to boost final detection performance. Extensive experiments on three popular co-saliency benchmark datasets demonstrate the superiority of the proposed CA-FCN, which outperforms state-of-the-arts in most cases. Besides, the effectiveness of our new co-attention module is also validated with ablation studies.