In the past decade, object detection has achieved significant progress in natural images but not in aerial images, due to the massive variations in the scale and orientation of objects caused by the bird's-eye view of aerial images. More importantly, the lack of large-scale benchmarks becomes a major obstacle to the development of object detection in aerial images (ODAI). In this paper, we present a large-scale Dataset of Object deTection in Aerial images (DOTA) and comprehensive baselines for ODAI. The proposed DOTA dataset contains 1,793,658 object instances of 18 categories of oriented-bounding-box annotations collected from 11,268 aerial images. Based on this large-scale and well-annotated dataset, we build baselines covering 10 state-of-the-art algorithms with over 70 configurations, where the speed and accuracy performances of each model have been evaluated. Furthermore, we provide a uniform code library for ODAI and build a website for testing and evaluating different algorithms. Previous challenges run on DOTA have attracted more than 1300 teams worldwide. We believe that the expanded large-scale DOTA dataset, the extensive baselines, the code library and the challenges can facilitate the designs of robust algorithms and reproducible research on the problem of object detection in aerial images.
Semantic segmentation for aerial platforms has been one of the fundamental scene understanding task for the earth observation. Most of the semantic segmentation research focused on scenes captured in nadir view, in which objects have relatively smaller scale variation compared with scenes captured in oblique view. The huge scale variation of objects in oblique images limits the performance of deep neural networks (DNN) that process images in a single scale fashion. In order to tackle the scale variation issue, in this paper, we propose the novel bidirectional multi-scale attention networks, which fuse features from multiple scales bidirectionally for more adaptive and effective feature extraction. The experiments are conducted on the UAVid2020 dataset and have shown the effectiveness of our method. Our model achieved the state-of-the-art (SOTA) result with a mean intersection over union (mIoU) score of 70.80%.
This paper presents a context-aware tracing strategy (CATS) for crisp edge detection with deep edge detectors, based on an observation that the localization ambiguity of deep edge detectors is mainly caused by the mixing phenomenon of convolutional neural networks: feature mixing in edge classification and side mixing during fusing side predictions. The CATS consists of two modules: a novel tracing loss that performs feature unmixing by tracing boundaries for better side edge learning, and a context-aware fusion block that tackles the side mixing by aggregating the complementary merits of learned side edges. Experiments demonstrate that the proposed CATS can be integrated into modern deep edge detectors to improve localization accuracy. With the vanilla VGG16 backbone, in terms of BSDS500 dataset, our CATS improves the F-measure (ODS) of the RCF and BDCN deep edge detectors by 12% and 6% respectively when evaluating without using the morphological non-maximal suppression scheme for edge detection.
Given two multi-temporal aerial images, semantic change detection aims to locate the land-cover variations and identify their categories with pixel-wise boundaries. The problem has demonstrated promising potentials in many earth vision related tasks, such as precise urban planning and natural resource management. Existing state-of-the-art algorithms mainly identify the changed pixels through symmetric modules, which would suffer from categorical ambiguity caused by changes related to totally different land-cover distributions. In this paper, we present an asymmetric siamese network (ASN) to locate and identify semantic changes through feature pairs obtained from modules of widely different structures, which involve different spatial ranges and quantities of parameters to factor in the discrepancy across different land-cover distributions. To better train and evaluate our model, we create a large-scale well-annotated SEmantic Change detectiON Dataset (SECOND), while an adaptive threshold learning (ATL) module and a separated kappa (SeK) coefficient are proposed to alleviate the influences of label imbalance in model training and evaluation. The experimental results demonstrate that the proposed model can stably outperform the state-of-the-art algorithms with different encoder backbones.
Denoising images contaminated by the mixture of additive white Gaussian noise (AWGN) and impulse noise (IN) is an essential but challenging problem. The presence of impulsive disturbances inevitably affects the distribution of noises and thus largely degrades the performance of traditional AWGN denoisers. Existing methods target to compensate the effects of IN by introducing a weighting matrix, which, however, is lack of proper priori and thus hard to be accurately estimated. To address this problem, we exploit the Pareto distribution as the priori of the weighting matrix, based on which an accurate and robust weight estimator is proposed for mixed noise removal. Particularly, a relatively small portion of pixels are assumed to be contaminated with IN, which should have weights with small values and then be penalized out. This phenomenon can be properly described by the Pareto distribution of type 1. Therefore, armed with the Pareto distribution, we formulate the problem of mixed noise removal in the Bayesian framework, where nonlocal self-similarity priori is further exploited by adopting nonlocal low rank approximation. Compared to existing methods, the proposed method can estimate the weighting matrix adaptively, accurately, and robust for different level of noises, thus can boost the denoising performance. Experimental results on widely used image datasets demonstrate the superiority of our proposed method to the state-of-the-arts.
The past decade has witnessed significant progress on detecting objects in aerial images that are often distributed with large scale variations and arbitrary orientations. However most of existing methods rely on heuristically defined anchors with different scales, angles and aspect ratios and usually suffer from severe misalignment between anchor boxes and axis-aligned convolutional features, which leads to the common inconsistency between the classification score and localization accuracy. To address this issue, we propose a Single-shot Alignment Network (S$^2$A-Net) consisting of two modules: a Feature Alignment Module (FAM) and an Oriented Detection Module (ODM). The FAM can generate high-quality anchors with an Anchor Refinement Network and adaptively align the convolutional features according to the anchor boxes with a novel Alignment Convolution. The ODM first adopts active rotating filters to encode the orientation information and then produces orientation-sensitive and orientation-invariant features to alleviate the inconsistency between classification score and localization accuracy. Besides, we further explore the approach to detect objects in large-size images, which leads to a better trade-off between speed and accuracy. Extensive experiments demonstrate that our method can achieve state-of-the-art performance on two commonly used aerial objects datasets (i.e., DOTA and HRSC2016) while keeping high efficiency. The code is available at https://github.com/csuhan/s2anet.
With extremely high temporal resolution, event cameras have a large potential for robotics and computer vision. However, their asynchronous imaging mechanism often aggravates the measurement sensitivity to noises and brings a physical burden to increase the image spatial resolution. To recover high-quality intensity images, one should address both denoising and super-resolution problems for event cameras. Since events depict brightness changes, with the enhanced degeneration model by the events, the clear and sharp high-resolution latent images can be recovered from the noisy, blurry and low-resolution intensity observations. Exploiting the framework of sparse learning, the events and the low-resolution intensity observations can be jointly considered. Based on this, we propose an explainable network, an event-enhanced sparse learning network (eSL-Net), to recover the high-quality images from event cameras. After training with a synthetic dataset, the proposed eSL-Net can largely improve the performance of the state-of-the-art by 7-12 dB. Furthermore, without additional training process, the proposed eSL-Net can be easily extended to generate continuous frames with frame-rate as high as the events.
Deep convolutional neural networks (CNN) have been applied for image dehazing tasks, where the residual network (ResNet) is often adopted as the basic component to avoid the vanishing gradient problem. Recently, many works indicate that the ResNet can be considered as the explicit Euler forward approximation of an ordinary differential equation (ODE). In this paper, we extend the explicit forward approximation to the implicit backward counterpart, which can be realized via a recursive neural network, named IM-block. Given that, we propose an efficient end-to-end multi-level implicit network (MI-Net) for the single image dehazing problem. Moreover, multi-level fusing (MLF) mechanism and residual channel attention block (RCA-block) are adopted to boost performance of our network. Experiments on several dehazing benchmark datasets demonstrate that our method outperforms existing methods and achieves the state-of-the-art performance.
This paper addresses the problem of semi-supervised transfer learning with limited cross-modality data in remote sensing. A large amount of multi-modal earth observation images, such as multispectral imagery (MSI) or synthetic aperture radar (SAR) data, are openly available on a global scale, enabling parsing global urban scenes through remote sensing imagery. However, their ability in identifying materials (pixel-wise classification) remains limited, due to the noisy collection environment and poor discriminative information as well as limited number of well-annotated training images. To this end, we propose a novel cross-modal deep-learning framework, called X-ModalNet, with three well-designed modules: self-adversarial module, interactive learning module, and label propagation module, by learning to transfer more discriminative information from a small-scale hyperspectral image (HSI) into the classification task using a large-scale MSI or SAR data. Significantly, X-ModalNet generalizes well, owing to propagating labels on an updatable graph constructed by high-level features on the top of the network, yielding semi-supervised cross-modality learning. We evaluate X-ModalNet on two multi-modal remote sensing datasets (HSI-MSI and HSI-SAR) and achieve a significant improvement in comparison with several state-of-the-art methods.