In recent years, deep learning based methods have achieved promising performance in standard object detection. However, these methods lack sufficient capability to handle underwater object detection due to two challenges: (1) objects in underwater scenes are usually small and their images are blurry, and (2) images in underwater datasets and real applications contain heterogeneous noise. To address these two problems, we first propose a novel neural network architecture, namely the Sample-WeIghted hyPEr Network (SWIPENet), for small object detection. SWIPENet consists of high-resolution, semantically rich Hyper Feature Maps that significantly improve small object detection accuracy. In addition, we propose a novel sample-weighted loss function that models per-sample weights for SWIPENet, driven by a novel sample re-weighting algorithm, namely Invert Multi-Class Adaboost (IMA), which reduces the influence of noise on the proposed SWIPENet. Experiments on two underwater robot picking contest datasets, URPC2017 and URPC2018, show that the proposed SWIPENet+IMA framework achieves better detection accuracy than several state-of-the-art object detection approaches.
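As a minimal illustration of the sample re-weighting idea (assuming a classification-style loss; the exact SWIPENet detection heads and the precise IMA update rule are not specified in the abstract), a weighted cross-entropy and an invert-AdaBoost-style update might look as follows. Note that, unlike standard AdaBoost, misclassified samples have their weights *decreased*, treating them as probable noise:

```python
import torch
import torch.nn.functional as F

def sample_weighted_ce(logits, targets, weights):
    """Cross-entropy where each sample contributes according to its weight."""
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    return (weights * per_sample).sum() / weights.sum()

def ima_update(weights, logits, targets, alpha=0.5):
    """Invert-AdaBoost-style step: *decrease* the weight of misclassified
    samples, treating them as probable label noise (standard AdaBoost
    would increase them instead). The form of the update is an assumption."""
    wrong = (logits.argmax(dim=1) != targets).float()
    new_w = weights * torch.exp(-alpha * wrong)    # shrink weights of errors
    return new_w / new_w.sum() * len(new_w)        # renormalise to mean 1
```

A training loop would alternate between minimizing `sample_weighted_ce` and refreshing the weights with `ima_update`.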
Current semantic segmentation models exploit only first-order statistics, while rarely exploring high-order statistics. However, first-order statistics alone are insufficient to support a solid, unified representation. In this paper, we propose the High-Order Paired-ASPP Network to exploit high-order statistics from various feature levels. The network first introduces a High-Order Representation module to extract contextual high-order information from all stages of the backbone; these features provide more semantic clues and discriminative information than first-order ones. Besides, a Paired-ASPP module is proposed to embed the high-order statistics of the early stages into the last stage, further preserving the boundary-related and spatial context carried by low-level features for the final prediction. Our experiments show that the high-order statistics significantly boost performance on confusing objects. Our method achieves competitive performance without bells and whistles on three benchmarks, i.e., Cityscapes, ADE20K, and Pascal-Context, with mIoU scores of 81.6%, 45.3%, and 52.9%, respectively.
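The High-Order Representation module is not detailed in the abstract; covariance (second-order) pooling is one standard way to capture such statistics from a convolutional feature map, sketched below as an assumption of the general idea:

```python
import torch

def second_order_pool(feat):
    """Channel covariance (second-order) pooling of a conv feature map.
    feat: (B, C, H, W) -> (B, C, C) channel covariance."""
    _, _, h, w = feat.shape
    x = feat.flatten(2)                      # (B, C, H*W)
    x = x - x.mean(dim=2, keepdim=True)      # zero-mean per channel
    return x @ x.transpose(1, 2) / (h * w)   # covariance across locations
```

The resulting (B, C, C) tensor captures pairwise channel interactions and can be flattened or projected before being fed to later layers.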
Low-shot learning refers to the ability to recognize unseen objects from very limited labeled training samples, simulating human visual intelligence. Following this concept, we propose a multi-level similarity model (MLSM) to capture a deep, encoded distance metric between support and query samples. Our approach builds on the observation that image similarity learning can be decomposed into three levels: image-level, global-level, and object-level. Once the similarity function is established, MLSM can classify images of unseen classes by computing similarity scores between a limited number of labeled samples and the target images. Furthermore, we conduct 5-way experiments under both 1-shot and 5-shot settings on the Caltech-UCSD dataset. The results demonstrate that the proposed model achieves promising results compared with existing methods in practical applications.
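A hedged sketch of the scoring step: assuming one embedding per level (image, global, object, in the abstract's terms), a query can be scored against a labeled support sample by combining per-level cosine similarities; the fusion actually used by MLSM is not given in the abstract:

```python
import torch
import torch.nn.functional as F

def multi_level_similarity(support_feats, query_feats,
                           level_weights=(1.0, 1.0, 1.0)):
    """Score a query against one support sample by combining cosine
    similarities at each level. Each list holds one 1-D feature vector
    per level; the uniform level weights are an assumption."""
    return sum(w * F.cosine_similarity(s, q, dim=0)
               for w, s, q in zip(level_weights, support_feats, query_feats))
```

Classification then assigns the query the label of the support sample (or class prototype) with the highest combined score.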
Learning to recognize novel visual categories from a few examples is a challenging task for machines in real-world applications. In contrast, humans can discriminate even similar objects with little supervision. This paper addresses the few-shot fine-grained recognition problem. We propose a feature fusion model that extracts the most discriminative features by focusing on key regions. The model utilizes focus-area location to discover the perceptually similar regions among objects, and high-order integration is employed to capture the interaction information among intra-object parts. We also design a Center Neighbor Loss to form a robust embedding space distribution for generating discriminative features. Furthermore, we build a typical fine-grained, few-shot learning dataset, miniPPlankton, from a real-world application in marine ecology. Extensive experiments are carried out to validate the performance of our model. First, the model is evaluated in two challenging experiments based on the miniDogsNet and Caltech-UCSD public datasets. The results demonstrate that our model achieves competitive performance compared with state-of-the-art models. Then, we apply our model to the real-world phytoplankton recognition task; the experimental results show the superiority of the proposed model over others on the miniPPlankton dataset.
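The Center Neighbor Loss itself is not specified in the abstract; the sketch below shows only the generic center-loss idea of pulling embeddings towards learnable class centers, which a neighbor-aware variant would presumably build on:

```python
import torch
import torch.nn as nn

class CenterLossSketch(nn.Module):
    """Pull each embedding towards a learnable center of its class; a
    neighbor-aware variant would additionally involve nearby class
    centers. This is a generic sketch, not the paper's exact loss."""
    def __init__(self, num_classes, dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, dim))

    def forward(self, embeddings, labels):
        return ((embeddings - self.centers[labels]) ** 2).sum(dim=1).mean()
```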
High storage and computational costs prevent deep neural networks from being deployed on resource-constrained devices. Knowledge distillation aims to train a compact student network by transferring knowledge from a larger pre-trained teacher model. However, most existing knowledge distillation methods consider only the teacher's final outputs and ignore the valuable information produced during its training process. In this paper, we propose a new Collaborative Teaching Knowledge Distillation (CTKD) strategy that employs two special teachers. Specifically, one teacher trained from scratch (i.e., the scratch teacher) assists the student step by step using its temporary outputs, forcing the student to follow the optimization path towards the final high-accuracy logits. The other, pre-trained teacher (i.e., the expert teacher) guides the student to focus on critical regions that are more useful for the task. Combining the knowledge from these two teachers significantly improves the performance of the student network. Experiments on the CIFAR-10, CIFAR-100, SVHN, and Tiny ImageNet datasets verify that the proposed knowledge distillation method is efficient and achieves state-of-the-art performance.
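One plausible reading of the two-teacher objective, written as a single loss (the weighting scheme, temperature, and the use of attention maps to transfer the "critical region" are all assumptions, not the paper's exact formulation):

```python
import torch.nn.functional as F

def ctkd_loss(student_logits, scratch_logits, student_attn, expert_attn,
              labels, T=4.0, alpha=0.5, beta=0.5):
    """Hard-label CE + KL towards the scratch teacher's *current* soft
    logits (step-by-step guidance) + an L2 match to the expert teacher's
    attention map (one way to transfer a 'critical region')."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(scratch_logits / T, dim=1),
                  reduction="batchmean") * T * T
    attn = F.mse_loss(student_attn, expert_attn)
    return ce + alpha * kd + beta * attn
```

Here `scratch_logits` would be refreshed each step as the scratch teacher trains alongside the student, while `expert_attn` comes from the frozen expert teacher.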
Obtaining dense 3D reconstruction at low computational cost is one of the important goals in the field of SLAM. In this paper, we propose a dense 3D reconstruction framework for monocular multispectral video sequences that jointly uses semi-dense SLAM and multispectral photometric stereo. Starting from multispectral video, SLAM (a) reconstructs a semi-dense 3D shape that will be densified; (b) recovers a relatively sparse depth map that is then fed as a prior into optimization-based multispectral photometric stereo for more accurate dense surface normal recovery; and (c) obtains camera poses that are subsequently used for view conversion during fusion, where we combine the relatively sparse point cloud with the dense surface normals using the automated cross-scale fusion method proposed in this paper, yielding a dense point cloud with subtle texture information. Experiments show that our method can effectively obtain denser 3D reconstructions.
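The automated cross-scale fusion is the paper's contribution and is not reproduced here; the sketch below solves the generic version of the same problem, densifying depth by integrating a normal map under sparse depth anchors via linear least squares, with a single weight `lam` standing in for the paper's cross-scale weighting (an orthographic camera is also assumed):

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.linalg import lsqr

def fuse_normals_with_sparse_depth(normals, sparse_depth, mask, lam=10.0):
    """Densify depth from a dense normal map (H, W, 3) subject to sparse
    depth anchors (H, W) where `mask` is True. `lam` weights the anchors."""
    H, W = mask.shape
    idx = lambda y, x: y * W + x
    # surface gradients implied by the normals: dz/dx = -nx/nz, dz/dy = -ny/nz
    p = -normals[..., 0] / np.clip(normals[..., 2], 1e-3, None)
    q = -normals[..., 1] / np.clip(normals[..., 2], 1e-3, None)
    rows, cols, vals, rhs = [], [], [], []
    r = 0
    for y in range(H):                    # z(y, x+1) - z(y, x) = p(y, x)
        for x in range(W - 1):
            rows += [r, r]; cols += [idx(y, x + 1), idx(y, x)]
            vals += [1.0, -1.0]; rhs.append(p[y, x]); r += 1
    for y in range(H - 1):                # z(y+1, x) - z(y, x) = q(y, x)
        for x in range(W):
            rows += [r, r]; cols += [idx(y + 1, x), idx(y, x)]
            vals += [1.0, -1.0]; rhs.append(q[y, x]); r += 1
    for y, x in zip(*np.nonzero(mask)):   # sparse SLAM anchors: z = depth
        rows.append(r); cols.append(idx(y, x))
        vals.append(lam); rhs.append(lam * sparse_depth[y, x]); r += 1
    A = coo_matrix((vals, (rows, cols)), shape=(r, H * W)).tocsr()
    return lsqr(A, np.asarray(rhs))[0].reshape(H, W)
```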
Fine-grained recognition is a challenging task due to the small inter-category variances among visually similar subordinate classes. Most top-performing fine-grained recognition methods leverage object parts for better performance, and therefore require part annotations, which are expensive to obtain. In this paper, we propose a novel cascaded deep CNN detection framework for fine-grained recognition that is trained to detect the whole object without considering parts. However, most current top-performing detection networks use an N+1-class (N object categories plus background) softmax loss, in which the background category, with far more training samples, dominates feature learning, so the learned features are poor for the object categories with fewer samples. To bridge this gap, we introduce a cascaded structure to eliminate the background and exploit a one-vs-rest loss to capture more minute variances among different subordinate categories. Experiments show that our proposed recognition framework achieves performance comparable to state-of-the-art, part-free, fine-grained recognition methods on the CUB-200-2011 Birds dataset. Moreover, our method even outperforms most part-based methods, while requiring no part annotations at the training stage and no annotations at all at the test stage.
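A minimal sketch of the one-vs-rest alternative to the (N+1)-way softmax: each category gets an independent sigmoid, so abundant background samples no longer dominate a shared normalization. The label convention below (-1 marks background) is an assumption:

```python
import torch
import torch.nn.functional as F

def one_vs_rest_loss(logits, targets, num_classes):
    """Per-class sigmoid loss. `logits`: (B, N) scores for the N object
    categories; `targets`: (B,) labels, where -1 (assumed convention)
    marks background and becomes the all-negative target row."""
    onehot = F.one_hot(targets.clamp(min=0), num_classes).float()
    onehot[targets < 0] = 0.0
    return F.binary_cross_entropy_with_logits(logits, onehot)
```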
This letter adopts long short-term memory (LSTM) to predict sea surface temperature (SST); to our knowledge, this is the first attempt to use a recurrent neural network for SST prediction, making daily predictions one week and one month ahead. We formulate SST prediction as a time series regression problem. LSTM is a special kind of recurrent neural network that introduces gate mechanisms into the vanilla RNN to prevent the vanishing or exploding gradient problem; it has a strong ability to model temporal relationships in time series data and handles long-term dependencies well. The proposed network architecture is composed of two kinds of layers: an LSTM layer and a fully connected dense layer. The LSTM layer models the time series relationship, and the fully connected layer maps the output of the LSTM layer to a final prediction. We explore the optimal settings of this architecture through experiments and report the accuracy over the coastal seas of China to confirm the effectiveness of the proposed method. In addition, we demonstrate the model's online-updating capability.
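The abstract fully specifies the layer types, so a faithful minimal sketch is possible; the hidden size and prediction horizon below are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SSTPredictor(nn.Module):
    """LSTM layer + fully connected layer, as described in the letter.
    Hidden size and horizon are illustrative, not the reported settings."""
    def __init__(self, input_dim=1, hidden_dim=32, horizon=7):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, horizon)

    def forward(self, x):                 # x: (batch, seq_len, input_dim)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])      # (batch, horizon) future SSTs
```

For example, `SSTPredictor(horizon=7)(torch.randn(4, 30, 1))` maps a 30-day temperature history at 4 locations to one-week-ahead daily forecasts.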
With the increasing demands of virtual reality applications such as 3D films, virtual human-machine interaction, and virtual agents, 3D human face analysis is considered more and more important as a fundamental step for these tasks. Owing to the information provided by the additional dimension, 3D facial reconstruction enables the aforementioned tasks to be achieved with higher accuracy than 2D facial analysis. The denser the 3D facial model, the more information it can provide. However, most existing dense 3D facial reconstruction methods require complicated processing and high system cost. To this end, this paper presents a novel method that simplifies dense 3D facial reconstruction by employing only one frame of depth data obtained with an off-the-shelf RGB-D sensor. Experiments show competitive results on real-world data.
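The reconstruction pipeline itself is not detailed in the abstract; the sketch below covers only the universally shared first step, back-projecting the single depth frame into a point cloud with the sensor's pinhole intrinsics (the subsequent dense face-model fitting is not shown):

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth frame to an (M, 3) point cloud with
    the pinhole model; (fx, fy, cx, cy) are the RGB-D sensor intrinsics."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    pts = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]             # drop invalid zero-depth pixels
```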
Procedural textures are normally generated from mathematical models with parameters carefully selected by experienced users. For naive users, however, a more intuitive way to obtain a desired texture is to provide semantic descriptions such as "regular," "lacelike," and "repetitive," and have a procedural model with proper parameters automatically suggested to generate the corresponding textures; it is far less practical for such users to learn the mathematical models and tune parameters by examining large numbers of generated textures. In this study, we propose a novel framework that generates procedural textures according to user-defined semantic descriptions, establishing a mapping between procedural models and semantic texture descriptions. First, based on a vocabulary of semantic attributes collected from psychophysical experiments, a multi-label learning method is employed to annotate a large number of textures with semantic attributes, forming a semantic procedural texture dataset. Then, we derive a low-dimensional semantic space in which the semantic descriptions can be separated from one another. Finally, given a set of semantic descriptions, the diverse properties of the samples in the semantic space lead the framework to an appropriate generation model with appropriate parameters to produce the desired texture. The experimental results show that the proposed framework is effective and that the generated textures closely correlate with the input semantic descriptions.
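The final suggestion step can be sketched as a nearest-neighbour lookup in the learned semantic space; how the embedding is built (e.g., a projection of the multi-label annotations) is assumed here, not the paper's exact construction:

```python
import numpy as np

def suggest_model(query_embedding, model_embeddings, model_ids):
    """Return the procedural model whose annotated textures lie closest to
    the query description in the semantic space (nearest-centroid lookup).
    `model_embeddings`: (K, D), one centroid per procedural model;
    `query_embedding`: (D,), the embedded user description."""
    d = np.linalg.norm(model_embeddings - query_embedding[None, :], axis=1)
    return model_ids[int(np.argmin(d))]
```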