Maintaining both high efficiency and high precision is a fundamental challenge in UAV tracking, owing to the constraints of computing resources, battery capacity, and the maximum payload of the UAV. Discriminative correlation filter (DCF)-based trackers can run efficiently on a single CPU but offer inferior precision. Lightweight deep learning (DL)-based trackers can strike a good balance between efficiency and precision, but their performance gains are limited by the compression rate: a high compression rate often leads to poorly discriminative representations. To this end, this paper aims to enhance the discriminative power of feature representations from a new feature-learning perspective. Specifically, we learn more discriminative representations with contrastive instances for UAV tracking in a simple yet effective manner, which requires no manual annotations and allows a lightweight model to be developed and deployed. We are the first to explore contrastive learning for UAV tracking. Extensive experiments on four UAV benchmarks, namely UAV123@10fps, DTB70, UAVDT, and VisDrone2018, show that the proposed DRCI tracker significantly outperforms state-of-the-art UAV tracking methods.
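For illustration, a minimal InfoNCE-style contrastive loss over instance embeddings is sketched below, a common way to realize contrastive instance learning; the tensor shapes, temperature, and the exact formulation used by DRCI are assumptions, as the abstract does not specify them.

```python
# Minimal InfoNCE-style contrastive loss over instance embeddings.
# The exact formulation used by DRCI is not given in the abstract; this is
# only an illustrative sketch with assumed tensor shapes.
import torch
import torch.nn.functional as F

def contrastive_instance_loss(anchor, positive, negatives, temperature=0.07):
    """anchor, positive: (B, D); negatives: (B, K, D)."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    pos_logit = (anchor * positive).sum(-1, keepdim=True)        # (B, 1)
    neg_logits = torch.einsum("bd,bkd->bk", anchor, negatives)   # (B, K)
    logits = torch.cat([pos_logit, neg_logits], dim=1) / temperature
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)  # the positive is class 0
```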
This paper proposes an expressive singing voice synthesis system that introduces explicit vibrato modeling and a latent energy representation. Vibrato is essential to the naturalness of synthesized singing because it is an inherent characteristic of the human voice. Hence, we introduce a deep learning-based vibrato model that controls the likeliness, rate, depth, and phase of vibrato, where vibrato likeliness represents the probability that vibrato is present and helps improve the naturalness of the singing voice. Since existing singing corpora contain no annotated vibrato-likeliness labels, we adopt a novel method to label vibrato likeliness automatically. Meanwhile, the power spectrogram of audio contains rich information that can improve the expressiveness of singing, so we propose an autoencoder-based latent energy bottleneck feature for expressive singing voice synthesis. Experimental results on the open dataset NUS48E show that both the vibrato modeling and the latent energy representation significantly improve the expressiveness of the synthesized singing voice. Audio samples are available on the demo website.
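As a hedged illustration of what frame-level vibrato parameters can do to an F0 contour, the sketch below applies a sinusoidal pitch modulation gated by the predicted vibrato likeliness; the parameterization (depth in cents, multiplicative modulation, 5 ms hop) is assumed, not taken from the paper.

```python
# Illustrative sketch of how predicted vibrato parameters (likeliness, rate,
# depth, phase) could modulate an F0 contour; the paper's actual
# parameterization is not specified in the abstract.
import numpy as np

def apply_vibrato(f0_hz, likeliness, rate_hz, depth_cents, phase, hop_s=0.005):
    """f0_hz: (T,) frame-level F0 in Hz; other args: (T,) frame-level vibrato params."""
    t = np.arange(len(f0_hz)) * hop_s
    # Sinusoidal pitch modulation in cents, gated by the vibrato likeliness.
    cents = likeliness * depth_cents * np.sin(2 * np.pi * rate_hz * t + phase)
    return f0_hz * 2.0 ** (cents / 1200.0)
```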
Semantic segmentation is important for scene understanding. To handle adverse illumination conditions in natural images, thermal infrared (TIR) images are introduced. Most existing RGB-T semantic segmentation methods follow one of three cross-modal fusion paradigms, i.e., encoder fusion, decoder fusion, and feature fusion. Some methods, unfortunately, ignore the properties of RGB and TIR features or the properties of features at different levels. In this paper, we propose a novel feature fusion-based network for RGB-T semantic segmentation, named \emph{LASNet}, which follows three steps: location, activation, and sharpening. The highlight of LASNet is that we fully consider the characteristics of cross-modal features at different levels and accordingly propose three specific modules for better segmentation. Concretely, we propose a Collaborative Location Module (CLM) for high-level semantic features, aiming to locate all potential objects; a Complementary Activation Module (CAM) for middle-level features, aiming to activate the exact regions of different objects; and an Edge Sharpening Module (ESM) for low-level texture features, aiming to sharpen the edges of objects. Furthermore, in the training phase, we attach a location supervision and an edge supervision after CLM and ESM, respectively, and impose two semantic supervisions in the decoder to facilitate network convergence. Experimental results on two public datasets demonstrate the superiority of our LASNet over relevant state-of-the-art methods. The code and results of our method are available at https://github.com/MathLee/LASNet.
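A schematic sketch of how the three modules could plug into a two-stream encoder and a shared decoder is given below; the internals of CLM, CAM, and ESM and the decoder interface are placeholders assumed purely for illustration.

```python
# Schematic sketch of a location-activation-sharpening pipeline; the module
# internals and the decoder signature are assumptions, not LASNet's design.
import torch.nn as nn

class LASNetSketch(nn.Module):
    def __init__(self, clm, cam, esm, decoder):
        super().__init__()
        self.clm, self.cam, self.esm, self.decoder = clm, cam, esm, decoder

    def forward(self, rgb_feats, tir_feats):
        # rgb_feats / tir_feats: [low, mid, high] feature maps from two encoders.
        loc = self.clm(rgb_feats[2], tir_feats[2])   # locate potential objects
        act = self.cam(rgb_feats[1], tir_feats[1])   # activate object regions
        edge = self.esm(rgb_feats[0], tir_feats[0])  # sharpen object edges
        return self.decoder(loc, act, edge)          # semantic prediction
```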
Human face images usually appear over a wide range of visual scales. Existing face representations pursue the bandwidth to handle scale variation through multi-scale schemes that assemble a finite series of predefined scales. Such a multi-shot scheme brings an inference burden, and the predefined scales inevitably deviate from real data. Instead, learning scale parameters from data and using them for one-shot feature inference is a decent solution. To this end, we reform the conv layer by resorting to scale-space theory and obtain two benefits: 1) the conv layer learns a set of scales from the real data distribution, each fulfilled by a conv kernel; 2) the layer automatically highlights the feature at the channel and location corresponding to the scale of the input pattern and its presence. We then accomplish hierarchical scale attention by stacking the reformed layers, building a novel architecture named SCale AttentioN Conv Neural Network (\textbf{SCAN-CNN}). Applying SCAN-CNN to face recognition, we push the frontier of state-of-the-art performance. The accuracy gain is more evident when the face images are blurry. Meanwhile, as a single-shot scheme, SCAN-CNN is more efficient at inference than multi-shot fusion. A set of tools is provided to ensure fast training of SCAN-CNN and zero increase in inference cost compared with the plain CNN.
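The sketch below illustrates one plausible reading of such a scale-aware conv layer: each branch smooths the input with a learnable Gaussian scale before convolution, and per-location attention weights the branches. The kernel sizes, the attention head, and the separable-blur implementation are assumptions, not the paper's exact design.

```python
# Rough sketch of a scale-attention conv layer: learnable scales realized as
# Gaussian smoothing per branch, plus per-location attention over branches.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleAttentionConv(nn.Module):
    def __init__(self, in_ch, out_ch, num_scales=3, kernel_size=3):
        super().__init__()
        self.log_sigma = nn.Parameter(torch.zeros(num_scales))  # learned scales
        self.convs = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2)
            for _ in range(num_scales))
        self.attn = nn.Conv2d(in_ch, num_scales, 1)  # scale-attention logits

    def gaussian_blur(self, x, sigma, ksize=5):
        # Separable depthwise Gaussian blur with a learnable sigma.
        coords = torch.arange(ksize, device=x.device) - ksize // 2
        g = torch.exp(-(coords.float() ** 2) / (2 * sigma ** 2))
        g = (g / g.sum()).view(1, 1, 1, ksize).expand(x.size(1), 1, 1, ksize)
        x = F.conv2d(x, g, padding=(0, ksize // 2), groups=x.size(1))
        return F.conv2d(x, g.transpose(2, 3), padding=(ksize // 2, 0), groups=x.size(1))

    def forward(self, x):
        w = torch.softmax(self.attn(x), dim=1)  # (B, S, H, W) attention over scales
        outs = [conv(self.gaussian_blur(x, torch.exp(s)))
                for conv, s in zip(self.convs, self.log_sigma)]
        return sum(w[:, i:i + 1] * o for i, o in enumerate(outs))
```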
Human pose estimation aims to localize the keypoints of all people in a scene. Despite promising results, current approaches still face several challenges. Existing top-down methods handle each person individually, without modeling the interactions between different people or with the scene they are situated in; consequently, the performance of human detection degrades when severe occlusion occurs. On the other hand, existing bottom-up methods consider all people at the same time and capture global knowledge of the entire image, but they are less accurate than top-down methods due to scale variation. To address these problems, we propose a novel Dual-Pipeline Integrated Transformer (DPIT) that integrates the top-down and bottom-up pipelines to explore visual clues from different receptive fields and achieve their complementarity. Specifically, DPIT consists of two branches: the bottom-up branch processes the whole image to capture global visual information, while the top-down branch extracts local feature representations from single-person bounding boxes. The feature representations from the two branches are then fed into a transformer encoder to fuse global and local knowledge interactively. Moreover, we define keypoint queries to explore both full-scene and single-person posture visual clues, realizing the mutual complementarity of the two pipelines. To the best of our knowledge, this is one of the first works to integrate the bottom-up and top-down pipelines with transformers for human pose estimation. Extensive experiments on the COCO and MPII datasets demonstrate that DPIT achieves performance comparable to state-of-the-art methods.
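A minimal sketch of the fusion step is shown below, assuming tokenized features from both branches and learnable keypoint queries concatenated into a standard transformer encoder; the dimensions, prediction head, and readout are illustrative assumptions rather than the paper's exact configuration.

```python
# Schematic fusion of bottom-up (whole-image) and top-down (per-person) tokens
# with learnable keypoint queries in a transformer encoder.
import torch
import torch.nn as nn

class DualPipelineFusion(nn.Module):
    def __init__(self, dim=256, num_keypoints=17, num_layers=4, num_heads=8):
        super().__init__()
        self.kpt_queries = nn.Parameter(torch.randn(num_keypoints, dim))
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(dim, 2)  # (x, y) per keypoint, as an illustration

    def forward(self, global_tokens, local_tokens):
        # global_tokens: (B, Ng, dim) from the bottom-up branch (full scene)
        # local_tokens:  (B, Nl, dim) from the top-down branch (person crop)
        B = global_tokens.size(0)
        q = self.kpt_queries.unsqueeze(0).expand(B, -1, -1)
        tokens = torch.cat([q, global_tokens, local_tokens], dim=1)
        out = self.encoder(tokens)
        return self.head(out[:, :q.size(1)])  # read out the keypoint queries
```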
Vascular segmentation extracts blood vessels from images and serves as the basis for diagnosing various diseases, such as ophthalmic diseases. Ophthalmologists often require high-resolution segmentation results for analysis, which imposes a heavy computational load on most existing methods, while methods based on low-resolution input tend to miss tiny vessels or produce discontinuous segmented vessels. To solve these problems, this paper proposes an algorithm named SuperVessel, which produces high-resolution and accurate vessel segmentation from low-resolution input images. We first employ super-resolution as an auxiliary branch that provides potential high-resolution detail features and can be removed in the test phase. Secondly, we propose two modules to enhance the features of the segmentation region of interest: an upsampling with feature decomposition (UFD) module and a feature interaction module (FIM) with a constraining loss that focuses on the features of interest. Extensive experiments on three publicly available datasets demonstrate that SuperVessel segments more tiny vessels and achieves higher segmentation accuracy, improving IoU by over 6% compared with other state-of-the-art algorithms. In addition, SuperVessel is more stable than the other algorithms. We will release the code after the paper is published.
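The auxiliary-branch idea can be sketched as follows, with the encoder, segmentation head, and super-resolution head left as placeholders; the UFD and FIM modules are not reproduced here, and the interface is an assumption for illustration only.

```python
# Sketch of the auxiliary-branch idea: a super-resolution head is trained
# jointly with the segmentation head and simply dropped at inference.
import torch.nn as nn

class SuperVesselSketch(nn.Module):
    def __init__(self, encoder, seg_head, sr_head):
        super().__init__()
        self.encoder, self.seg_head, self.sr_head = encoder, seg_head, sr_head

    def forward(self, lr_image):
        feats = self.encoder(lr_image)
        seg = self.seg_head(feats)            # high-resolution vessel mask
        if self.training:                     # SR branch is only used in training
            return seg, self.sr_head(feats)   # auxiliary high-resolution image
        return seg
```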
While the abuse of deepfake technology has caused serious concern recently, detecting deepfake videos remains a challenge due to the highly photo-realistic synthesis of each frame. Existing image-level approaches often focus on a single frame and ignore the spatiotemporal cues hidden in deepfake videos, resulting in poor generalization and robustness. The key to a video-level detector is to fully exploit the spatiotemporal inconsistency distributed across local facial regions in different frames of deepfake videos. Inspired by this, this paper proposes a simple yet effective patch-level approach that facilitates deepfake video detection via a spatiotemporal dropout transformer. The approach reorganizes each input video into a bag of patches, which is then fed into a vision transformer to achieve a robust representation. Specifically, a spatiotemporal dropout operation is proposed to fully explore patch-level spatiotemporal cues and to serve as an effective data augmentation that further enhances the model's robustness and generalization ability. The operation is flexible and can be easily plugged into existing vision transformers. Extensive experiments against 25 state-of-the-art methods demonstrate the effectiveness of our approach, with impressive robustness, generalizability, and representation ability.
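A minimal sketch of patch-level spatiotemporal dropout is given below: the video is treated as a bag of space-time patch tokens and only a random subset is kept during training as augmentation. The keep ratio and the per-sample sampling policy are assumptions, since the abstract does not specify them.

```python
# Minimal sketch of spatiotemporal patch dropout before a vision transformer.
import torch

def spatiotemporal_dropout(patch_tokens, keep_ratio=0.7, training=True):
    """patch_tokens: (B, T*N, D) tokens from all frames' patches."""
    if not training:
        return patch_tokens
    B, L, D = patch_tokens.shape
    keep = max(1, int(L * keep_ratio))
    # Independently shuffle token indices per sample and keep a random subset.
    idx = torch.rand(B, L, device=patch_tokens.device).argsort(dim=1)[:, :keep]
    return patch_tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, D))
```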
Unmanned aerial vehicle (UAV) tracking has wide potential applications in areas such as agriculture, navigation, and public security. However, the limited computing resources, battery capacity, and maximum payload of UAVs hinder the deployment of deep learning-based tracking algorithms on UAVs. Consequently, discriminative correlation filter (DCF)-based trackers stand out in the UAV tracking community because of their high efficiency, although their precision is usually much lower than that of deep learning-based trackers. Model compression is a promising way to narrow the efficiency and precision gap between DCF-based and deep learning-based trackers, yet it has not received much attention in UAV tracking. In this paper, we propose the P-SiamFC++ tracker, which is the first to use rank-based filter pruning to compress the SiamFC++ model, achieving a remarkable balance between efficiency and precision. Our method is general and may encourage further studies on UAV tracking with model compression. Extensive experiments on four UAV benchmarks, including UAV123@10fps, DTB70, UAVDT, and VisDrone2018, show that the P-SiamFC++ tracker significantly outperforms state-of-the-art UAV tracking methods.
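As a hedged sketch of rank-based filter pruning in the general HRank spirit, the snippet below scores each filter by the average matrix rank of its output feature maps over a calibration batch and keeps the top-scoring ones; P-SiamFC++'s exact criterion and pruning schedule may differ.

```python
# Sketch of rank-based filter pruning: filters whose output feature maps have
# the lowest average rank over a calibration batch are candidates for removal.
import torch

@torch.no_grad()
def rank_filter_scores(feature_maps):
    """feature_maps: (B, C, H, W) outputs of one conv layer on a calibration batch."""
    B, C, H, W = feature_maps.shape
    ranks = torch.linalg.matrix_rank(feature_maps.reshape(B * C, H, W).float())
    return ranks.reshape(B, C).float().mean(dim=0)   # per-filter average rank

@torch.no_grad()
def filters_to_keep(feature_maps, keep_ratio=0.6):
    scores = rank_filter_scores(feature_maps)
    k = max(1, int(scores.numel() * keep_ratio))
    return torch.topk(scores, k).indices.sort().values  # indices of retained filters
```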
The development of Internet technology has continuously intensified the spread and destructive power of rumors and fake news. Previous research on multimedia fake news detection has built a series of complex feature-extraction and fusion networks to achieve feature alignment between images and texts. However, what the multimodal features are composed of and how features from different modalities affect the decision-making process remain open questions. We present AURA, a multimodal fake news detection network with Adaptive Unimodal Representation Aggregation. We first extract representations from the image pattern, the image semantics, and the text, and generate multimodal representations by feeding the semantic and linguistic representations into an expert network. We then perform coarse-level fake news detection and cross-modal consistency learning on the unimodal and multimodal representations. The classification and consistency scores are mapped into modality-aware attention scores that reweight the features. Finally, we aggregate and classify the weighted features for refined fake news detection. Comprehensive experiments on Weibo and Gossipcop show that AURA outperforms several state-of-the-art fake news detection schemes, steadily improving both the overall prediction accuracy and the recall of fake news.
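One plausible way to map classification and consistency scores into modality-aware attention and reweight the unimodal features is sketched below; the score-to-weight mapping, the feature dimensions, and the number of modalities are assumptions made for illustration.

```python
# Illustrative sketch of modality-aware reweighting: per-modality coarse
# classification and cross-modal consistency scores become attention weights
# that rescale each unimodal representation before aggregation.
import torch
import torch.nn as nn

class ModalityAwareAggregation(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.score_to_weight = nn.Linear(2, 1)  # [cls_score, consistency_score] -> logit
        self.classifier = nn.Linear(dim, 2)     # real / fake

    def forward(self, feats, cls_scores, cons_scores):
        # feats: (B, M, D); cls_scores, cons_scores: (B, M) for M modalities
        logits = self.score_to_weight(torch.stack([cls_scores, cons_scores], dim=-1))
        weights = torch.softmax(logits.squeeze(-1), dim=-1)  # (B, M) attention
        fused = (weights.unsqueeze(-1) * feats).sum(dim=1)   # weighted aggregation
        return self.classifier(fused)
```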
As a representative self-supervised method, contrastive learning has achieved great success in the unsupervised training of representations. It trains an encoder to distinguish positive samples from negative ones given query anchors. These positive and negative samples play critical roles in defining the objective that learns a discriminative encoder and keeps it from learning trivial features. While existing methods choose these samples heuristically, we present a principled method in which both positive and negative samples are directly learnable end-to-end with the encoder. We show that the positive and negative samples can be cooperatively and adversarially learned by minimizing and maximizing the contrastive loss, respectively. This yields cooperative positives and adversarial negatives with respect to the encoder, which are continuously updated to track the learned representation of the query anchors over mini-batches. The proposed method achieves 71.3% and 75.3% top-1 accuracy after 200 and 800 epochs, respectively, of pre-training a ResNet-50 backbone on ImageNet1K, without tricks such as multi-crop or stronger augmentations. With multi-crop, the accuracy can be further boosted to 75.7%. The source code and pre-trained models are released at https://github.com/maple-research-lab/caco.
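A minimal sketch of the core idea is shown below, assuming a single update step in which the positives descend and the negatives ascend the same contrastive loss; the memory bank, momentum terms, and learning rates used in the released implementation are simplified here.

```python
# Minimal sketch of learning samples end-to-end with the encoder: positives
# descend the contrastive loss (cooperative) while negatives ascend it
# (adversarial). Update rules and memory management are simplified.
import torch
import torch.nn.functional as F

def caco_style_step(query, positives, negatives, lr_pos=0.5, lr_neg=0.5, tau=0.1):
    """query, positives: (B, D); negatives: (K, D). Samples are re-normalized."""
    positives = positives.detach().clone().requires_grad_(True)
    negatives = negatives.detach().clone().requires_grad_(True)

    q = F.normalize(query, dim=-1)
    pos = (q * F.normalize(positives, dim=-1)).sum(-1, keepdim=True)  # (B, 1)
    neg = q @ F.normalize(negatives, dim=-1).t()                      # (B, K)
    logits = torch.cat([pos, neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    loss = F.cross_entropy(logits, labels)

    g_pos, g_neg = torch.autograd.grad(loss, [positives, negatives])
    new_pos = F.normalize(positives - lr_pos * g_pos, dim=-1)  # minimize: cooperative
    new_neg = F.normalize(negatives + lr_neg * g_neg, dim=-1)  # maximize: adversarial
    return loss.detach(), new_pos.detach(), new_neg.detach()
```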