Pansharpening, a pivotal task in remote sensing, involves integrating low-resolution multispectral images with high-resolution panchromatic images to synthesize an image that is both high-resolution and retains multispectral information. These pansharpened images enhance precision in land cover classification, change detection, and environmental monitoring within remote sensing data analysis. While deep learning techniques have shown significant success in pansharpening, existing methods often face limitations in their evaluation, focusing on restricted satellite data sources, single scene types, and low-resolution images. This paper addresses this gap by introducing PanBench, a high-resolution multi-scene dataset containing all mainstream satellites and comprising 5,898 pairs of samples. Each pair includes a four-channel (RGB + near-infrared) multispectral image of 256x256 pixels and a mono-channel panchromatic image of 1,024x1,024 pixels. To achieve high-fidelity synthesis, we propose a Cascaded Multiscale Fusion Network (CMFNet) for Pansharpening. Extensive experiments validate the effectiveness of CMFNet. We have released the dataset, source code, and pre-trained models in the supplementary, fostering further research in remote sensing.
Multi-human parsing is an image segmentation task necessitating both instance-level and fine-grained category-level information. However, prior research has typically processed these two types of information through separate branches and distinct output formats, leading to inefficient and redundant frameworks. This paper introduces UniParser, which integrates instance-level and category-level representations in three key aspects: 1) we propose a unified correlation representation learning approach, allowing our network to learn instance and category features within the cosine space; 2) we unify the form of outputs of each modules as pixel-level segmentation results while supervising instance and category features using a homogeneous label accompanied by an auxiliary loss; and 3) we design a joint optimization procedure to fuse instance and category representations. By virtual of unifying instance-level and category-level output, UniParser circumvents manually designed post-processing techniques and surpasses state-of-the-art methods, achieving 49.3% AP on MHPv2.0 and 60.4% AP on CIHP. We will release our source code, pretrained models, and online demos to facilitate future studies.
Most existing RGB-T tracking networks extract modality features in a separate manner, which lacks interaction and mutual guidance between modalities. This limits the network's ability to adapt to the diverse dual-modality appearances of targets and the dynamic relationships between the modalities. Additionally, the three-stage fusion tracking paradigm followed by these networks significantly restricts the tracking speed. To overcome these problems, we propose a unified single-stage Transformer RGB-T tracking network, namely USTrack, which unifies the above three stages into a single ViT (Vision Transformer) backbone with a dual embedding layer through self-attention mechanism. With this structure, the network can extract fusion features of the template and search region under the mutual interaction of modalities. Simultaneously, relation modeling is performed between these features, efficiently obtaining the search region fusion features with better target-background discriminability for prediction. Furthermore, we introduce a novel feature selection mechanism based on modality reliability to mitigate the influence of invalid modalities for prediction, further improving the tracking performance. Extensive experiments on three popular RGB-T tracking benchmarks demonstrate that our method achieves new state-of-the-art performance while maintaining the fastest inference speed 84.2FPS. In particular, MPR/MSR on the short-term and long-term subsets of VTUAV dataset increased by 11.1$\%$/11.7$\%$ and 11.3$\%$/9.7$\%$.
The extraction of lakes from remote sensing images is a complex challenge due to the varied lake shapes and data noise. Current methods rely on multispectral image datasets, making it challenging to learn lake features accurately from pixel arrangements. This, in turn, affects model learning and the creation of accurate segmentation masks. This paper introduces a unified prompt-based dataset construction approach that provides approximate lake locations using point, box, and mask prompts. We also propose a two-stage prompt enhancement framework, LEPrompter, which involves prompt-based and prompt-free stages during training. The prompt-based stage employs a prompt encoder to extract prior information, integrating prompt tokens and image embeddings through self- and cross-attention in the prompt decoder. Prompts are deactivated once the model is trained to ensure independence during inference, enabling automated lake extraction. Evaluations on Surface Water and Qinghai-Tibet Plateau Lake datasets show consistent performance improvements compared to the previous state-of-the-art method. LEPrompter achieves mIoU scores of 91.48% and 97.43% on the respective datasets without introducing additional parameters or GFLOPs. Supplementary materials provide the source code, pre-trained models, and detailed user studies.
Creating realistic 3D facial animation is crucial for various applications in the movie production and gaming industry, especially with the burgeoning demand in the metaverse. However, prevalent methods such as blendshape-based approaches and facial rigging techniques are time-consuming, labor-intensive, and lack standardized configurations, making facial animation production challenging and costly. In this paper, we propose a novel self-supervised framework, Versatile Face Animator, which combines facial motion capture with motion retargeting in an end-to-end manner, eliminating the need for blendshapes or rigs. Our method has the following two main characteristics: 1) we propose an RGBD animation module to learn facial motion from raw RGBD videos by hierarchical motion dictionaries and animate RGBD images rendered from 3D facial mesh coarse-to-fine, enabling facial animation on arbitrary 3D characters regardless of their topology, textures, blendshapes, and rigs; and 2) we introduce a mesh retarget module to utilize RGBD animation to create 3D facial animation by manipulating facial mesh with controller transformations, which are estimated from dense optical flow fields and blended together with geodesic-distance-based weights. Comprehensive experiments demonstrate the effectiveness of our proposed framework in generating impressive 3D facial animation results, highlighting its potential as a promising solution for the cost-effective and efficient production of facial animation in the metaverse.
Human hands, the primary means of non-verbal communication, convey intricate semantics in various scenarios. Due to the high sensitivity of individuals to hand motions, even minor errors in hand motions can significantly impact the user experience. Real applications often involve multiple avatars with varying hand shapes, highlighting the importance of maintaining the intricate semantics of hand motions across the avatars. Therefore, this paper aims to transfer the hand motion semantics between diverse avatars based on their respective hand models. To address this problem, we introduce a novel anatomy-based semantic matrix (ASM) that encodes the semantics of hand motions. The ASM quantifies the positions of the palm and other joints relative to the local frame of the corresponding joint, enabling precise retargeting of hand motions. Subsequently, we obtain a mapping function from the source ASM to the target hand joint rotations by employing an anatomy-based semantics reconstruction network (ASRN). We train the ASRN using a semi-supervised learning strategy on the Mixamo and InterHand2.6M datasets. We evaluate our method in intra-domain and cross-domain hand motion retargeting tasks. The qualitative and quantitative results demonstrate the significant superiority of our ASRN over the state-of-the-arts.
Speech-driven 3D face animation poses significant challenges due to the intricacy and variability inherent in human facial movements. This paper emphasizes the importance of considering both the composite and regional natures of facial movements in speech-driven 3D face animation. The composite nature pertains to how speech-independent factors globally modulate speech-driven facial movements along the temporal dimension. Meanwhile, the regional nature alludes to the notion that facial movements are not globally correlated but are actuated by local musculature along the spatial dimension. It is thus indispensable to incorporate both natures for engendering vivid animation. To address the composite nature, we introduce an adaptive modulation module that employs arbitrary facial movements to dynamically adjust speech-driven facial movements across frames on a global scale. To accommodate the regional nature, our approach ensures that each constituent of the facial features for every frame focuses on the local spatial movements of 3D faces. Moreover, we present a non-autoregressive backbone for translating audio to 3D facial movements, which maintains high-frequency nuances of facial movements and facilitates efficient inference. Comprehensive experiments and user studies demonstrate that our method surpasses contemporary state-of-the-art approaches both qualitatively and quantitatively.
Optical satellite images are a critical data source; however, cloud cover often compromises their quality, hindering image applications and analysis. Consequently, effectively removing clouds from optical satellite images has emerged as a prominent research direction. While recent advancements in cloud removal primarily rely on generative adversarial networks, which may yield suboptimal image quality, diffusion models have demonstrated remarkable success in diverse image-generation tasks, showcasing their potential in addressing this challenge. This paper presents a novel framework called DiffCR, which leverages conditional guided diffusion with deep convolutional networks for high-performance cloud removal for optical satellite imagery. Specifically, we introduce a decoupled encoder for conditional image feature extraction, providing a robust color representation to ensure the close similarity of appearance information between the conditional input and the synthesized output. Moreover, we propose a novel and efficient time and condition fusion block within the cloud removal model to accurately simulate the correspondence between the appearance in the conditional image and the target image at a low computational cost. Extensive experimental evaluations on two commonly used benchmark datasets demonstrate that DiffCR consistently achieves state-of-the-art performance on all metrics, with parameter and computational complexities amounting to only 5.1% and 5.4%, respectively, of those previous best methods. The source code, pre-trained models, and all the experimental results will be publicly available at https://github.com/XavierJiezou/DiffCR upon the paper's acceptance of this work.
Recent score-based diffusion models (SBDMs) show promising results in unpaired image-to-image translation (I2I). However, existing methods, either energy-based or statistically-based, provide no explicit form of the interfered intermediate generative distributions. This work presents a new score-decomposed diffusion model (SDDM) on manifolds to explicitly optimize the tangled distributions during image generation. SDDM derives manifolds to make the distributions of adjacent time steps separable and decompose the score function or energy guidance into an image ``denoising" part and a content ``refinement" part. To refine the image in the same noise level, we equalize the refinement parts of the score function and energy guidance, which permits multi-objective optimization on the manifold. We also leverage the block adaptive instance normalization module to construct manifolds with lower dimensions but still concentrated with the perturbed reference image. SDDM outperforms existing SBDM-based methods with much fewer diffusion steps on several I2I benchmarks.
Unmanned Aerial Vehicles (UAVs) have been widely used in many areas, including transportation, surveillance, and military. However, their potential for safety and privacy violations is an increasing issue and highly limits their broader applications, underscoring the critical importance of UAV perception and defense (anti-UAV). Still, previous works have simplified such an anti-UAV task as a tracking problem, where the prior information of UAVs is always provided; such a scheme fails in real-world anti-UAV tasks (i.e. complex scenes, indeterminate-appear and -reappear UAVs, and real-time UAV surveillance). In this paper, we first formulate a new and practical anti-UAV problem featuring the UAVs perception in complex scenes without prior UAVs information. To benchmark such a challenging task, we propose the largest UAV dataset dubbed AntiUAV600 and a new evaluation metric. The AntiUAV600 comprises 600 video sequences of challenging scenes with random, fast, and small-scale UAVs, with over 723K thermal infrared frames densely annotated with bounding boxes. Finally, we develop a novel anti-UAV approach via an evidential collaboration of global UAVs detection and local UAVs tracking, which effectively tackles the proposed problem and can serve as a strong baseline for future research. Extensive experiments show our method outperforms SOTA approaches and validate the ability of AntiUAV600 to enhance UAV perception performance due to its large scale and complexity. Our dataset, pretrained models, and source codes will be released publically.