Knowledge distillation-based anomaly detection methods generate the same outputs for unknown classes because the teacher and student receive symmetric inputs, and they ignore the rich semantic information in the teacher network's output, which is used only as a "reference standard". To this end, this work proposes a novel Asymmetric Distillation Post-Segmentation (ADPS) method that effectively exploits an asymmetric input structure and the discriminative features of the teacher network. Specifically, a simple yet effective asymmetric input scheme routes different data flows through the teacher and student networks. This gives the student network different inductive and expressive abilities, so that it produces outputs that differ from the teacher's in anomalous regions. Moreover, to further exploit the semantic information of the teacher network and obtain effective discriminative boundaries, the Weight Mask Block (WMB) and a post-segmentation module are proposed. WMB applies a weighting strategy over the teacher-student feature maps to highlight anomalous features. The post-segmentation module further learns from these anomalous features and yields valid discriminative boundaries. Experimental results on three benchmark datasets demonstrate that the proposed ADPS achieves state-of-the-art anomaly segmentation results.
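As a rough sketch of the weighting idea behind WMB, the toy module below weights the per-pixel teacher-student feature discrepancy and passes it to a small segmentation head; the class names, layer choices, and cosine-based discrepancy are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: weight the teacher-student feature discrepancy,
# then refine it with a tiny post-segmentation head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightMaskBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # 1x1 conv that turns the concatenated features into per-pixel weights
        self.weight_conv = nn.Conv2d(2 * channels, 1, kernel_size=1)

    def forward(self, f_teacher: torch.Tensor, f_student: torch.Tensor) -> torch.Tensor:
        # per-pixel cosine-style discrepancy between teacher and student features
        diff = 1.0 - F.cosine_similarity(f_teacher, f_student, dim=1, eps=1e-6)   # (B, H, W)
        # learned weights emphasising regions where the discrepancy matters
        w = torch.sigmoid(self.weight_conv(torch.cat([f_teacher, f_student], dim=1)))  # (B, 1, H, W)
        return w.squeeze(1) * diff                                                 # weighted anomaly map

class PostSegmentationHead(nn.Module):
    """Tiny segmentation head that refines the weighted anomaly map."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
            nn.Conv2d(8, 1, 3, padding=1),
        )

    def forward(self, anomaly_map: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(anomaly_map.unsqueeze(1))).squeeze(1)

# usage with random stand-ins for teacher/student backbone features
f_t, f_s = torch.randn(2, 64, 56, 56), torch.randn(2, 64, 56, 56)
wmb, head = WeightMaskBlock(64), PostSegmentationHead()
seg = head(wmb(f_t, f_s))   # (2, 56, 56) anomaly segmentation scores
```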
Story visualization advances traditional text-to-image generation by generating multiple images from a complete story. This task requires machines to 1) understand long text inputs and 2) produce a globally consistent image sequence that illustrates the contents of the story. A key challenge of consistent story visualization is preserving the characters that are essential to stories. To tackle this challenge, we propose to adapt a recent work that augments Vector-Quantized Variational Autoencoders (VQ-VAE) with a text-to-visual-token (transformer) architecture. Specifically, we modify the text-to-visual-token module into a two-stage framework: 1) a character token planning model that predicts the visual tokens for characters only; 2) a visual token completion model that generates the remaining visual tokens, which are sent to the VQ-VAE to produce the final images. To encourage characters to appear in the images, we further train the two-stage framework with a character-token alignment objective. Extensive experiments and evaluations demonstrate that the proposed method excels at preserving characters and produces higher-quality image sequences than strong baselines. Code can be found at https://github.com/sairin1202/VP-CSV
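The two-stage token pipeline can be pictured with the toy sketch below, where a character-token planner runs first and a completion model fills in the remaining visual tokens; the GRU stand-ins, vocabulary size, and token counts are assumptions for illustration only (the actual models are transformers feeding a VQ-VAE decoder).

```python
# Hypothetical sketch of the two-stage generation: plan character tokens,
# then complete the remaining visual tokens for the VQ-VAE decoder.
import torch
import torch.nn as nn

class ToyAutoregressiveLM(nn.Module):
    """Minimal GRU token predictor standing in for a transformer."""
    def __init__(self, vocab_size: int, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    @torch.no_grad()
    def generate(self, prefix: torch.Tensor, n_new: int) -> torch.Tensor:
        tokens = prefix
        for _ in range(n_new):
            h, _ = self.rnn(self.embed(tokens))
            next_tok = self.head(h[:, -1]).argmax(dim=-1, keepdim=True)
            tokens = torch.cat([tokens, next_tok], dim=1)
        return tokens[:, prefix.size(1):]

VOCAB = 512                                        # toy shared token vocabulary
text_tokens = torch.randint(0, VOCAB, (1, 16))     # encoded story sentence (stand-in)

planner = ToyAutoregressiveLM(VOCAB)               # stage 1: character token planning
completer = ToyAutoregressiveLM(VOCAB)             # stage 2: visual token completion

char_tokens = planner.generate(text_tokens, n_new=32)                       # character tokens
rest_tokens = completer.generate(torch.cat([text_tokens, char_tokens], dim=1),
                                 n_new=256 - 32)                            # remaining tokens
visual_tokens = torch.cat([char_tokens, rest_tokens], dim=1)                # to VQ-VAE decoder
```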
Vessel segmentation is essential in many medical image applications, such as the detection of coronary stenoses, retinal vessel diseases, and brain aneurysms. High pixel-wise accuracy, a complete topology structure, and robustness to various contrast variations are three critical aspects of vessel segmentation. However, most existing methods address only some of these aspects through dedicated designs, and few achieve all three concurrently. In this paper, we present a novel affinity feature strengthening network (AFN) that adopts a contrast-insensitive, multiscale-affinity-based approach to jointly model topology and refine pixel-wise segmentation features. Specifically, for each pixel we derive a multiscale affinity field that captures the semantic relationships of the pixel with its neighbors on the predicted mask image. Such a multiscale affinity field can effectively represent the local topology of vessel segments of different sizes. Meanwhile, it does not depend on image intensities and hence is robust to various illumination and contrast changes. We further learn spatial- and scale-aware adaptive weights for the corresponding affinity fields to strengthen vessel features. We evaluate our AFN on four different types of vascular datasets: an X-ray angiography coronary vessel dataset (XCAD), a portal vein dataset (PV), a digital subtraction angiography cerebrovascular vessel dataset (DSA), and a retinal vessel dataset (DRIVE). Extensive experimental results on the four datasets demonstrate that our AFN outperforms state-of-the-art methods in terms of both accuracy and topological metrics, and is more robust to various contrast changes than existing methods. Code will be made public.
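A minimal sketch of one plausible way to compute such a multiscale affinity field from a predicted mask is shown below; the neighbor offsets, scales, and the agreement measure are assumptions, not the exact formulation of AFN.

```python
# Assumed formulation: for each pixel, compare the predicted mask value with
# its 8 neighbours at several scales; agreeing neighbours give affinity 1.
import torch
import torch.nn.functional as F

def multiscale_affinity(mask: torch.Tensor, scales=(1, 2, 4)) -> torch.Tensor:
    """mask: (B, 1, H, W) soft or binary prediction in [0, 1].
    Returns (B, 8 * len(scales), H, W) affinities, 8 neighbours per scale."""
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
    fields = []
    for s in scales:
        for dy, dx in offsets:
            # shift the mask by (s*dy, s*dx) using replicate padding
            shifted = F.pad(mask, (s, s, s, s), mode="replicate")
            shifted = shifted[:, :, s + s * dy: s + s * dy + mask.size(2),
                                    s + s * dx: s + s * dx + mask.size(3)]
            # affinity = 1 when pixel and neighbour agree, 0 when they differ
            fields.append(1.0 - (mask - shifted).abs())
    return torch.cat(fields, dim=1)

pred_mask = torch.rand(2, 1, 64, 64)        # stand-in predicted vessel mask
aff = multiscale_affinity(pred_mask)        # (2, 24, 64, 64)
```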
Image super-resolution (SR) is a field in computer vision that focuses on reconstructing high-resolution images from their low-resolution counterparts. However, super-resolution is a well-known ill-posed problem, as most methods assume that the downsampling operation used to produce the low-resolution image from the high-resolution image is known. Unfortunately, this assumption does not hold in real-life super-resolution applications such as increasing the quality of a photo taken on a mobile phone. In this paper, we evaluate multiple state-of-the-art super-resolution methods, gauge their performance on various types of real-life images, and discuss the benefits and drawbacks of each method. We also introduce a novel dataset, WideRealSR, containing real images from a wide variety of sources. Finally, through careful experimentation and evaluation, we present a potential solution to alleviate the generalization problem that is inherent in most state-of-the-art super-resolution models.
Compared with the rapid development of single-frequency multi-polarization SAR image classification technology, there has been less research on land cover classification from multifrequency polarimetric SAR (MF-PolSAR) images. In addition, current deep learning methods for MF-PolSAR classification are mainly based on convolutional neural networks (CNNs), which consider only local spatial information and ignore nonlocal relationships. Therefore, based on semantic interaction and nonlocal topological structure, this paper proposes the MF semantics and topology fusion network (MF-STFnet) to improve MF-PolSAR classification performance. In MF-STFnet, two kinds of classification are performed for each band: semantic information-based classification (SIC) and topological property-based classification (TPC). They work collaboratively during MF-STFnet training, which not only fully leverages the complementarity of the bands but also combines local and nonlocal spatial information to improve discrimination between categories. For SIC, the designed cross-band interactive feature extraction module (CIFEM) is embedded to explicitly model the deep semantic correlation among bands, thereby leveraging their complementarity to make ground objects more separable. For TPC, the graph sample and aggregate network (GraphSAGE) is employed to dynamically capture the representation of nonlocal topological relations between land cover categories. In this way, the robustness of the classification can be further improved by incorporating nonlocal spatial information. Finally, an adaptive weighting fusion (AWF) strategy is proposed to merge the inferences from different bands and make the MF joint classification decisions of SIC and TPC. Comparative experiments show that MF-STFnet achieves more competitive classification performance than some state-of-the-art methods.
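As a hedged illustration of the adaptive weighting fusion step, the sketch below merges per-branch class probabilities with learned softmax-normalized weights; the module name, branch layout, and weighting scheme are assumptions rather than the paper's exact AWF design.

```python
# Hypothetical sketch: fuse per-band / per-head class probabilities with
# learnable, softmax-normalised weights.
import torch
import torch.nn as nn

class AdaptiveWeightingFusion(nn.Module):
    def __init__(self, num_branches: int):
        super().__init__()
        # one learnable scalar per branch (e.g., per band and per SIC/TPC head)
        self.logits = nn.Parameter(torch.zeros(num_branches))

    def forward(self, branch_probs: torch.Tensor) -> torch.Tensor:
        # branch_probs: (num_branches, B, num_classes)
        w = torch.softmax(self.logits, dim=0).view(-1, 1, 1)
        return (w * branch_probs).sum(dim=0)   # fused (B, num_classes)

# e.g. two bands, each with a semantic (SIC) and a topological (TPC) head
probs = torch.softmax(torch.randn(4, 8, 5), dim=-1)   # 4 branches, 8 pixels, 5 classes
fused = AdaptiveWeightingFusion(num_branches=4)(probs)
prediction = fused.argmax(dim=-1)                      # joint class decision per pixel
```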
Diffusion models (DMs) have recently emerged as a promising method in image synthesis. They have surpassed generative adversarial networks (GANs) in both diversity and quality, and have achieved impressive results in text-to-image and image-to-image modeling. However, to date, little attention has been paid to the detection of DM-generated images, which is critical to prevent adverse impacts on our society. Although prior work has shown that GAN-generated images can be reliably detected using automated methods, it is unclear whether the same methods are effective against DMs. In this work, we address this challenge and take a first look at detecting DM-generated images. We approach the problem from two different angles: first, we evaluate the performance of state-of-the-art detectors on a variety of DMs; second, we analyze DM-generated images in the frequency domain and study the factors that influence the spectral properties of these images. Most importantly, we demonstrate that GANs and DMs produce images with different characteristics, which requires adaptation of existing classifiers to ensure reliable detection. We believe this work provides the foundation and starting point for further research to detect DM deepfakes effectively.
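The frequency-domain analysis mentioned above can be illustrated with the short sketch below, which computes the average log-magnitude Fourier spectrum of a set of images and the residual between generated and real sets; the image arrays here are random stand-ins for an actual dataset.

```python
# Minimal sketch of standard spectral analysis for generated-image forensics.
import numpy as np

def mean_log_spectrum(images: np.ndarray) -> np.ndarray:
    """images: (N, H, W) float array in [0, 1]. Returns (H, W) mean log spectrum."""
    spectra = []
    for img in images:
        f = np.fft.fftshift(np.fft.fft2(img))        # centre the zero frequency
        spectra.append(np.log(np.abs(f) + 1e-8))     # log magnitude
    return np.stack(spectra).mean(axis=0)

fake = np.random.rand(16, 128, 128)   # stand-in for DM-generated images
real = np.random.rand(16, 128, 128)   # stand-in for real images
residual = mean_log_spectrum(fake) - mean_log_spectrum(real)   # spectral discrepancy map
```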
Multilayer perceptrons (MLPs) learn high frequencies slowly. Recent approaches encode features in spatial bins to improve speed of learning details, but at the cost of larger model size and loss of continuity. Instead, we propose to encode features in bins of Fourier features that are commonly used for positional encoding. We call these Quantized Fourier Features (QFF). As a naturally multiresolution and periodic representation, our experiments show that using QFF can result in smaller model size, faster training, and better quality outputs for several applications, including Neural Image Representations (NIR), Neural Radiance Field (NeRF) and Signed Distance Function (SDF) modeling. QFF are easy to code, fast to compute, and serve as a simple drop-in addition to many neural field representations.
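A hedged sketch of the core idea is shown below, under the assumption that each sin/cos Fourier feature value selects a learned feature vector by linear interpolation from a small per-frequency table; the paper's exact parameterization may differ.

```python
# Assumed QFF-style encoding: Fourier features index learned 1D feature bins.
import math
import torch
import torch.nn as nn

class QuantizedFourierFeatures(nn.Module):
    def __init__(self, num_freqs: int = 8, bins: int = 16, feat_dim: int = 4):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(num_freqs))    # octave frequencies
        # one learned 1D feature table per (frequency, sin/cos) pair
        self.tables = nn.Parameter(torch.zeros(2 * num_freqs, bins, feat_dim))
        self.bins = bins

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N,) 1D coordinates in [0, 1] -> (N, 2 * num_freqs * feat_dim) features
        phase = 2 * math.pi * x[:, None] * self.freqs
        enc = torch.cat([torch.sin(phase), torch.cos(phase)], dim=1)      # values in [-1, 1]
        pos = (enc + 1) / 2 * (self.bins - 1)                             # continuous bin position
        lo = pos.floor().long().clamp(0, self.bins - 2)
        w = (pos - lo.float()).unsqueeze(-1)                              # interpolation weight
        k = torch.arange(self.tables.size(0)).unsqueeze(0)                # (1, 2F) table indices
        feat = (1 - w) * self.tables[k, lo] + w * self.tables[k, lo + 1]  # (N, 2F, D)
        return feat.flatten(1)

qff = QuantizedFourierFeatures()
features = qff(torch.rand(1024))   # per-coordinate features fed to a small MLP head
```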
The satellite image data downlink scheduling problem (SIDSP) is well studied in the literature for traditional satellites. With recent developments in satellite technology, SIDSP for modern satellites has become more complicated, adding new dimensions of complexity and additional opportunities for the effective use of the satellite. In this paper, we introduce the dynamic two-phase satellite image data downlink scheduling problem (D-SIDSP), which combines the two interlinked operations of image data segmentation and image data downlink in a dynamic way, thereby offering additional modelling flexibility and renewed capabilities. D-SIDSP is formulated as a bi-objective problem of optimizing the image data transmission rate and the service-balance degree. Harnessing the power of an adaptive large neighborhood search algorithm (ALNS) together with the nondominated sorting genetic algorithm II (NSGA-II), an adaptive bi-objective memetic algorithm, ALNS+NSGA-II, is developed to solve D-SIDSP. Results of extensive computational experiments carried out on benchmark instances are also presented. Our experimental results show that ALNS+NSGA-II is a viable alternative for solving D-SIDSP more efficiently and achieves superior outcomes on various performance metrics. The paper also offers new benchmark instances for D-SIDSP that can be used in future research on the topic.
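To make the bi-objective setting concrete, the small sketch below shows the Pareto dominance test and the first non-dominated front for (transmission rate, service-balance degree) pairs, both to be maximized; the ALNS operators and the NSGA-II crowding and selection machinery are omitted, and the candidate values are invented toy numbers.

```python
# Pareto dominance and first non-dominated front for two maximised objectives.
def dominates(a, b):
    """a dominates b if a is no worse in every objective and better in at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def first_pareto_front(solutions):
    """Return the solutions not dominated by any other solution."""
    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions if t is not s)]

# toy (transmission rate, service-balance degree) values of candidate schedules
candidates = [(0.82, 0.40), (0.75, 0.55), (0.80, 0.52), (0.60, 0.70), (0.70, 0.50)]
print(first_pareto_front(candidates))
# -> [(0.82, 0.40), (0.75, 0.55), (0.80, 0.52), (0.60, 0.70)]
```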
Complex and contact-rich robotic manipulation tasks, particularly those involving multi-fingered hands and underactuated object manipulation, present a significant challenge to any control method. Methods based on reinforcement learning offer an appealing choice for such settings, as they can enable robots to learn to delicately balance contact forces and dexterously reposition objects without strong modeling assumptions. However, running reinforcement learning on real-world dexterous manipulation systems often requires significant manual engineering. This negates the benefits of autonomous data collection and ease of use that reinforcement learning should in principle provide. In this paper, we describe a system for vision-based dexterous manipulation that provides a "programming-free" approach for users to define new tasks and enables robots with complex multi-fingered hands to learn to perform them through interaction. The core principle underlying our system is that, in a vision-based setting, users should be able to provide high-level intermediate supervision that circumvents the challenges of teleoperation or kinesthetic teaching, which allows a robot not only to learn a task efficiently but also to practice it autonomously. Our system includes a framework for users to define a final task and intermediate sub-tasks with image examples, a reinforcement learning procedure that learns the task autonomously without interventions, and experimental results with a four-finger robotic hand learning multi-stage object manipulation tasks directly in the real world, without simulation, manual modeling, or reward engineering.
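One common way such image-example supervision can be turned into a learning signal is a small success classifier that scores how complete the current camera observation looks, as in the hedged sketch below; the network, its training data, and its use as a reward are illustrative assumptions, not the authors' exact system.

```python
# Hypothetical sketch: a binary success classifier over camera images whose
# output probability serves as a reward for the reinforcement learner.
import torch
import torch.nn as nn

class SuccessClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 1),
        )

    def reward(self, obs: torch.Tensor) -> torch.Tensor:
        """obs: (B, 3, H, W) camera image -> probability that the sub-task is complete."""
        return torch.sigmoid(self.net(obs)).squeeze(-1)

clf = SuccessClassifier()            # would be trained on user-provided example images
obs = torch.rand(1, 3, 84, 84)       # current camera observation (stand-in)
r = clf.reward(obs)                  # used as the RL reward signal
```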
Due to the unequivocal need to understand the decision processes of deep learning networks, both model-dependent and model-agnostic techniques have become very popular. Although both of these ideas provide transparency for automated decision making, most methodologies focus either on using the model's gradients (model-dependent) or on ignoring the model's internal states and reasoning over the model's behavior/outcomes on instances (model-agnostic). In this work, we propose a unified explanation approach that, given an instance, combines both model-dependent and model-agnostic explanations to produce an explanation set. The generated explanations are not only consistent in the neighborhood of a sample but can also highlight causal relationships between image content and the outcome. We use the Wireless Capsule Endoscopy (WCE) domain to illustrate the effectiveness of our explanations. The saliency maps generated by our approach are comparable to or better than existing methods on the softmax information score.
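As a hedged illustration of combining the two explanation families, the sketch below multiplies a gradient-based saliency map with an occlusion-based, model-agnostic map for a single image; the combination rule, patch size, and stand-in classifier are assumptions, not the proposed method.

```python
# Toy combination of model-dependent (gradient) and model-agnostic (occlusion) saliency.
import torch
import torch.nn as nn

def gradient_saliency(model: nn.Module, image: torch.Tensor, target: int) -> torch.Tensor:
    image = image.clone().requires_grad_(True)
    model(image.unsqueeze(0))[0, target].backward()
    return image.grad.abs().max(dim=0).values            # (H, W) gradient magnitude map

def occlusion_saliency(model, image, target, patch=8):
    base = model(image.unsqueeze(0))[0, target].item()
    sal = torch.zeros(image.shape[1:])
    for y in range(0, image.shape[1], patch):
        for x in range(0, image.shape[2], patch):
            occluded = image.clone()
            occluded[:, y:y + patch, x:x + patch] = 0     # mask out one patch
            drop = base - model(occluded.unsqueeze(0))[0, target].item()
            sal[y:y + patch, x:x + patch] = drop          # score = drop in target logit
    return sal.clamp(min=0)

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 2))   # stand-in classifier
img = torch.rand(3, 32, 32)
combined = gradient_saliency(model, img, 1) * occlusion_saliency(model, img, 1)
```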