Fast arbitrary neural style transfer has attracted widespread attention from academic, industrial and art communities due to its flexibility in enabling various applications. Existing solutions either attentively fuse deep style features into deep content features without considering feature distributions, or adaptively normalize deep content features according to the style such that their global statistics are matched. Although effective, these methods leave shallow features unexplored and do not consider feature statistics locally, so they are prone to unnatural output with unpleasing local distortions. To alleviate this problem, we propose a novel attention and normalization module, named Adaptive Attention Normalization (AdaAttN), to adaptively perform attentive normalization on a per-point basis. Specifically, a spatial attention score is learnt from both shallow and deep features of the content and style images. Then per-point weighted statistics are calculated by regarding a style feature point as a distribution of attention-weighted output of all style feature points. Finally, the content feature is normalized so that it demonstrates the same local feature statistics as the calculated per-point weighted style feature statistics. Besides, a novel local feature loss is derived based on AdaAttN to enhance local visual quality. We also extend AdaAttN to be ready for video style transfer with slight modifications. Experiments demonstrate that our method achieves state-of-the-art arbitrary image/video style transfer. Code and models are available.
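The three steps named in this abstract (attention, per-point weighted statistics, normalization) can be illustrated with a minimal pure-Python sketch. It assumes the features are already flattened into lists of points with the same number of channels, uses raw dot products in place of the learned attention projections, and normalizes the content feature per channel. The names `ada_attn`, `normalize_channels` and the `eps` parameter are hypothetical and are not taken from the released code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def normalize_channels(points, eps):
    """Zero-mean, unit-variance normalization of each channel across all points."""
    n = len(points)
    channels = list(zip(*points))
    means = [sum(ch) / n for ch in channels]
    stds = [math.sqrt(sum((x - m) ** 2 for x in ch) / n + eps)
            for ch, m in zip(channels, means)]
    return [[(x - m) / s for x, m, s in zip(p, means, stds)] for p in points]

def ada_attn(content, style, eps=1e-5):
    """content: Nc points x C channels, style: Ns points x C channels (lists of lists)."""
    n_ch = len(style[0])
    content_norm = normalize_channels(content, eps)
    output = []
    for raw, normed in zip(content, content_norm):
        # Attention of this content point over all style points
        # (assumption: plain dot-product scores, no learned projections).
        weights = softmax([sum(a * b for a, b in zip(raw, s)) for s in style])
        out_point = []
        for k in range(n_ch):
            # Attention-weighted mean and standard deviation of the style feature.
            mean = sum(w * s[k] for w, s in zip(weights, style))
            second = sum(w * s[k] ** 2 for w, s in zip(weights, style))
            std = math.sqrt(max(second - mean ** 2, 0.0) + eps)
            # Scale and shift the normalized content point with the local statistics.
            out_point.append(std * normed[k] + mean)
        output.append(out_point)
    return output
```

In this sketch every content point receives its own mean and standard deviation, which is what distinguishes per-point normalization from matching a single pair of global statistics.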
A new Plug-and-Play (PnP) alternating direction method of multipliers (ADMM) scheme is proposed in this paper, by embedding a recently introduced adaptive denoiser based on the solutions of the Schrödinger equation from quantum physics. The potential of the proposed model is studied for Poisson image deconvolution, a common problem in a number of imaging applications such as limited photon acquisition or X-ray computed tomography. Numerical results show the efficiency and good adaptability of the proposed scheme compared to recent state-of-the-art techniques, for both high and low signal-to-noise ratio scenarios. This performance gain regardless of the amount of noise affecting the observations is explained by the flexibility of the embedded quantum denoiser, which is constructed without assuming any prior statistics about the noise, one of the main advantages of this method.
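For reference, a generic PnP-ADMM iteration for Poisson deconvolution can be written as follows. This is the standard scaled form of ADMM with the proximal step of the prior replaced by a denoiser, and it is not necessarily the exact splitting used in the paper. The symbols $H$ (blur operator), $y$ (observed photon counts), $\rho$ (penalty parameter) and $\mathcal{D}$ (the plugged-in denoiser) are notation introduced for this sketch.

$$
\begin{aligned}
f(x) &= \sum_i \big[ (Hx)_i - y_i \log (Hx)_i \big], \\
x^{k+1} &= \arg\min_x \; f(x) + \tfrac{\rho}{2}\,\lVert x - z^k + u^k \rVert_2^2, \\
z^{k+1} &= \mathcal{D}\big(x^{k+1} + u^k\big), \\
u^{k+1} &= u^k + x^{k+1} - z^{k+1}.
\end{aligned}
$$

Here $f$ is the Poisson negative log-likelihood up to a constant, and only the $z$-update depends on the choice of denoiser.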
High dynamic range (HDR) video reconstruction from sequences captured with alternating exposures is a very challenging problem. Existing methods often align the low dynamic range (LDR) input sequence in the image space using optical flow, and then merge the aligned images to produce the HDR output. However, accurate alignment and fusion in the image space are difficult due to the missing details in the over-exposed regions and noise in the under-exposed regions, resulting in unpleasing ghosting artifacts. To enable more accurate alignment and HDR fusion, we introduce a coarse-to-fine deep learning framework for HDR video reconstruction. Firstly, we perform coarse alignment and pixel blending in the image space to estimate the coarse HDR video. Secondly, we conduct more sophisticated alignment and temporal fusion in the feature space of the coarse HDR video to produce a better reconstruction. Since there is no publicly available dataset for quantitative and comprehensive evaluation of HDR video reconstruction methods, we collect such a benchmark dataset, which contains 97 sequences of static scenes and 184 testing pairs of dynamic scenes. Extensive experiments show that our method outperforms previous state-of-the-art methods. Our dataset, code and model will be made publicly available.
Unsupervised representation learning has achieved outstanding performance using centralized data available on the Internet. However, the increasing awareness of privacy protection limits the sharing of decentralized unlabeled image data, which grows explosively across multiple parties (e.g., mobile phones and cameras). As such, a natural problem is how to leverage these data to learn visual representations for downstream tasks while preserving data privacy. To address this problem, we propose a novel federated unsupervised learning framework, FedU. In this framework, each party trains models from unlabeled data independently using contrastive learning with an online network and a target network. Then, a central server aggregates the trained models and updates the clients' models with the aggregated model. It preserves data privacy as each party only has access to its own raw data. Decentralized data among multiple parties are normally non-independent and identically distributed (non-IID), leading to performance degradation. To tackle this challenge, we propose two simple but effective methods: 1) We design the communication protocol to upload only the encoders of online networks for server aggregation and update them with the aggregated encoder; 2) We introduce a new module to dynamically decide how to update predictors based on the divergence caused by non-IID data. The predictor is the other component of the online network. Extensive experiments and ablations demonstrate the effectiveness and significance of FedU. It outperforms training with only one party by over 5% and other methods by over 14% in linear and semi-supervised evaluation on non-IID data.
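The two mechanisms listed in this abstract can be sketched in a few lines of pure Python. Several details are assumptions made only for illustration: models are represented as dictionaries of flat float lists, the server weights clients by local dataset size, divergence is measured as an L2 distance, a shared predictor is assumed to be available to the client, and the threshold rule in `update_client` is hypothetical. The abstract itself only states that the predictor update is decided dynamically from the divergence.

```python
import math

def aggregate_encoders(encoders, num_samples):
    """Weighted average of client encoders (weights: local dataset sizes, an assumption)."""
    total = sum(num_samples)
    return {
        name: [
            sum(n * enc[name][i] for enc, n in zip(encoders, num_samples)) / total
            for i in range(len(values))
        ]
        for name, values in encoders[0].items()
    }

def divergence(local, reference):
    """L2 distance between two parameter dictionaries with identical keys."""
    return math.sqrt(sum((a - b) ** 2
                         for name in local
                         for a, b in zip(local[name], reference[name])))

def update_client(local_encoder, local_predictor,
                  global_encoder, global_predictor, threshold):
    """Always adopt the aggregated encoder; adopt the shared predictor
    only when the local encoder has not drifted far from the aggregate
    (hypothetical rule)."""
    if divergence(local_encoder, global_encoder) < threshold:
        return global_encoder, global_predictor
    return global_encoder, local_predictor
```

For example, with two clients holding 1 and 3 samples, `aggregate_encoders([{"w": [0.0, 0.0]}, {"w": [2.0, 4.0]}], [1, 3])` gives an encoder three times closer to the second client's weights than to the first's.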
The ability to accurately estimate risk of developing breast cancer would be invaluable for clinical decision-making. One promising new approach is to integrate image-based risk models based on deep neural networks. However, one must take care when using such models, as selection of training data influences the patterns the network will learn to identify. With this in mind, we trained networks using three different criteria to select the positive training data (i.e. images from patients that will develop cancer): an inherent risk model trained on images with no visible signs of cancer, a cancer signs model trained on images containing cancer or early signs of cancer, and a conflated model trained on all images from patients with a cancer diagnosis. We find that these three models learn distinctive features that focus on different patterns, which translates to contrasts in performance. Short-term risk is best estimated by the cancer signs model, whilst long-term risk is best estimated by the inherent risk model. Carelessly training with all images conflates inherent risk with early cancer signs, and yields sub-optimal estimates in both regimes. As a consequence, conflated models may lead physicians to recommend preventative action when early cancer signs are already visible.
Supervised deep learning has swiftly become a workhorse for accelerated MRI in recent years, offering state-of-the-art performance in image reconstruction from undersampled acquisitions. Training deep supervised models requires large datasets of undersampled and fully-sampled acquisitions typically from a matching set of subjects. Given scarce access to large medical datasets, this limitation has sparked interest in unsupervised methods that reduce reliance on fully-sampled ground-truth data. A common framework is based on the deep image prior, where network-driven regularization is enforced directly during inference on undersampled acquisitions. Yet, canonical convolutional architectures are suboptimal in capturing long-range relationships, and randomly initialized networks may hamper convergence. To address these limitations, here we introduce a novel unsupervised MRI reconstruction method based on zero-Shot Learned Adversarial TransformERs (SLATER). SLATER embodies a deep adversarial network with cross-attention transformer blocks to map noise and latent variables onto MR images. This unconditional network learns a high-quality MRI prior in a self-supervised encoding task. A zero-shot reconstruction is performed on undersampled test data, where inference is performed by optimizing network parameters, latent and noise variables to ensure maximal consistency to multi-coil MRI data. Comprehensive experiments on brain MRI datasets clearly demonstrate the superior performance of SLATER against several state-of-the-art unsupervised methods.
Satellite images are an extremely valuable resource in the aftermath of natural disasters such as hurricanes and tsunamis, where they can be used for risk assessment and disaster management. In order to provide timely and actionable information for disaster response, this paper proposes a framework utilising segmentation neural networks to identify impacted areas and accessible roads in post-disaster scenarios. The effectiveness of pretraining with ImageNet on the task of aerial image segmentation is analysed and the performance of popular segmentation models is compared. Experimental results show that pretraining on ImageNet usually improves the segmentation performance for a number of models. Open data available from OpenStreetMap (OSM) is used for training, forgoing the need for time-consuming manual annotation. The method also makes use of graph theory to update road network data available from OSM and to detect the changes caused by a natural disaster. Extensive experiments on data from the 2018 tsunami that struck Palu, Indonesia show the effectiveness of the proposed framework. ENetSeparable, with 30% fewer parameters than ENet, achieved segmentation results comparable to those of the state-of-the-art networks.
Self-attention (SA) mechanisms can effectively capture global dependencies in deep neural networks, and have been applied successfully to natural language processing and image processing. However, SA modules for image reconstruction have high time and space complexity, which restricts their application to higher-resolution images. In this paper, we refine the SA module in self-attention generative adversarial networks (SAGAN) by adapting a non-local operation, revising the connectivity among the units in the SA module and re-implementing its computational pattern, such that its time and space complexity is reduced from $\text{O}(n^2)$ to $\text{O}(n)$ while it remains equivalent to the original SA module. Further, we explore the principles behind the module and discover that our module is a special kind of channel attention mechanism. Experimental results on two benchmark datasets for image reconstruction verify that, under the same computational environment, the two models achieve comparable effectiveness for image reconstruction, but the proposed one runs faster and takes up less memory.
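One way to see how such a complexity reduction is possible is the associativity of matrix products: if the attention map is a plain product with no softmax, then $(QK^\top)V$ equals $Q(K^\top V)$, and the second ordering never forms the $n \times n$ map. The sketch below assumes the softmax is replaced by a $1/n$ scaling, as in the dot-product form of the non-local operation; the paper's exact module may differ, and the function names are hypothetical.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(a):
    return [list(row) for row in zip(*a)]

def attention_quadratic(q, k, v):
    """(Q K^T) V / n: builds an n x n attention map, O(n^2) in the number of positions n."""
    n = len(q)
    scores = matmul(q, transpose(k))          # n x n
    return [[x / n for x in row] for row in matmul(scores, v)]

def attention_linear(q, k, v):
    """Q (K^T V) / n: builds only a d x d context matrix, O(n) in the number of positions n."""
    n = len(q)
    context = matmul(transpose(k), v)         # d x d
    return [[x / n for x in row] for row in matmul(q, context)]
```

Both functions return the same values up to floating-point rounding; only the order of the multiplications, and therefore the size of the intermediate matrix, differs.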
Pre-training has enabled state-of-the-art results on many tasks. In spite of its recognized contribution to generalization, we observed in this study that pre-training also transfers non-robustness from the pre-trained model to the fine-tuned model. Using image classification as an example, we first conducted experiments on various datasets and network backbones to explore the factors influencing robustness. Further analysis examines the difference between the fine-tuned model and the standard model to uncover the reason for the non-robustness transfer. Finally, we introduce a simple robust pre-training solution by regularizing the difference between target and source tasks. Results validate its effectiveness in alleviating non-robustness and preserving generalization.
Acquisition of Synthetic Aperture Sonar (SAS) datasets is bottlenecked by the costly deployment of SAS imaging systems, and even when data acquisition is possible, the data is often skewed towards containing barren seafloor rather than objects of interest. We present a novel pipeline, called SAS GAN, which couples an optical renderer with a generative adversarial network (GAN) to synthesize realistic SAS images of targets on the seafloor. This coupling enables high levels of SAS image realism while enabling control over image geometry and parameters. We demonstrate qualitative results by presenting examples of images created with our pipeline. We also present quantitative results through the use of t-SNE and the Fréchet Inception Distance to argue that our generated SAS imagery potentially augments SAS datasets more effectively than an off-the-shelf GAN.
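For reference, the Fréchet Inception Distance between a set of real images and a set of generated images is commonly defined as below. The notation is introduced here and does not appear in the abstract: $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ denote the mean and covariance of Inception features for the real and generated sets.

$$
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\big( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \big)
$$

Lower values indicate that the two feature distributions are closer.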