Given a composite image, image harmonization aims to adjust the foreground illumination to be consistent with background. Previous methods have explored transforming foreground features to achieve competitive performance. In this work, we show that using global information to guide foreground feature transformation could achieve significant improvement. Besides, we propose to transfer the foreground-background relation from real images to composite images, which can provide intermediate supervision for the transformed encoder features. Additionally, considering the drawbacks of existing harmonization datasets, we also contribute a ccHarmony dataset which simulates the natural illumination variation. Extensive experiments on iHarmony4 and our contributed dataset demonstrate the superiority of our method. Our ccHarmony dataset is released at https://github.com/bcmi/Image-Harmonization-Dataset-ccHarmony.
Remote sensing image super-resolution (RSISR) plays a vital role in enhancing spatial detials and improving the quality of satellite imagery. Recently, Transformer-based models have shown competitive performance in RSISR. To mitigate the quadratic computational complexity resulting from global self-attention, various methods constrain attention to a local window, enhancing its efficiency. Consequently, the receptive fields in a single attention layer are inadequate, leading to insufficient context modeling. Furthermore, while most transform-based approaches reuse shallow features through skip connections, relying solely on these connections treats shallow and deep features equally, impeding the model's ability to characterize them. To address these issues, we propose a novel transformer architecture called Cross-Spatial Pixel Integration and Cross-Stage Feature Fusion Based Transformer Network (SPIFFNet) for RSISR. Our proposed model effectively enhances global cognition and understanding of the entire image, facilitating efficient integration of features cross-stages. The model incorporates cross-spatial pixel integration attention (CSPIA) to introduce contextual information into a local window, while cross-stage feature fusion attention (CSFFA) adaptively fuses features from the previous stage to improve feature expression in line with the requirements of the current stage. We conducted comprehensive experiments on multiple benchmark datasets, demonstrating the superior performance of our proposed SPIFFNet in terms of both quantitative metrics and visual quality when compared to state-of-the-art methods.
We propose Hierarchical ProtoPNet: an interpretable network that explains its reasoning process by considering the hierarchical relationship between classes. Different from previous methods that explain their reasoning process by dissecting the input image and finding the prototypical parts responsible for the classification, we propose to explain the reasoning process for video action classification by dissecting the input video frames on multiple levels of the class hierarchy. The explanations leverage the hierarchy to deal with uncertainty, akin to human reasoning: When we observe water and human activity, but no definitive action it can be recognized as the water sports parent class. Only after observing a person swimming can we definitively refine it to the swimming action. Experiments on ActivityNet and UCF-101 show performance improvements while providing multi-level explanations.
Inharmonious region localization aims to localize the region in a synthetic image which is incompatible with surrounding background. The inharmony issue is mainly attributed to the color and illumination inconsistency produced by image editing techniques. In this work, we tend to transform the input image to another color space to magnify the domain discrepancy between inharmonious region and background, so that the model can identify the inharmonious region more easily. To this end, we present a novel framework consisting of a color mapping module and an inharmonious region localization network, in which the former is equipped with a novel domain discrepancy magnification loss and the latter could be an arbitrary localization network. Extensive experiments on image harmonization dataset show the superiority of our designed framework. Our code is available at https://github.com/bcmi/MadisNet-Inharmonious-Region-Localization.
Superimposing visible watermarks on images provides a powerful weapon to cope with the copyright issue. Watermark removal techniques, which can strengthen the robustness of visible watermarks in an adversarial way, have attracted increasing research interest. Modern watermark removal methods perform watermark localization and background restoration simultaneously, which could be viewed as a multi-task learning problem. However, existing approaches suffer from incomplete detected watermark and degraded texture quality of restored background. Therefore, we design a two-stage multi-task network to address the above issues. The coarse stage consists of a watermark branch and a background branch, in which the watermark branch self-calibrates the roughly estimated mask and passes the calibrated mask to background branch to reconstruct the watermarked area. In the refinement stage, we integrate multi-level features to improve the texture quality of watermarked area. Extensive experiments on two datasets demonstrate the effectiveness of our proposed method.
Variational Autoencoders (VAEs) hold great potential for modelling text, as they could in theory separate high-level semantic and syntactic properties from local regularities of natural language. Practically, however, VAEs with autoregressive decoders often suffer from posterior collapse, a phenomenon where the model learns to ignore the latent variables, causing the sequence VAE to degenerate into a language model. Previous works attempt to solve this problem with complex architectural changes or costly optimization schemes. In this paper, we argue that posterior collapse is caused in part by the encoder network failing to capture the input variabilities. We verify this hypothesis empirically and propose a straightforward fix using pooling. This simple technique effectively prevents posterior collapse, allowing the model to achieve significantly better data log-likelihood than standard sequence VAEs. Compared to the previous SOTA on preventing posterior collapse, we are able to achieve comparable performances while being significantly faster.
Coherence is an important aspect of text quality and is crucial for ensuring its readability. One important limitation of existing coherence models is that training on one domain does not easily generalize to unseen categories of text. Previous work advocates for generative models for cross-domain generalization, because for discriminative models, the space of incoherent sentence orderings to discriminate against during training is prohibitively large. In this work, we propose a local discriminative neural model with a much smaller negative sampling space that can efficiently learn against incorrect orderings. The proposed coherence model is simple in structure, yet it significantly outperforms previous state-of-art methods on a standard benchmark dataset on the Wall Street Journal corpus, as well as in multiple new challenging settings of transfer to unseen categories of discourse on Wikipedia articles.
In this paper, with respect to multichannel synthetic aperture radars (SAR), we first formulate the problems of Doppler ambiguities on the radial velocity (RV) estimation of a ground moving target in range-compressed domain, range-Doppler domain and image domain, respectively. It is revealed that in these problems, a cascaded time-space Doppler ambiguity (CTSDA) may encounter, i.e., time domain Doppler ambiguity (TDDA) in each channel arises first and then spatial domain Doppler ambiguity (SDDA) among multi-channels arises second. Accordingly, the multichannel SAR systems with different parameters are investigated in three different cases with diverse Doppler ambiguity properties, and a multi-frequency SAR is then proposed to obtain the RV estimation by solving the ambiguity problem based on Chinese remainder theorem (CRT). In the first two cases, the ambiguity problem can be solved by the existing closed-form robust CRT. In the third case, it is found that the problem is different from the conventional CRT problems and we call it a double remaindering problem in this paper. We then propose a sufficient condition under which the double remaindering problem, i.e., the CTSDA, can also be solved by the closed-form robust CRT. When the sufficient condition is not satisfied for a multi-channel SAR, a searching based method is proposed. Finally, some results of numerical experiments are provided to demonstrate the effectiveness of the proposed methods.
Recent work in learning vector-space embeddings for multi-relational data has focused on combining relational information derived from knowledge bases with distributional information derived from large text corpora. We propose a simple approach that leverages the descriptions of entities or phrases available in lexical resources, in conjunction with distributional semantics, in order to derive a better initialization for training relational models. Applying this initialization to the TransE model results in significant new state-of-the-art performances on the WordNet dataset, decreasing the mean rank from the previous best of 212 to 51. It also results in faster convergence of the entity representations. We find that there is a trade-off between improving the mean rank and the hits@10 with this approach. This illustrates that much remains to be understood regarding performance improvements in relational models.