We propose a method to translate cartoon images to real world images using Generative Aderserial Network (GAN). Existing GAN-based image-to-image translation methods which are trained on paired datasets are impractical as the data is difficult to accumulate. Therefore, in this paper we exploit the Cycle-Consistent Adversarial Networks (CycleGAN) method for images translation which needs an unpaired dataset. By applying CycleGAN we show that our model is able to generate meaningful real world images from cartoon images. However, we implement another state of the art technique $-$ Deep Analogy $-$ to compare the performance of our approach.
Image-to-image translation aims at translating a particular style of an image to another. The synthesized images can be more photo-realistic and identity-preserving by decomposing the image into content and style in a disentangled manner. While existing models focus on designing specialized network architecture to separate the two components, this paper investigates how to explicitly constrain the content and style statistics of images. We achieve this goal by transforming the input image into high frequency and low frequency information, which correspond to the content and style, respectively. We regulate the frequency distribution from two aspects: a) a spatial level restriction to locally restrict the frequency distribution of images; b) a spectral level regulation to enhance the global consistency among images. On multiple datasets we show that the proposed approach consistently leads to significant improvements on top of various state-of-the-art image translation models.
While attention-based transformer networks achieve unparalleled success in nearly all language tasks, the large number of tokens coupled with the quadratic activation memory usage makes them prohibitive for visual tasks. As such, while language-to-language translation has been revolutionized by the transformer model, convolutional networks remain the de facto solution for image-to-image translation. The recently proposed MLP-Mixer architecture alleviates some of the speed and memory issues associated with attention-based networks while still retaining the long-range connections that make transformer models desirable. Leveraging this efficient alternative to self-attention, we propose a new unpaired image-to-image translation model called MixerGAN: a simpler MLP-based architecture that considers long-distance relationships between pixels without the need for expensive attention mechanisms. Quantitative and qualitative analysis shows that MixerGAN achieves competitive results when compared to prior convolutional-based methods.
Supervision for image-to-image translation (I2I) tasks is hard to come by, but bears significant effect on the resulting quality. In this paper, we observe that for many Unsupervised I2I (UI2I) scenarios, one domain is more familiar than the other, and offers in-domain prior knowledge, such as semantic segmentation. We argue that for complex scenes, figuring out the semantic structure of the domain is hard, especially with no supervision, but is an important part of a successful I2I operation. We hence introduce two techniques to incorporate this invaluable in-domain prior knowledge for the benefit of translation quality: through a novel Multi-Stream generator architecture, and through a semantic segmentation-based regularization loss term. In essence, we propose splitting the input data according to semantic masks, explicitly guiding the network to different behavior for the different regions of the image. In addition, we propose training a semantic segmentation network along with the translation task, and to leverage this output as a loss term that improves robustness. We validate our approach on urban data, demonstrating superior quality in the challenging UI2I tasks of converting day images to night ones. In addition, we also demonstrate how reinforcing the target dataset with our augmented images improves the training of downstream tasks such as the classical detection one.
We address the task of multi-view image-to-image translation for person image generation. The goal is to synthesize photo-realistic multi-view images with pose-consistency across all views. Our proposed end-to-end framework is based on a joint learning of multiple unpaired image-to-image translation models, one per camera viewpoint. The joint learning is imposed by constraints on the shared 3D human pose in order to encourage the 2D pose projections in all views to be consistent. Experimental results on the CMU-Panoptic dataset demonstrate the effectiveness of the suggested framework in generating photo-realistic images of persons with new poses that are more consistent across all views in comparison to a standard Image-to-Image baseline. The code is available at: https://github.com/sony-si/MultiView-Img2Img
Unsupervised image translation, which aims in translating two independent sets of images, is challenging in discovering the correct correspondences without paired data. Existing works build upon Generative Adversarial Network (GAN) such that the distribution of the translated images are indistinguishable from the distribution of the target set. However, such set-level constraints cannot learn the instance-level correspondences (e.g. aligned semantic parts in object configuration task). This limitation often results in false positives (e.g. geometric or semantic artifacts), and further leads to mode collapse problem. To address the above issues, we propose a novel framework for instance-level image translation by Deep Attention GAN (DA-GAN). Such a design enables DA-GAN to decompose the task of translating samples from two sets into translating instances in a highly-structured latent space. Specifically, we jointly learn a deep attention encoder, and the instancelevel correspondences could be consequently discovered through attending on the learned instance pairs. Therefore, the constraints could be exploited on both set-level and instance-level. Comparisons against several state-ofthe- arts demonstrate the superiority of our approach, and the broad application capability, e.g, pose morphing, data augmentation, etc., pushes the margin of domain translation problem.
Sketch-to-image (S2I) translation plays an important role in image synthesis and manipulation tasks, such as photo editing and colorization. Some specific S2I translation including sketch-to-photo and sketch-to-painting can be used as powerful tools in the art design industry. However, previous methods only support S2I translation with a single level of density, which gives less flexibility to users for controlling the input sketches. In this work, we propose the first multi-level density sketch-to-image translation framework, which allows the input sketch to cover a wide range from rough object outlines to micro structures. Moreover, to tackle the problem of noncontinuous representation of multi-level density input sketches, we project the density level into a continuous latent space, which can then be linearly controlled by a parameter. This allows users to conveniently control the densities of input sketches and generation of images. Moreover, our method has been successfully verified on various datasets for different applications including face editing, multi-modal sketch-to-photo translation, and anime colorization, providing coarse-to-fine levels of controls to these applications.
We present a general framework for exemplar-based image translation, which synthesizes a photo-realistic image from the input in a distinct domain (e.g., semantic segmentation mask, or edge map, or pose keypoints), given an exemplar image. The output has the style (e.g., color, texture) in consistency with the semantically corresponding objects in the exemplar. We propose to jointly learn the crossdomain correspondence and the image translation, where both tasks facilitate each other and thus can be learned with weak supervision. The images from distinct domains are first aligned to an intermediate domain where dense correspondence is established. Then, the network synthesizes images based on the appearance of semantically corresponding patches in the exemplar. We demonstrate the effectiveness of our approach in several image translation tasks. Our method is superior to state-of-the-art methods in terms of image quality significantly, with the image style faithful to the exemplar with semantic consistency. Moreover, we show the utility of our method for several applications
Unpaired image-to-image (I2I) translation has received considerable attention in pattern recognition and computer vision because of recent advancements in generative adversarial networks (GANs). However, due to the lack of explicit supervision, unpaired I2I models often fail to generate realistic images, especially in challenging datasets with different backgrounds and poses. Hence, stabilization is indispensable for real-world applications and GANs. Herein, we propose Augmented Cyclic Consistency Regularization (ACCR), a novel regularization method for unpaired I2I translation. Our main idea is to enforce consistency regularization originating from semi-supervised learning on the discriminators leveraging real, fake, reconstructed, and augmented samples. We regularize the discriminators to output similar predictions when fed pairs of original and perturbed images. We qualitatively clarify the generation property between unpaired I2I models and standard GANs, and explain why consistency regularization on fake and reconstructed samples works well. Quantitatively, our method outperforms the consistency regularized GAN (CR-GAN) in digit translations and demonstrates efficacy against several data augmentation variants and cycle-consistent constraints.
The performance of multi-domain image-to-image translation has been significantly improved by recent progress in deep generative models. Existing approaches can use a unified model to achieve translations between all the visual domains. However, their outcomes are far from satisfying when there are large domain variations. In this paper, we reveal that improving the strategy of sample selection is an effective solution. To select informative samples, we dynamically estimate sample importance during the training of Generative Adversarial Networks, presenting Informative Sample Mining Network. We theoretically analyze the relationship between the sample importance and the prediction of the global optimal discriminator. Then a practical importance estimation function based on general discriminators is derived. In addition, we propose a novel multi-stage sample training scheme to reduce sample hardness while preserving sample informativeness. Extensive experiments on a wide range of specific image-to-image translation tasks are conducted, and the results demonstrate our superiority over current state-of-the-art methods.