Interest in image-to-image translation has grown substantially in recent years with the success of unsupervised models based on the cycle-consistency assumption. The achievements of these models have been limited to a particular subset of domains where this assumption yields good results, namely homogeneous domains that are characterized by style or texture differences. We tackle the challenging problem of image-to-image translation where the domains are defined by high-level shapes and contexts, as well as including significant clutter and heterogeneity. For this purpose, we introduce a novel GAN based on preserving intra-domain vector transformations in a latent space learned by a siamese network. The traditional GAN system introduced a discriminator network to guide the generator into generating images in the target domain. To this two-network system we add a third: a siamese network that guides the generator so that each original image shares semantics with its generated version. With this new three-network system, we no longer need to constrain the generators with the ubiquitous cycle-consistency restraint. As a result, the generators can learn mappings between more complex domains that differ from each other by large differences - not just style or texture.
Manipulating visual attributes of images through human-written text is a very challenging task. On the one hand, models have to learn the manipulation without the ground truth of the desired output. On the other hand, models have to deal with the inherent ambiguity of natural language. Previous research usually requires either the user to describe all the characteristics of the desired image or to use richly-annotated image captioning datasets. In this work, we propose a novel unsupervised approach, based on image-to-image translation, that alters the attributes of a given image through a command-like sentence such as "change the hair color to black". Contrarily to state-of-the-art approaches, our model does not require a human-annotated dataset nor a textual description of all the attributes of the desired image, but only those that have to be modified. Our proposed model disentangles the image content from the visual attributes, and it learns to modify the latter using the textual description, before generating a new image from the content and the modified attribute representation. Because text might be inherently ambiguous (blond hair may refer to different shadows of blond, e.g. golden, icy, sandy), our method generates multiple stochastic versions of the same translation. Experiments show that the proposed model achieves promising performances on two large-scale public datasets: CelebA and CUB. We believe our approach will pave the way to new avenues of research combining textual and speech commands with visual attributes.
Existing deep learning-based image deraining methods have achieved promising performance for synthetic rainy images, typically rely on the pairs of sharp images and simulated rainy counterparts. However, these methods suffer from significant performance drop when facing the real rain, because of the huge gap between the simplified synthetic rain and the complex real rain. In this work, we argue that the rain generation and removal are the two sides of the same coin and should be tightly coupled. To close the loop, we propose to jointly learn real rain generation and removal procedure within a unified disentangled image translation framework. Specifically, we propose a bidirectional disentangled translation network, in which each unidirectional network contains two loops of joint rain generation and removal for both the real and synthetic rain image, respectively. Meanwhile, we enforce the disentanglement strategy by decomposing the rainy image into a clean background and rain layer (rain removal), in order to better preserve the identity background via both the cycle-consistency loss and adversarial loss, and ease the rain layer translating between the real and synthetic rainy image. A counterpart composition with the entanglement strategy is symmetrically applied for rain generation. Extensive experiments on synthetic and real-world rain datasets show the superiority of proposed method compared to state-of-the-arts.
The image-to-image translation is a learning task to establish a visual mapping between an input and output image. The task has several variations differentiated based on the purpose of the translation, such as synthetic to real translation, photo to caricature translation, and many others. The problem has been tackled using different approaches, either through traditional computer vision methods, as well as deep learning approaches in recent trends. One approach currently deemed popular and effective is using the conditional generative adversarial network, also known shortly as cGAN. It is adapted to perform image-to-image translation tasks with typically two networks: a generator and a discriminator. This project aims to evaluate the robustness of the Pix2Pix model by applying the Pix2Pix model to datasets consisting of cartoonized images. Using the Pix2Pix model, it should be possible to train the network to generate real-life images from the cartoonized images.
Recent image-to-image (I2I) translation algorithms focus on learning the mapping from a source to a target domain. However, the continuous translation problem that synthesizes intermediate results between the two domains has not been well-studied in the literature. Generating a smooth sequence of intermediate results bridges the gap of two different domains, facilitating the morphing effect across domains. Existing I2I approaches are limited to either intra-domain or deterministic inter-domain continuous translation. In this work, we present an effective signed attribute vector, which enables continuous translation on diverse mapping paths across various domains. In particular, utilizing the sign operation to encode the domain information, we introduce a unified attribute space shared by all domains, thereby allowing the interpolation on attribute vectors of different domains. To enhance the visual quality of continuous translation results, we generate a trajectory between two sign-symmetrical attribute vectors and leverage the domain information of the interpolated results along the trajectory for adversarial training. We evaluate the proposed method on a wide range of I2I translation tasks. Both qualitative and quantitative results demonstrate that the proposed framework generates more high-quality continuous translation results against the state-of-the-art methods.
Inter-scanner and inter-protocol discrepancy in MRI datasets are known to lead to significant quantification variability. Hence image-to-image or scanner-to-scanner translation is a crucial frontier in the area of medical image analysis with a lot of potential applications. Nonetheless, a significant percentage of existing algorithms cannot explicitly exploit and preserve texture details from target scanners and offers individual solutions towards specialized task-specific architectures. In this paper, we design a multi-scale texture transfer to enrich the reconstruction images with more details. Specifically, after calculating textural similarity, the multi-scale texture can adaptively transfer the texture information from target images or reference images to restored images. Different from the pixel-wise matching space as done by previous algorithms, we match texture features in a multi-scale scheme implemented in the neural space. The matching mechanism can exploit multi-scale neural transfer that encourages the model to grasp more semantic-related and lesion-related priors from the target or reference images. We evaluate our multi-scale texture GAN on three different tasks without any task-specific modifications: cross-protocol super-resolution of diffusion MRI, T1-Flair, and Flair-T2 modality translation. Our multi-texture GAN rehabilitates more high-resolution structures (i.e., edges and anatomy), texture (i.e., contrast and pixel intensities), and lesion information (i.e., tumor). The extensively quantitative and qualitative experiments demonstrate that our method achieves superior results in inter-protocol or inter-scanner translation over state-of-the-art methods.
This paper deals with area-based subpixel image registration under rotation-isometric scaling-translation transformation hypothesis. Our approach is based on a parametrical modeling of geometrically transformed textural image fragments and maximum likelihood estimation of transformation vector between them. Due to the parametrical approach based on the fractional Brownian motion modeling of the local fragments texture, the proposed estimator MLfBm (ML stands for "Maximum Likelihood" and fBm for "Fractal Brownian motion") has the ability to better adapt to real image texture content compared to other methods relying on universal similarity measures like mutual information or normalized correlation. The main benefits are observed when assumptions underlying the fBm model are fully satisfied, e.g. for isotropic normally distributed textures with stationary increments. Experiments on both simulated and real images and for high and weak correlation between registered images show that the MLfBm estimator offers significant improvement compared to other state-of-the-art methods. It reduces translation vector, rotation angle and scaling factor estimation errors by a factor of about 1.75...2 and it decreases probability of false match by up to 5 times. Besides, an accurate confidence interval for MLfBm estimates can be obtained from the Cramer-Rao lower bound on rotation-scaling-translation parameters estimation error. This bound depends on texture roughness, noise level in reference and template images, correlation between these images and geometrical transformation parameters.
State-of-the-art methods in the unpaired image-to-image translation are capable of learning a mapping from a source domain to a target domain with unpaired image data. Though the existing methods have achieved promising results, they still produce unsatisfied artifacts, being able to convert low-level information while limited in transforming high-level semantics of input images. One possible reason is that generators do not have the ability to perceive the most discriminative semantic parts between the source and target domains, thus making the generated images low quality. In this paper, we propose a new Attention-Guided Generative Adversarial Networks (AttentionGAN) for the unpaired image-to-image translation task. AttentionGAN can identify the most discriminative semantic objects and minimize changes of unwanted parts for semantic manipulation problems without using extra data and models. The attention-guided generators in AttentionGAN are able to produce attention masks via a built-in attention mechanism, and then fuse the generation output with the attention masks to obtain high-quality target images. Accordingly, we also design a novel attention-guided discriminator which only considers attended regions. Extensive experiments are conducted on several generative tasks, demonstrating that the proposed model is effective to generate sharper and more realistic images compared with existing competitive models. The source code for the proposed AttentionGAN is available at https://github.com/Ha0Tang/AttentionGAN.