"Image To Image Translation": models, code, and papers

Imagination improves Multimodal Translation

Jul 07, 2017
Desmond Elliott, Ákos Kádár

We decompose multimodal translation into two sub-tasks: learning to translate and learning visually grounded representations. In a multitask learning framework, translations are learned in an attention-based encoder-decoder, and grounded representations are learned through image representation prediction. Our approach improves translation performance compared to the state of the art on the Multi30K dataset. Furthermore, it is equally effective if we train the image prediction task on the external MS COCO dataset, and we find improvements if we train the translation model on the external News Commentary parallel text.

* Clarified main contributions, minor correction to Equation 8, additional comparisons in Table 2, added more related work 
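
The two sub-tasks described in the abstract above can be combined into a single training objective. The sketch below is a minimal illustration and not the authors' implementation: the weighted-sum form, the cosine-based margin loss, and all function and parameter names (`translation_nll`, `imagination_margin_loss`, `weight`, `margin`) are assumptions made for the example.

```python
import math


def translation_nll(token_probs):
    """Negative log-likelihood of the reference tokens under the decoder.

    `token_probs` holds the probability the decoder assigned to each
    reference token (hypothetical input for this sketch).
    """
    return -sum(math.log(p) for p in token_probs)


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def imagination_margin_loss(predicted, target, contrastive, margin=0.1):
    """Hinge loss: the predicted image vector should be closer to the true
    image vector than to each contrastive (other-image) vector by `margin`."""
    positive = cosine(predicted, target)
    return sum(
        max(0.0, margin - positive + cosine(predicted, other))
        for other in contrastive
    )


def multitask_loss(token_probs, predicted, target, contrastive, weight=0.5):
    """Weighted sum of the translation loss and the image-prediction loss."""
    return (weight * translation_nll(token_probs)
            + (1.0 - weight) * imagination_margin_loss(predicted, target, contrastive))
```

With `weight=1.0` the objective reduces to plain translation training, which makes the contribution of the image-prediction term easy to isolate in an ablation.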

Vit-GAN: Image-to-image Translation with Vision Transformers and Conditional GANs

Oct 11, 2021
Yiğit Gündüç

In this paper, we develop a general-purpose architecture, Vit-GAN, capable of performing most image-to-image translation tasks, from semantic image segmentation to single-image depth perception. The paper is a follow-up to, and an extension of, the generator-based model [1], whose results were very promising and opened the possibility of further improvement with an adversarial architecture. We use a unique vision-transformer-based generator architecture and conditional GANs (cGANs) with a Markovian discriminator (PatchGAN). In the present work, we use images as conditioning arguments. We observe that the obtained results are more realistic than those of commonly used architectures.
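
A Markovian (PatchGAN) discriminator scores overlapping image patches rather than the whole image, so each output score depends only on a limited receptive field. The sketch below computes that receptive field and the size of the score grid for a stack of convolutions. The layer configuration is the widely used 70x70 PatchGAN layout (kernel 4, strides 2, 2, 2, 1, 1, padding 1) and is an assumption for illustration; the abstract does not state which configuration Vit-GAN uses.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of one output unit.

    `layers` is a list of (kernel_size, stride) pairs, input to output.
    """
    field, jump = 1, 1
    for kernel, stride in layers:
        field += (kernel - 1) * jump  # each layer widens the field
        jump *= stride                # spacing of units, in input pixels
    return field


def output_size(input_size, layers, padding=1):
    """Side length of the discriminator's score grid for a square input."""
    size = input_size
    for kernel, stride in layers:
        size = (size + 2 * padding - kernel) // stride + 1
    return size


# Assumed layout: the common 70x70 PatchGAN (not taken from the abstract).
PATCHGAN_70 = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]

if __name__ == "__main__":
    print(receptive_field(PATCHGAN_70))    # patch size seen by each score
    print(output_size(256, PATCHGAN_70))   # grid side for a 256x256 input
```

Under these assumptions, each score covers a 70x70 patch, and a 256x256 input yields a 30x30 grid of real/fake decisions that are averaged into the discriminator loss.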


DeepFacePencil: Creating Face Images from Freehand Sketches

Aug 31, 2020
Yuhang Li, Xuejin Chen, Binxin Yang, Zihan Chen, Zhihua Cheng, Zheng-Jun Zha

In this paper, we explore the task of generating photo-realistic face images from hand-drawn sketches. Existing image-to-image translation methods require a large-scale dataset of paired sketches and images for supervision. They typically utilize synthesized edge maps of face images as training data. However, these synthesized edge maps strictly align with the edges of the corresponding face images, which limits their generalization to real hand-drawn sketches with vast stroke diversity. To address this problem, we propose DeepFacePencil, an effective tool that generates photo-realistic face images from hand-drawn sketches, based on a novel dual-generator image translation network used during training. A novel spatial attention pooling (SAP) module is designed to adaptively handle spatially varying stroke distortions, supporting various stroke styles and different levels of detail. We conduct extensive experiments, and the results demonstrate the superiority of our model over existing methods in both image quality and generalization to hand-drawn sketches.

* ACM MM 2020 (oral) 

Multi30K: Multilingual English-German Image Descriptions

May 02, 2016
Desmond Elliott, Stella Frank, Khalil Sima'an, Lucia Specia

We introduce the Multi30K dataset to stimulate multilingual multimodal research. Recent advances in image description have been demonstrated on English-language datasets almost exclusively, but image description should not be limited to English. This dataset extends the Flickr30K dataset with i) German translations created by professional translators over a subset of the English descriptions, and ii) descriptions crowdsourced independently of the original English descriptions. We outline how the data can be used for multilingual image description and multimodal machine translation, but we anticipate the data will be useful for a broader range of tasks.


Quantifying Translation-Invariance in Convolutional Neural Networks

Dec 10, 2017
Eric Kauderer-Abrams

A fundamental problem in object recognition is the development of image representations that are invariant to common transformations such as translation, rotation, and small deformations. There are multiple hypotheses regarding the source of translation invariance in CNNs. One idea is that translation invariance is due to the increasing receptive field size of neurons in successive convolution layers. Another possibility is that invariance is due to the pooling operation. We develop a simple tool, the translation-sensitivity map, which we use to visualize and quantify the translation invariance of various architectures. We obtain the surprising result that architectural choices such as the number of pooling layers and the convolution filter size have only a secondary effect on the translation invariance of a network. Our analysis identifies training data augmentation as the most important factor in obtaining translation-invariant representations of images using convolutional neural networks.
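
One way to picture a translation-sensitivity map is as a grid of distances between the representation of a base image and the representations of its shifted copies. The sketch below is a minimal illustration under that assumption; the Euclidean distance, the zero padding, and the toy feature functions `flatten` and `global_sum` are choices made for the example and are not taken from the paper.

```python
import math


def shift(image, dx, dy, fill=0.0):
    """Translate a 2-D image (list of rows) by (dx, dy), padding with `fill`."""
    height, width = len(image), len(image[0])
    out = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            src_y, src_x = y - dy, x - dx
            if 0 <= src_y < height and 0 <= src_x < width:
                out[y][x] = image[src_y][src_x]
    return out


def sensitivity_map(image, features, max_shift):
    """Distance between the features of `image` and of each shifted copy.

    Rows index dy and columns index dx, both from -max_shift to +max_shift.
    """
    base = features(image)
    offsets = range(-max_shift, max_shift + 1)
    return [
        [math.dist(base, features(shift(image, dx, dy))) for dx in offsets]
        for dy in offsets
    ]


def flatten(image):
    """Toy feature: raw pixels, fully sensitive to translation."""
    return [value for row in image for value in row]


def global_sum(image):
    """Toy feature: global sum pooling, invariant while content stays in frame."""
    return [sum(value for row in image for value in row)]
```

For a 5x5 image with a single bright pixel at the centre and `max_shift=1`, `flatten` gives zero only at the zero-shift cell, while `global_sum` gives zero everywhere; in practice `features` would be a layer of the trained network under study.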


Modular Generative Adversarial Networks

Apr 10, 2018
Bo Zhao, Bo Chang, Zequn Jie, Leonid Sigal

Existing methods for multi-domain image-to-image translation (or generation) attempt to directly map an input image (or a random vector) to an image in one of the output domains. However, most existing methods have limited scalability and robustness, since they require building independent models for each pair of domains in question. This leads to two significant shortcomings: (1) the need to train an exponential number of pairwise models, and (2) the inability to leverage data from other domains when training a particular pairwise mapping. Inspired by recent work on module networks, this paper proposes ModularGAN for multi-domain image generation and image-to-image translation. ModularGAN consists of several reusable and composable modules that carry out different functions (e.g., encoding, decoding, transformations). These modules can be trained simultaneously, leveraging data from all domains, and then combined to construct specific GAN networks at test time, according to the specific image translation task. This gives ModularGAN superior flexibility in generating (or translating to) an image in any desired domain. Experimental results demonstrate that our model not only presents compelling perceptual results but also outperforms state-of-the-art methods on multi-domain facial attribute transfer.


Review Neural Networks about Image Transformation Based on IGC Learning Framework with Annotated Information

Jun 21, 2022
Yuanjie Yan, Suorong Yang, Yan Wang, Jian Zhao, Furao Shen

Image transformation, a class of vision and graphics problems whose goal is to learn the mapping between an input image and an output image, has developed rapidly in the context of deep neural networks. In Computer Vision (CV), many problems can be regarded as image transformation tasks, e.g., semantic segmentation and style transfer. These works have different topics and motivations, which has made the field flourish. Existing surveys review only style transfer or only image-to-image translation, each of which is just one branch of image transformation. To the best of our knowledge, none of them summarizes these works in a unified framework. This paper proposes a novel learning framework comprising Independent learning, Guided learning, and Cooperative learning, called the IGC learning framework. The image transformation we discuss mainly involves general image-to-image translation and style transfer with deep neural networks. From the perspective of this framework, we review those subtasks and give a unified interpretation of various scenarios. We categorize related image transformation subtasks according to similar development trends. Furthermore, experiments have been performed to verify the effectiveness of IGC learning. Finally, new research directions and open problems are discussed for future research.


Cross-Domain Car Detection Using Unsupervised Image-to-Image Translation: From Day to Night

Jul 19, 2019
Vinicius F. Arruda, Thiago M. Paixão, Rodrigo F. Berriel, Alberto F. De Souza, Claudine Badue, Nicu Sebe, Thiago Oliveira-Santos

Deep learning techniques have enabled the emergence of state-of-the-art models for object detection tasks. However, these techniques are data-driven, so their accuracy depends on a training dataset that must resemble the images in the target task. Acquiring a dataset involves annotating images, an arduous and expensive process that generally requires time and manual effort. A challenging scenario therefore arises when the target domain of application has no annotated dataset available, forcing tasks in that situation to rely on a training dataset from a different domain. Object detection, a vital task for autonomous vehicles, shares this issue: the large number of driving scenarios yields several domains of application, each requiring annotated data for the training process. In this work, we present a method for training a car detection system with annotated data from a source domain (day images) without requiring image annotations for the target domain (night images). To that end, a model based on Generative Adversarial Networks (GANs) is explored to enable the generation of an artificial dataset with its respective annotations. The artificial dataset (fake dataset) is created by translating images from the day-time domain to the night-time domain. The fake dataset, which comprises annotated images of only the target domain (night images), is then used to train the car detector model. Experimental results show that the proposed method achieves significant and consistent improvements, including an increase of more than 10% in detection performance compared with training on only the available annotated data (i.e., day images).

* 8 pages, 8 figures, and accepted at IJCNN 2019 

DRIT++: Diverse Image-to-Image Translation via Disentangled Representations

May 02, 2019
Hsin-Ying Lee, Hung-Yu Tseng, Qi Mao, Jia-Bin Huang, Yu-Ding Lu, Maneesh Singh, Ming-Hsuan Yang

Image-to-image translation aims to learn the mapping between two visual domains. There are two main challenges for this task: 1) lack of aligned training pairs and 2) multiple possible outputs from a single input image. In this work, we present an approach based on disentangled representation for generating diverse outputs without paired training images. To synthesize diverse outputs, we propose to embed images onto two spaces: a domain-invariant content space capturing shared information across domains and a domain-specific attribute space. Our model takes the encoded content features extracted from a given input and attribute vectors sampled from the attribute space to synthesize diverse outputs at test time. To handle unpaired training data, we introduce a cross-cycle consistency loss based on disentangled representations. Qualitative results show that our model can generate diverse and realistic images on a wide range of tasks without paired training data. For quantitative evaluations, we measure realism with user study and Fréchet inception distance, and measure diversity with the perceptual distance metric, Jensen-Shannon divergence, and number of statistically-different bins.

* Journal extension for ECCV 2018 "Diverse Image-to-Image Translation via Disentangled Representations" arXiv:1808.00948.
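
The cross-cycle consistency idea from the DRIT++ abstract can be sketched as two rounds of attribute swapping: after swapping attributes between two images and then swapping them back, each image should be reconstructed. The code below is a simplified illustration, not the authors' implementation. It assumes a single shared content encoder, attribute encoder, and generator for both domains, an L1 reconstruction penalty, and toy stand-in functions (`toy_content`, `toy_attr`, `toy_generate`, `leaky_generate`) that exist only for this example.

```python
def l1(a, b):
    """Sum of absolute differences between two equal-length vectors."""
    return sum(abs(p - q) for p, q in zip(a, b))


def cross_cycle_loss(x, y, enc_content, enc_attr, generate):
    """Swap attributes between x and y twice and penalise reconstruction error."""
    # First translation: exchange attribute codes.
    u = generate(enc_content(y), enc_attr(x))  # y's content, x's attribute
    v = generate(enc_content(x), enc_attr(y))  # x's content, y's attribute
    # Second translation: exchange again, which should recover the inputs.
    x_hat = generate(enc_content(v), enc_attr(u))
    y_hat = generate(enc_content(u), enc_attr(v))
    return l1(x_hat, x) + l1(y_hat, y)


# Toy stand-ins: the first two values are "content", the last two "attribute".
def toy_content(image):
    return image[:2]


def toy_attr(image):
    return image[2:]


def toy_generate(content, attr):
    return content + attr


def leaky_generate(content, attr):
    """A deliberately broken generator that discards the attribute code."""
    return content + [0.0 for _ in attr]
```

With the perfectly disentangled toy functions the loss is zero, while the generator that ignores the attribute code is penalised, which is the behaviour the loss is meant to enforce on learned encoders and generators.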

SDA-GAN: Unsupervised Image Translation Using Spectral Domain Attention-Guided Generative Adversarial Network

Oct 06, 2021
Qizhou Wang, Maksim Makarenko

This work introduces a novel GAN architecture for unsupervised image translation on the task of face style transformation. A spectral attention-based mechanism is embedded into the design along with spatial attention on the image contents. We show that a neural network has the potential to learn complex transformations such as the Fourier transform, within considerable computational cost. The model is trained and tested against a baseline model that uses only spatial attention. The performance improvement of our approach is significant, especially when the source and target domains differ in complexity (FID reduced from 142.84 to 49.18). In the translation process, a spectra-filling effect is introduced by the use of the FFT and spectral attention. Another style transfer task and real-world object translation are also studied in this paper.

* 7 pages, 3 figures