Alert button

"Image To Image Translation": models, code, and papers
Alert button

STEREOFOG -- Computational DeFogging via Image-to-Image Translation on a real-world Dataset

Dec 04, 2023
Anton Pollak, Rajesh Menon

Image-to-Image translation (I2I) is a subtype of Machine Learning (ML) that has tremendous potential in applications where two domains of images and the need for translation between the two exist, such as the removal of fog. For example, this could be useful for autonomous vehicles, which currently struggle with adverse weather conditions like fog. However, datasets for I2I tasks are not abundant and typically hard to acquire. Here, we introduce STEREOFOG, a dataset comprised of $10,067$ paired fogged and clear images, captured using a custom-built device, with the purpose of exploring I2I's potential in this domain. It is the only real-world dataset of this kind to the best of our knowledge. Furthermore, we apply and optimize the pix2pix I2I ML framework to this dataset. With the final model achieving an average Complex Wavelet-Structural Similarity (CW-SSIM) score of $0.76$, we prove the technique's suitability for the problem.

* 7 pages, 7 figures, for associated dataset and Supplement file, see 
Viaarxiv icon

S2ST: Image-to-Image Translation in the Seed Space of Latent Diffusion

Nov 30, 2023
Or Greenberg, Eran Kishon, Dani Lischinski

Image-to-image translation (I2IT) refers to the process of transforming images from a source domain to a target domain while maintaining a fundamental connection in terms of image content. In the past few years, remarkable advancements in I2IT were achieved by Generative Adversarial Networks (GANs), which nevertheless struggle with translations requiring high precision. Recently, Diffusion Models have established themselves as the engine of choice for image generation. In this paper we introduce S2ST, a novel framework designed to accomplish global I2IT in complex photorealistic images, such as day-to-night or clear-to-rain translations of automotive scenes. S2ST operates within the seed space of a Latent Diffusion Model, thereby leveraging the powerful image priors learned by the latter. We show that S2ST surpasses state-of-the-art GAN-based I2IT methods, as well as diffusion-based approaches, for complex automotive scenes, improving fidelity while respecting the target domain's appearance across a variety of domains. Notably, S2ST obviates the necessity for training domain-specific translation networks.

* 17 pages, 15 figures 
Viaarxiv icon

Open-DDVM: A Reproduction and Extension of Diffusion Model for Optical Flow Estimation

Dec 04, 2023
Qiaole Dong, Bo Zhao, Yanwei Fu

Recently, Google proposes DDVM which for the first time demonstrates that a general diffusion model for image-to-image translation task works impressively well on optical flow estimation task without any specific designs like RAFT. However, DDVM is still a closed-source model with the expensive and private Palette-style pretraining. In this technical report, we present the first open-source DDVM by reproducing it. We study several design choices and find those important ones. By training on 40k public data with 4 GPUs, our reproduction achieves comparable performance to the closed-source DDVM. The code and model have been released in

* Technical Report 
Viaarxiv icon

Contrastive Denoising Score for Text-guided Latent Diffusion Image Editing

Nov 30, 2023
Hyelin Nam, Gihyun Kwon, Geon Yeong Park, Jong Chul Ye

With the remarkable advent of text-to-image diffusion models, image editing methods have become more diverse and continue to evolve. A promising recent approach in this realm is Delta Denoising Score (DDS) - an image editing technique based on Score Distillation Sampling (SDS) framework that leverages the rich generative prior of text-to-image diffusion models. However, relying solely on the difference between scoring functions is insufficient for preserving specific structural elements from the original image, a crucial aspect of image editing. Inspired by the similarity and importance differences between DDS and the contrastive learning for unpaired image-to-image translation (CUT), here we present an embarrassingly simple yet very powerful modification of DDS, called Contrastive Denoising Score (CDS), for latent diffusion models (LDM). Specifically, to enforce structural correspondence between the input and output while maintaining the controllability of contents, we introduce a straightforward approach to regulate structural consistency using CUT loss within the DDS framework. To calculate this loss, instead of employing auxiliary networks, we utilize the intermediate features of LDM, in particular, those from the self-attention layers, which possesses rich spatial information. Our approach enables zero-shot image-to-image translation and neural radiance field (NeRF) editing, achieving a well-balanced interplay between maintaining the structural details and transforming content. Qualitative results and comparisons demonstrates the effectiveness of our proposed method. Project page with code is available at

* Project page: 
Viaarxiv icon

InvertAvatar: Incremental GAN Inversion for Generalized Head Avatars

Dec 03, 2023
Xiaochen Zhao, Jingxiang Sun, Lizhen Wang, Yebin Liu

While high fidelity and efficiency are central to the creation of digital head avatars, recent methods relying on 2D or 3D generative models often experience limitations such as shape distortion, expression inaccuracy, and identity flickering. Additionally, existing one-shot inversion techniques fail to fully leverage multiple input images for detailed feature extraction. We propose a novel framework, \textbf{Incremental 3D GAN Inversion}, that enhances avatar reconstruction performance using an algorithm designed to increase the fidelity from multiple frames, resulting in improved reconstruction quality proportional to frame count. Our method introduces a unique animatable 3D GAN prior with two crucial modifications for enhanced expression controllability alongside an innovative neural texture encoder that categorizes texture feature spaces based on UV parameterization. Differentiating from traditional techniques, our architecture emphasizes pixel-aligned image-to-image translation, mitigating the need to learn correspondences between observation and canonical spaces. Furthermore, we incorporate ConvGRU-based recurrent networks for temporal data aggregation from multiple frames, boosting geometry and texture detail reconstruction. The proposed paradigm demonstrates state-of-the-art performance on one-shot and few-shot avatar animation tasks.

Viaarxiv icon

Adaptive Latent Diffusion Model for 3D Medical Image to Image Translation: Multi-modal Magnetic Resonance Imaging Study

Nov 01, 2023
Jonghun Kim, Hyunjin Park

Figure 1 for Adaptive Latent Diffusion Model for 3D Medical Image to Image Translation: Multi-modal Magnetic Resonance Imaging Study
Figure 2 for Adaptive Latent Diffusion Model for 3D Medical Image to Image Translation: Multi-modal Magnetic Resonance Imaging Study
Figure 3 for Adaptive Latent Diffusion Model for 3D Medical Image to Image Translation: Multi-modal Magnetic Resonance Imaging Study
Figure 4 for Adaptive Latent Diffusion Model for 3D Medical Image to Image Translation: Multi-modal Magnetic Resonance Imaging Study

Multi-modal images play a crucial role in comprehensive evaluations in medical image analysis providing complementary information for identifying clinically important biomarkers. However, in clinical practice, acquiring multiple modalities can be challenging due to reasons such as scan cost, limited scan time, and safety considerations. In this paper, we propose a model based on the latent diffusion model (LDM) that leverages switchable blocks for image-to-image translation in 3D medical images without patch cropping. The 3D LDM combined with conditioning using the target modality allows generating high-quality target modality in 3D overcoming the shortcoming of the missing out-of-slice information in 2D generation methods. The switchable block, noted as multiple switchable spatially adaptive normalization (MS-SPADE), dynamically transforms source latents to the desired style of the target latents to help with the diffusion process. The MS-SPADE block allows us to have one single model to tackle many translation tasks of one source modality to various targets removing the need for many translation models for different scenarios. Our model exhibited successful image synthesis across different source-target modality scenarios and surpassed other models in quantitative evaluations tested on multi-modal brain magnetic resonance imaging datasets of four different modalities and an independent IXI dataset. Our model demonstrated successful image synthesis across various modalities even allowing for one-to-many modality translations. Furthermore, it outperformed other one-to-one translation models in quantitative evaluations.

* 8 pages, 7 figures, WACV 2024 Accepted 
Viaarxiv icon

H-Packer: Holographic Rotationally Equivariant Convolutional Neural Network for Protein Side-Chain Packing

Nov 28, 2023
Gian Marco Visani, William Galvin, Michael Neal Pun, Armita Nourmohammad

Accurately modeling protein 3D structure is essential for the design of functional proteins. An important sub-task of structure modeling is protein side-chain packing: predicting the conformation of side-chains (rotamers) given the protein's backbone structure and amino-acid sequence. Conventional approaches for this task rely on expensive sampling procedures over hand-crafted energy functions and rotamer libraries. Recently, several deep learning methods have been developed to tackle the problem in a data-driven way, albeit with vastly different formulations (from image-to-image translation to directly predicting atomic coordinates). Here, we frame the problem as a joint regression over the side-chains' true degrees of freedom: the dihedral $\chi$ angles. We carefully study possible objective functions for this task, while accounting for the underlying symmetries of the task. We propose Holographic Packer (H-Packer), a novel two-stage algorithm for side-chain packing built on top of two light-weight rotationally equivariant neural networks. We evaluate our method on CASP13 and CASP14 targets. H-Packer is computationally efficient and shows favorable performance against conventional physics-based algorithms and is competitive against alternative deep learning solutions.

* Accepted as a conference paper at MLCB 2023. 8 pages main body, 20 pages with appendix. 10 figures 
Viaarxiv icon

A deep learning approach for marine snow synthesis and removal

Nov 27, 2023
Fernando Galetto, Guang Deng

Marine snow, the floating particles in underwater images, severely degrades the visibility and performance of human and machine vision systems. This paper proposes a novel method to reduce the marine snow interference using deep learning techniques. We first synthesize realistic marine snow samples by training a Generative Adversarial Network (GAN) model and combine them with natural underwater images to create a paired dataset. We then train a U-Net model to perform marine snow removal as an image to image translation task. Our experiments show that the U-Net model can effectively remove both synthetic and natural marine snow with high accuracy, outperforming state-of-the-art methods such as the Median filter and its adaptive variant. We also demonstrate the robustness of our method by testing it on the MSRB dataset, which contains synthetic artifacts that our model has not seen during training. Our method is a practical and efficient solution for enhancing underwater images affected by marine snow.

Viaarxiv icon

Towards Unsupervised Representation Learning: Learning, Evaluating and Transferring Visual Representations

Nov 30, 2023
Bonifaz Stuhr

Unsupervised representation learning aims at finding methods that learn representations from data without annotation-based signals. Abstaining from annotations not only leads to economic benefits but may - and to some extent already does - result in advantages regarding the representation's structure, robustness, and generalizability to different tasks. In the long run, unsupervised methods are expected to surpass their supervised counterparts due to the reduction of human intervention and the inherently more general setup that does not bias the optimization towards an objective originating from specific annotation-based signals. While major advantages of unsupervised representation learning have been recently observed in natural language processing, supervised methods still dominate in vision domains for most tasks. In this dissertation, we contribute to the field of unsupervised (visual) representation learning from three perspectives: (i) Learning representations: We design unsupervised, backpropagation-free Convolutional Self-Organizing Neural Networks (CSNNs) that utilize self-organization- and Hebbian-based learning rules to learn convolutional kernels and masks to achieve deeper backpropagation-free models. (ii) Evaluating representations: We build upon the widely used (non-)linear evaluation protocol to define pretext- and target-objective-independent metrics for measuring and investigating the objective function mismatch between various unsupervised pretext tasks and target tasks. (iii) Transferring representations: We contribute CARLANE, the first 3-way sim-to-real domain adaptation benchmark for 2D lane detection, and a method based on prototypical self-supervised learning. Finally, we contribute a content-consistent unpaired image-to-image translation method that utilizes masks, global and local discriminators, and similarity sampling to mitigate content inconsistencies.

* PhD Thesis, 223 pages, Abstract in English, Spanish and Catalan, 4 appendices 
Viaarxiv icon

Fine-grained Appearance Transfer with Diffusion Models

Nov 27, 2023
Yuteng Ye, Guanwen Li, Hang Zhou, Cai Jiale, Junqing Yu, Yawei Luo, Zikai Song, Qilong Xing, Youjia Zhang, Wei Yang

Image-to-image translation (I2I), and particularly its subfield of appearance transfer, which seeks to alter the visual appearance between images while maintaining structural coherence, presents formidable challenges. Despite significant advancements brought by diffusion models, achieving fine-grained transfer remains complex, particularly in terms of retaining detailed structural elements and ensuring information fidelity. This paper proposes an innovative framework designed to surmount these challenges by integrating various aspects of semantic matching, appearance transfer, and latent deviation. A pivotal aspect of our approach is the strategic use of the predicted $x_0$ space by diffusion models within the latent space of diffusion processes. This is identified as a crucial element for the precise and natural transfer of fine-grained details. Our framework exploits this space to accomplish semantic alignment between source and target images, facilitating mask-wise appearance transfer for improved feature acquisition. A significant advancement of our method is the seamless integration of these features into the latent space, enabling more nuanced latent deviations without necessitating extensive model retraining or fine-tuning. The effectiveness of our approach is demonstrated through extensive experiments, which showcase its ability to adeptly handle fine-grained appearance transfers across a wide range of categories and domains. We provide our code at

* 14 pages, 15 figures 
Viaarxiv icon