Jong Chul Ye

KAIST Graduate School of AI

Deep Diffusion Image Prior for Efficient OOD Adaptation in 3D Inverse Problems

Jul 15, 2024

Hyungjin Chung, Jong Chul Ye

Abstract:Recent inverse problem solvers that leverage generative diffusion priors have garnered significant attention due to their exceptional quality. However, adaptation of the prior is necessary when there exists a discrepancy between the training and testing distributions. In this work, we propose deep diffusion image prior (DDIP), which generalizes the recent adaptation method of SCD by introducing a formal connection to the deep image prior. Under this framework, we propose an efficient adaptation method dubbed D3IP, specified for 3D measurements, which accelerates DDIP by orders of magnitude while achieving superior performance. D3IP enables seamless integration of 3D inverse solvers and thus leads to coherent 3D reconstruction. Moreover, we show that meta-learning techniques can also be applied to yield even better performance. We show that our method is capable of solving diverse 3D reconstructive tasks from the generative prior trained only with phantom images that are vastly different from the training set, opening up new opportunities of applying diffusion inverse solvers even when training with gold standard data is impossible. Code: https://github.com/HJ-harry/DDIP3D

* ECCV 2024, 25 pages, 8 figures

CFG++: Manifold-constrained Classifier Free Guidance for Diffusion Models

Jun 12, 2024

Hyungjin Chung, Jeongsol Kim, Geon Yeong Park, Hyelin Nam, Jong Chul Ye

Abstract:Classifier-free guidance (CFG) is a fundamental tool in modern diffusion models for text-guided generation. Although effective, CFG has notable drawbacks. For instance, DDIM with CFG lacks invertibility, complicating image editing; furthermore, high guidance scales, essential for high-quality outputs, frequently result in issues like mode collapse. Contrary to the widespread belief that these are inherent limitations of diffusion models, this paper reveals that the problems actually stem from the off-manifold phenomenon associated with CFG, rather than the diffusion models themselves. More specifically, inspired by the recent advancements of diffusion model-based inverse problem solvers (DIS), we reformulate text-guidance as an inverse problem with a text-conditioned score matching loss, and develop CFG++, a novel approach that tackles the off-manifold challenges inherent in traditional CFG. CFG++ features a surprisingly simple fix to CFG, yet it offers significant improvements, including better sample quality for text-to-image generation, invertibility, smaller guidance scales, reduced mode collapse, etc. Furthermore, CFG++ enables seamless interpolation between unconditional and conditional sampling at lower guidance scales, consistently outperforming traditional CFG at all scales. Experimental results confirm that our method significantly enhances performance in text-to-image generation, DDIM inversion, editing, and solving inverse problems, suggesting a wide-ranging impact and potential applications in various fields that utilize text guidance. Project Page: https://cfgpp-diffusion.github.io/.
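
The abstract does not spell out the fix itself. As a rough illustration, the Python sketch below contrasts a standard CFG-guided DDIM update with a CFG++-style variant. It assumes, which this abstract does not state, that the variant keeps the guided noise for the denoised estimate but renoises with the unconditional noise. All function and parameter names are hypothetical, and plain lists stand in for image tensors.

```python
import math

def guided_eps(eps_uncond, eps_cond, scale):
    """Blend unconditional and conditional noise predictions."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

def ddim_step(x_t, eps_uncond, eps_cond, abar_t, abar_prev, scale, plus_plus=False):
    """One deterministic DDIM update (hypothetical helper, not the authors' code).

    abar_t and abar_prev are the cumulative noise-schedule products at the
    current and next timestep.
    """
    eps_g = guided_eps(eps_uncond, eps_cond, scale)
    # Denoised estimate computed from the guided noise in both variants.
    x0 = [(x - math.sqrt(1.0 - abar_t) * e) / math.sqrt(abar_t)
          for x, e in zip(x_t, eps_g)]
    # Renoising: plain CFG reuses the guided noise; the CFG++-style variant
    # (an assumption here) falls back to the unconditional noise.
    eps_renoise = eps_uncond if plus_plus else eps_g
    return [math.sqrt(abar_prev) * x + math.sqrt(1.0 - abar_prev) * e
            for x, e in zip(x0, eps_renoise)]

# Toy usage with made-up numbers.
x_t = [0.5, -1.0]
eps_u, eps_c = [0.1, 0.2], [0.3, -0.4]
cfg_out = ddim_step(x_t, eps_u, eps_c, 0.5, 0.8, scale=0.7)
cfgpp_out = ddim_step(x_t, eps_u, eps_c, 0.5, 0.8, scale=0.7, plus_plus=True)
```

In this sketch the two variants coincide when the scale is 0 or when the two noise predictions agree, and differ only in the renoising term otherwise.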

LDMol: Text-Conditioned Molecule Diffusion Model Leveraging Chemically Informative Latent Space

May 28, 2024

Jinho Chang, Jong Chul Ye

Abstract:With the emergence of diffusion models as the frontline of generative models, many researchers have proposed molecule generation techniques using conditional diffusion models. However, due to the fundamental nature of a molecule, which carries highly entangled correlations within a small number of atoms and bonds, it becomes difficult for a model to connect raw data with the conditions when the conditions become as complex as natural language. To address this, here we present a novel latent diffusion model dubbed LDMol, which enables natural text-conditioned molecule generation. Specifically, LDMol is composed of three building blocks: a molecule encoder that produces a chemically informative feature space, a natural language-conditioned latent diffusion model using a Diffusion Transformer (DiT), and an autoregressive decoder for molecule re. In particular, recognizing that multiple SMILES notations can represent the same molecule, we employ a contrastive learning strategy to extract the chemically informative feature space. LDMol not only beats the existing baselines on the text-to-molecule generation benchmark but is also capable of zero-shot inference with unseen scenarios. Furthermore, we show that LDMol can be applied to downstream tasks such as molecule-to-text retrieval and text-driven molecule editing, demonstrating its versatility as a diffusion model.

MindFormer: A Transformer Architecture for Multi-Subject Brain Decoding via fMRI

May 28, 2024

Inhwa Han, Jaayeon Lee, Jong Chul Ye

Abstract:Research efforts to understand neural signals have been ongoing for many years, with visual decoding from fMRI signals attracting considerable attention. In particular, the advent of image diffusion models has significantly advanced the reconstruction of images from fMRI data. However, existing approaches often introduce inter- and intra-subject variations in the reconstructed images, which can compromise accuracy. To address current limitations in multi-subject brain decoding, we introduce a new Transformer architecture called MindFormer. This model is specifically designed to generate fMRI-conditioned feature vectors that can be used for conditioning a Stable Diffusion model. More specifically, MindFormer incorporates two key innovations: 1) a novel training strategy based on the IP-Adapter to extract semantically meaningful features from fMRI signals, and 2) a subject-specific token and linear layer that effectively capture individual differences in fMRI signals while synergistically combining multi-subject fMRI data for training. Our experimental results demonstrate that Stable Diffusion, when integrated with MindFormer, produces semantically consistent images across different subjects. This capability significantly surpasses existing models in multi-subject brain decoding. Such advancements not only improve the accuracy of our reconstructions but also deepen our understanding of neural processing variations among individuals.

Unified Editing of Panorama, 3D Scenes, and Videos Through Disentangled Self-Attention Injection

May 27, 2024

Gihyun Kwon, Jangho Park, Jong Chul Ye

Abstract:While text-to-image models have achieved impressive capabilities in image generation and editing, their application across various modalities often necessitates training separate models. Inspired by existing methods for single-image editing with self-attention injection and video editing with shared attention, we propose a novel unified editing framework that combines the strengths of both approaches by utilizing only a basic 2D text-to-image (T2I) diffusion model. Specifically, we design a sampling method that facilitates editing consecutive images while maintaining semantic consistency by utilizing shared self-attention features during both reference and consecutive image sampling processes. Experimental results confirm that our method enables editing across diverse modalities including 3D scenes, videos, and panorama images.

* Project Page: https://unifyediting.github.io/

Concept Weaver: Enabling Multi-Concept Fusion in Text-to-Image Models

Apr 05, 2024

Gihyun Kwon, Simon Jenni, Dingzeyu Li, Joon-Young Lee, Jong Chul Ye, Fabian Caba Heilbron

Abstract:While there has been significant progress in customizing text-to-image generation models, generating images that combine multiple personalized concepts remains challenging. In this work, we introduce Concept Weaver, a method for composing customized text-to-image diffusion models at inference time. Specifically, the method breaks the process into two steps: creating a template image aligned with the semantics of input prompts, and then personalizing the template using a concept fusion strategy. The fusion strategy incorporates the appearance of the target concepts into the template image while retaining its structural details. The results indicate that our method can generate multiple custom concepts with higher identity fidelity compared to alternative approaches. Furthermore, the method is shown to seamlessly handle more than two concepts and closely follow the semantic meaning of the input prompt without blending appearances across different subjects.

* CVPR 2024

Spectral Motion Alignment for Video Motion Transfer using Diffusion Models

Mar 22, 2024

Geon Yeong Park, Hyeonho Jeong, Sang Wan Lee, Jong Chul Ye

Abstract:The evolution of diffusion models has greatly impacted video generation and understanding. Particularly, text-to-video diffusion models (VDMs) have significantly facilitated the customization of input video with target appearance, motion, etc. Despite these advances, challenges persist in accurately distilling motion information from video frames. While existing works leverage the consecutive frame residual as the target motion vector, they inherently lack global motion context and are vulnerable to frame-wise distortions. To address this, we present Spectral Motion Alignment (SMA), a novel framework that refines and aligns motion vectors using Fourier and wavelet transforms. SMA learns motion patterns by incorporating frequency-domain regularization, facilitating the learning of whole-frame global motion dynamics, and mitigating spatial artifacts. Extensive experiments demonstrate SMA's efficacy in improving motion transfer while maintaining computational efficiency and compatibility across various video customization frameworks.
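
To make the two ingredients named in the abstract concrete, the Python sketch below computes consecutive-frame residuals and compares two residual sequences in the frequency domain with a naive discrete Fourier transform. It is only a toy illustration under stated assumptions: frames are flat lists of pixel values, the distance is a hypothetical stand-in for a frequency-domain regularizer, wavelets are omitted, and none of the names come from the SMA code.

```python
import cmath

def frame_residuals(frames):
    """Differences between consecutive frames, a crude motion proxy."""
    return [[b - a for a, b in zip(f0, f1)]
            for f0, f1 in zip(frames, frames[1:])]

def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a 1D sequence."""
    n = len(signal)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal))
            for k in range(n)]

def spectral_distance(res_a, res_b):
    """Mean absolute gap between DFT coefficients of two residual sequences."""
    total, count = 0.0, 0
    for ra, rb in zip(res_a, res_b):
        for ca, cb in zip(dft(ra), dft(rb)):
            total += abs(ca - cb)
            count += 1
    return total / count

# Toy usage: a bright pixel moving right, the same clip with a brightness
# offset, and a static clip.
moving = [[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
brighter = [[p + 5 for p in frame] for frame in moving]
static = [[0, 1, 0, 0]] * 3

same_motion = spectral_distance(frame_residuals(moving), frame_residuals(brighter))
different_motion = spectral_distance(frame_residuals(moving), frame_residuals(static))
```

In this toy setup a uniform brightness change leaves the residuals untouched, so only the change in motion registers in the distance.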

* Project page: https://geonyeong-park.github.io/spectral-motion-alignment/

OTSeg: Multi-prompt Sinkhorn Attention for Zero-Shot Semantic Segmentation

Mar 21, 2024

Kwanyoung Kim, Yujin Oh, Jong Chul Ye

Abstract:The recent success of CLIP has demonstrated promising results in zero-shot semantic segmentation by transferring multimodal knowledge to pixel-level classification. However, leveraging pre-trained CLIP knowledge to closely align text embeddings with pixel embeddings still has limitations in existing approaches. To address this issue, we propose OTSeg, a novel multimodal attention mechanism aimed at enhancing the potential of multiple text prompts for matching associated pixel embeddings. We first propose Multi-Prompts Sinkhorn (MPS) based on the Optimal Transport (OT) algorithm, which leads multiple text prompts to selectively focus on various semantic features within image pixels. Moreover, inspired by the success of Sinkformers in unimodal settings, we introduce the extension of MPS, called Multi-Prompts Sinkhorn Attention (MPSA), which effectively replaces cross-attention mechanisms within the Transformer framework in multimodal settings. Through extensive experiments, we demonstrate that OTSeg achieves state-of-the-art (SOTA) performance with significant gains on Zero-Shot Semantic Segmentation (ZS3) tasks across three benchmark datasets.
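
For readers unfamiliar with the Sinkhorn step behind MPS, the Python sketch below runs generic entropy-regularized optimal transport on a small cost matrix whose rows stand for text prompts and whose columns stand for pixels. It is a textbook Sinkhorn iteration, not the OTSeg implementation; the uniform marginals, the regularization value, and all names are assumptions made for illustration.

```python
import math

def sinkhorn(cost, eps=0.5, n_iters=200):
    """Entropy-regularized optimal transport via Sinkhorn scaling.

    cost[i][j] is the cost of matching prompt i to pixel j. Both marginals
    are assumed uniform. Returns the transport plan as a nested list.
    """
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    row_mass = [1.0 / n] * n
    col_mass = [1.0 / m] * m
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the target marginals.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m))
             for i in range(n)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy usage: 2 hypothetical prompts, 3 pixels, made-up costs.
toy_cost = [[0.1, 0.9, 0.5],
            [0.8, 0.2, 0.4]]
plan = sinkhorn(toy_cost)
```

Each prompt ends up carrying the same total mass while concentrating it on the pixels it matches most cheaply, which is the sense in which different prompts can attend to different regions.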

* 22 pages, 7 figures

Ground-A-Score: Scaling Up the Score Distillation for Multi-Attribute Editing

Mar 20, 2024

Hangeol Chang, Jinho Chang, Jong Chul Ye

Abstract:Despite recent advancements in text-to-image diffusion models facilitating various image editing techniques, complex text prompts often lead to an oversight of some requests due to a bottleneck in processing text information. To tackle this challenge, we present Ground-A-Score, a simple yet powerful model-agnostic image editing method by incorporating grounding during score distillation. This approach ensures a precise reflection of intricate prompt requirements in the editing outcomes, taking into account the prior knowledge of the object locations within the image. Moreover, the selective application with a new penalty coefficient and contrastive loss helps to precisely target editing areas while preserving the integrity of the objects in the source image. Both qualitative assessments and quantitative analyses confirm that Ground-A-Score successfully adheres to the intricate details of extended and multifaceted prompts, ensuring high-quality outcomes that respect the original image attributes.

Generalized Consistency Trajectory Models for Image Manipulation

Mar 19, 2024

Beomsu Kim, Jaemin Kim, Jeongsol Kim, Jong Chul Ye

Abstract:Diffusion-based generative models excel in unconditional generation, as well as on applied tasks such as image editing and restoration. The success of diffusion models lies in the iterative nature of diffusion: diffusion breaks down the complex process of mapping noise to data into a sequence of simple denoising tasks. Moreover, we are able to exert fine-grained control over the generation process by injecting guidance terms into each denoising step. However, the iterative process is also computationally intensive, often taking from tens up to thousands of function evaluations. Although consistency trajectory models (CTMs) enable traversal between any time points along the probability flow ODE (PFODE) and score inference with a single function evaluation, CTMs only allow translation from Gaussian noise to data. Thus, this work aims to unlock the full potential of CTMs by proposing generalized CTMs (GCTMs), which translate between arbitrary distributions via ODEs. We discuss the design space of GCTMs and demonstrate their efficacy in various image manipulation tasks such as image-to-image translation, restoration, and editing. Code: https://github.com/1202kbs/GCTM
