Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jong Chul Ye

KAIST Graduate School of AI

Fast Diffusion Sampler for Inverse Problems by Geometric Decomposition

Mar 10, 2023

Hyungjin Chung, Suhyeon Lee, Jong Chul Ye

Figure 1 for Fast Diffusion Sampler for Inverse Problems by Geometric Decomposition

Figure 2 for Fast Diffusion Sampler for Inverse Problems by Geometric Decomposition

Figure 3 for Fast Diffusion Sampler for Inverse Problems by Geometric Decomposition

Figure 4 for Fast Diffusion Sampler for Inverse Problems by Geometric Decomposition

Abstract:Diffusion models have shown exceptional performance in solving inverse problems. However, one major limitation is the slow inference time. While faster diffusion samplers have been developed for unconditional sampling, there has been limited research on conditional sampling in the context of inverse problems. In this study, we propose a novel and efficient diffusion sampling strategy that employs the geometric decomposition of diffusion sampling. Specifically, we discover that the samples generated from diffusion models can be decomposed into two orthogonal components: a ``denoised" component obtained by projecting the sample onto the clean data manifold, and a ``noise" component that induces a transition to the next lower-level noisy manifold with the addition of stochastic noise. Furthermore, we prove that, under some conditions on the clean data manifold, the conjugate gradient update for imposing conditioning from the denoised signal belongs to the clean manifold, resulting in a much faster and more accurate diffusion sampling. Our method is applicable regardless of the parameterization and setting (i.e., VE, VP). Notably, we achieve state-of-the-art reconstruction quality on challenging real-world medical inverse imaging problems, including multi-coil MRI reconstruction and 3D CT reconstruction. Moreover, our proposed method achieves more than 80 times faster inference time than the previous state-of-the-art method.

* 21 pages

Via

Access Paper or Ask Questions

Improving Medical Speech-to-Text Accuracy with Vision-Language Pre-training Model

Feb 27, 2023

Jaeyoung Huh, Sangjoon Park, Jeong Eun Lee, Jong Chul Ye

Figure 1 for Improving Medical Speech-to-Text Accuracy with Vision-Language Pre-training Model

Figure 2 for Improving Medical Speech-to-Text Accuracy with Vision-Language Pre-training Model

Figure 3 for Improving Medical Speech-to-Text Accuracy with Vision-Language Pre-training Model

Figure 4 for Improving Medical Speech-to-Text Accuracy with Vision-Language Pre-training Model

Abstract:Automatic Speech Recognition (ASR) is a technology that converts spoken words into text, facilitating interaction between humans and machines. One of the most common applications of ASR is Speech-To-Text (STT) technology, which simplifies user workflows by transcribing spoken words into text. In the medical field, STT has the potential to significantly reduce the workload of clinicians who rely on typists to transcribe their voice recordings. However, developing an STT model for the medical domain is challenging due to the lack of sufficient speech and text datasets. To address this issue, we propose a medical-domain text correction method that modifies the output text of a general STT system using the Vision Language Pre-training (VLP) method. VLP combines textual and visual information to correct text based on image knowledge. Our extensive experiments demonstrate that the proposed method offers quantitatively and clinically significant improvements in STT performance in the medical field. We further show that multi-modal understanding of image and text information outperforms single-modal understanding using only text information.

Via

Access Paper or Ask Questions

Zero-shot Generation of Coherent Storybook from Plain Text Story using Diffusion Models

Feb 08, 2023

Hyeonho Jeong, Gihyun Kwon, Jong Chul Ye

Abstract:Recent advancements in large scale text-to-image models have opened new possibilities for guiding the creation of images through human-devised natural language. However, while prior literature has primarily focused on the generation of individual images, it is essential to consider the capability of these models to ensure coherency within a sequence of images to fulfill the demands of real-world applications such as storytelling. To address this, here we present a novel neural pipeline for generating a coherent storybook from the plain text of a story. Specifically, we leverage a combination of a pre-trained Large Language Model and a text-guided Latent Diffusion Model to generate coherent images. While previous story synthesis frameworks typically require a large-scale text-to-image model trained on expensive image-caption pairs to maintain the coherency, we employ simple textual inversion techniques along with detector-based semantic image editing which allows zero-shot generation of the coherent storybook. Experimental results show that our proposed method outperforms state-of-the-art image editing baselines.

Via

Access Paper or Ask Questions

Minimizing Trajectory Curvature of ODE-based Generative Models

Feb 05, 2023

Sangyun Lee, Beomsu Kim, Jong Chul Ye

Abstract:Recent ODE/SDE-based generative models, such as diffusion models, rectified flows, and flow matching, define a generative process as a time reversal of a fixed forward process. Even though these models show impressive performance on large-scale datasets, numerical simulation requires multiple evaluations of a neural network, leading to a slow sampling speed. We attribute the reason to the high curvature of the learned generative trajectories, as it is directly related to the truncation error of a numerical solver. Based on the relationship between the forward process and the curvature, here we present an efficient method of training the forward process to minimize the curvature of generative trajectories without any ODE/SDE simulation. Experiments show that our method achieves a lower curvature than previous models and, therefore, decreased sampling costs while maintaining competitive performance. Code is available at https://github.com/sangyun884/fast-ode.

Via

Access Paper or Ask Questions

Don't Play Favorites: Minority Guidance for Diffusion Models

Jan 29, 2023

Soobin Um, Jong Chul Ye

Figure 1 for Don't Play Favorites: Minority Guidance for Diffusion Models

Figure 2 for Don't Play Favorites: Minority Guidance for Diffusion Models

Figure 3 for Don't Play Favorites: Minority Guidance for Diffusion Models

Figure 4 for Don't Play Favorites: Minority Guidance for Diffusion Models

Abstract:We explore the problem of generating minority samples using diffusion models. The minority samples are instances that lie on low-density regions of a data manifold. Generating sufficient numbers of such minority instances is important, since they often contain some unique attributes of the data. However, the conventional generation process of the diffusion models mostly yields majority samples (that lie on high-density regions of the manifold) due to their high likelihoods, making themselves highly ineffective and time-consuming for the task. In this work, we present a novel framework that can make the generation process of the diffusion models focus on the minority samples. We first provide a new insight on the majority-focused nature of the diffusion models: they denoise in favor of the majority samples. The observation motivates us to introduce a metric that describes the uniqueness of a given sample. To address the inherent preference of the diffusion models w.r.t. the majority samples, we further develop minority guidance, a sampling technique that can guide the generation process toward regions with desired likelihood levels. Experiments on benchmark real datasets demonstrate that our minority guidance can greatly improve the capability of generating the low-likelihood minority samples over existing generative frameworks including the standard diffusion sampler.

Via

Access Paper or Ask Questions

ZegOT: Zero-shot Segmentation Through Optimal Transport of Text Prompts

Jan 28, 2023

Kwanyoung Kim, Yujin Oh, Jong Chul Ye

Figure 1 for ZegOT: Zero-shot Segmentation Through Optimal Transport of Text Prompts

Figure 2 for ZegOT: Zero-shot Segmentation Through Optimal Transport of Text Prompts

Figure 3 for ZegOT: Zero-shot Segmentation Through Optimal Transport of Text Prompts

Figure 4 for ZegOT: Zero-shot Segmentation Through Optimal Transport of Text Prompts

Abstract:Recent success of large-scale Contrastive Language-Image Pre-training (CLIP) has led to great promise in zero-shot semantic segmentation by transferring image-text aligned knowledge to pixel-level classification. However, existing methods usually require an additional image encoder or retraining/tuning the CLIP module. Here, we present a cost-effective strategy using text-prompt learning that keeps the entire CLIP module frozen while fully leveraging its rich information. Specifically, we propose a novel Zero-shot segmentation with Optimal Transport (ZegOT) method that matches multiple text prompts with frozen image embeddings through optimal transport, which allows each text prompt to efficiently focus on specific semantic attributes. Additionally, we propose Deep Local Feature Alignment (DLFA) that deeply aligns the text prompts with intermediate local feature of the frozen image encoder layers, which significantly boosts the zero-shot segmentation performance. Through extensive experiments on benchmark datasets, we show that our method achieves the state-of-the-art (SOTA) performance with only x7 lighter parameters compared to previous SOTA approaches.

* 16pages, 9 figures

Via

Access Paper or Ask Questions

Annealed Score-Based Diffusion Model for MR Motion Artifact Reduction

Jan 08, 2023

Gyutaek Oh, Jeong Eun Lee, Jong Chul Ye

Abstract:Motion artifact reduction is one of the important research topics in MR imaging, as the motion artifact degrades image quality and makes diagnosis difficult. Recently, many deep learning approaches have been studied for motion artifact reduction. Unfortunately, most existing models are trained in a supervised manner, requiring paired motion-corrupted and motion-free images, or are based on a strict motion-corruption model, which limits their use for real-world situations. To address this issue, here we present an annealed score-based diffusion model for MRI motion artifact reduction. Specifically, we train a score-based model using only motion-free images, and then motion artifacts are removed by applying forward and reverse diffusion processes repeatedly to gradually impose a low-frequency data consistency. Experimental results verify that the proposed method successfully reduces both simulated and in vivo motion artifacts, outperforming the state-of-the-art deep learning methods.

Via

Access Paper or Ask Questions

MS-DINO: Efficient Distributed Training of Vision Transformer Foundation Model in Medical Domain through Masked Sampling

Jan 05, 2023

Sangjoon Park, Ik-Jae Lee, Jun Won Kim, Jong Chul Ye

Figure 1 for MS-DINO: Efficient Distributed Training of Vision Transformer Foundation Model in Medical Domain through Masked Sampling

Figure 2 for MS-DINO: Efficient Distributed Training of Vision Transformer Foundation Model in Medical Domain through Masked Sampling

Figure 3 for MS-DINO: Efficient Distributed Training of Vision Transformer Foundation Model in Medical Domain through Masked Sampling

Figure 4 for MS-DINO: Efficient Distributed Training of Vision Transformer Foundation Model in Medical Domain through Masked Sampling

Abstract:In spite of the recent success of deep learning in the medical domain, the problem of data scarcity in the medical domain gets aggravated due to privacy and data ownership issues. Distributed learning approaches including federated learning have been studied to alleviate the problems, but they suffer from cumbersome communication overheads and weakness in privacy protection. To address this, here we propose a self-supervised masked sampling distillation method for vision transformer that can be performed without continuous communication but still enhance privacy using a vision transformer-specific encryption method. The effectiveness of our method is demonstrated with extensive experiments on two medical domain data and two different downstream tasks, showing superior performances than those obtained with the existing distributed learning strategy as well as the fine-tuning only baseline. As the self-supervised model built with the proposed method is capable of having a general semantic understanding of the modality, we demonstrate its potential as a task-agnostic foundation model for various medical tasks, widening the applicability in the medical domain.

Via

Access Paper or Ask Questions

Solving 3D Inverse Problems using Pre-trained 2D Diffusion Models

Nov 19, 2022

Hyungjin Chung, Dohoon Ryu, Michael T. McCann, Marc L. Klasky, Jong Chul Ye

Figure 1 for Solving 3D Inverse Problems using Pre-trained 2D Diffusion Models

Figure 2 for Solving 3D Inverse Problems using Pre-trained 2D Diffusion Models

Figure 3 for Solving 3D Inverse Problems using Pre-trained 2D Diffusion Models

Figure 4 for Solving 3D Inverse Problems using Pre-trained 2D Diffusion Models

Abstract:Diffusion models have emerged as the new state-of-the-art generative model with high quality samples, with intriguing properties such as mode coverage and high flexibility. They have also been shown to be effective inverse problem solvers, acting as the prior of the distribution, while the information of the forward model can be granted at the sampling stage. Nonetheless, as the generative process remains in the same high dimensional (i.e. identical to data dimension) space, the models have not been extended to 3D inverse problems due to the extremely high memory and computational cost. In this paper, we combine the ideas from the conventional model-based iterative reconstruction with the modern diffusion models, which leads to a highly effective method for solving 3D medical image reconstruction tasks such as sparse-view tomography, limited angle tomography, compressed sensing MRI from pre-trained 2D diffusion models. In essence, we propose to augment the 2D diffusion prior with a model-based prior in the remaining direction at test time, such that one can achieve coherent reconstructions across all dimensions. Our method can be run in a single commodity GPU, and establishes the new state-of-the-art, showing that the proposed method can perform reconstructions of high fidelity and accuracy even in the most extreme cases (e.g. 2-view 3D tomography). We further reveal that the generalization capacity of the proposed method is surprisingly high, and can be used to reconstruct volumes that are entirely different from the training dataset.

* 14 pages, 10 figures

Via

Access Paper or Ask Questions

Molecular Structure-Property Co-Trained Foundation Model for In Silico Chemistry

Nov 19, 2022

Jinho Chang, Jong Chul Ye

Abstract:Recently, deep learning approaches have been extensively studied for various problems in chemistry, such as virtual screening, de novo molecule design, etc. Despite the impressive successes, end-to-end training for specific tasks usually requires separately designed networks, so it's often difficult to acquire a unified principle to synergistically combine existing architectures and training datasets for novel tasks. To address this, inspired by recent advances of pre-trained multi-modal foundation models such as Vision-Language Pretrained models (VLP), here we present a novel multimodal foundation model that can be used {\em in silico} for various downstream tasks in chemistry. Specifically, our framework, dubbed as the structure-property multi-modal (SPMM) foundation model, is based on the dual-stream transformer with X-shape attention, so that it can align the molecule structure and the chemical properties in a common embedding space. Accordingly, SPMM can simultaneously perform chemical property prediction from given structure-describing strings and allows the generation of molecular structures for given chemical properties, which was previously not possible with a single architecture. Furthermore, we show that the outstanding unimodal representation of a molecule emerges from multimodal learning, which has the potential to be fine-tuned for many other downstream tasks.

Via

Access Paper or Ask Questions