Shaochong Jia

An Intermediate Fusion ViT Enables Efficient Text-Image Alignment in Diffusion Models

Mar 25, 2024

Zizhao Hu, Shaochong Jia, Mohammad Rostami

Abstract: Diffusion models have been widely used for conditional cross-modal generation tasks such as text-to-image and text-to-video. However, state-of-the-art models still fail to align the generated visual concepts with high-level language semantics such as object count and spatial relationships. We approach this problem from a multimodal data fusion perspective and investigate how different fusion strategies affect vision-language alignment. We find that, compared to the widely used early fusion of conditioning text in a pretrained image feature space, a specially designed intermediate fusion can: (i) boost text-to-image alignment with improved generation quality and (ii) improve training and inference efficiency by reducing low-rank text-to-image attention calculations. We perform experiments on a text-to-image generation task using the MS-COCO dataset. We compare our intermediate fusion mechanism with the classic early fusion mechanism on two common conditioning methods on a U-shaped ViT backbone. Our intermediate fusion model achieves a higher CLIP Score and lower FID, with 20% fewer FLOPs and 50% faster training, compared to a strong U-ViT baseline with early fusion.

Efficient Multimodal Diffusion Models Using Joint Data Infilling with Partially Shared U-Net

Nov 28, 2023

Zizhao Hu, Shaochong Jia, Mohammad Rostami

Abstract: Recently, diffusion models have been used successfully to fit distributions for cross-modal data translation and multimodal data generation. However, these methods rely on extensive scaling, overlooking the inefficiency and interference between modalities. We develop the Partially Shared U-Net (PS-U-Net) architecture, an efficient multimodal diffusion model that allows text and image inputs to pass through dedicated layers and skip-connections to preserve modality-specific fine-grained details. Inspired by image inpainting, we also propose a new efficient multimodal sampling method that introduces new scenarios for conditional generation while requiring only a simple joint distribution to be learned. Our experiments on the MS-COCO dataset demonstrate that our method generates multimodal text and image data of higher quality than existing multimodal diffusion models, while having a comparable size, faster training, faster multimodal sampling, and more flexible generation.
