Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Kwanyoung Lee

Adaptive Auxiliary Prompt Blending for Target-Faithful Diffusion Generation

Mar 19, 2026

Kwanyoung Lee, SeungJu Cha, Yebin Ahn, Hyunwoo Oh, Sungho Koh, Dong-Jin Kim

Abstract:Diffusion-based text-to-image (T2I) models have made remarkable progress in generating photorealistic and semantically rich images. However, when the target concepts lie in low-density regions of the training distribution, these models often produce semantically misaligned or structurally inconsistent results. This limitation arises from the long-tailed nature of text-image datasets, where rare concepts or editing instructions are underrepresented. To address this, we introduce Adaptive Auxiliary Prompt Blending (AAPB) - a unified framework that stabilizes the diffusion process in low-density regions. AAPB leverages auxiliary anchor prompts to provide semantic support in rare concept generation and structural support in image editing, ensuring faithful guidance toward the target prompt. Unlike prior heuristic prompt alternation methods, AAPB derives a closed-form adaptive coefficient that optimally balances the influence between the auxiliary anchor and the target prompt at each diffusion step. Grounded in Tweedie's identity, our formulation provides a principled and training-free framework for adaptive prompt blending, ensuring stable and target-faithful generation. We demonstrate the effectiveness of adaptive interpolation over fixed interpolation through controlled experiments and empirically show consistent improvements on the RareBench and FlowEdit datasets, achieving superior semantic accuracy and structural fidelity compared to prior training-free baselines.

* Accepted in CVPR 2026 (main track). 10 pages, 6 figures; supplementary material included (14 pages, 11 figures)

Via

Access Paper or Ask Questions

ADAPT: Attention Driven Adaptive Prompt Scheduling and InTerpolating Orthogonal Complements for Rare Concepts Generation

Mar 19, 2026

Kwanyoung Lee, Hyunwoo Oh, SeungJu Cha, Sungho Koh, Dong-Jin Kim

Abstract:Generating rare compositional concepts in text-to-image synthesis remains a challenge for diffusion models, particularly for attributes that are uncommon in the training data. While recent approaches, such as R2F, address this challenge by utilizing LLM for prompt scheduling, they suffer from inherent variance due to the randomness of language models and suboptimal guidance from iterative text embedding switching. To address these problems, we propose the ADAPT framework, a training-free framework that deterministically plans and semantically aligns prompt schedules, providing consistent guidance to enhance the composition of rare concepts. By leveraging attention scores and orthogonal components, ADAPT significantly enhances compositional generation of rare concepts in the RareBench benchmark without additional training or fine-tuning. Through comprehensive experiments, we demonstrate that ADAPT achieves superior performance in RareBench and accurately reflects the semantic information of rare attributes, providing deterministic and precise control over the generation of rare compositions without compromising visual integrity.

* Accepted in CVPR 2026 (findings). 10 pages, 4 figures; supplementary material included (8 pages, 10 figures)

Via

Access Paper or Ask Questions

ScaleDiff: Higher-Resolution Image Synthesis via Efficient and Model-Agnostic Diffusion

Oct 29, 2025

Sungho Koh, SeungJu Cha, Hyunwoo Oh, Kwanyoung Lee, Dong-Jin Kim

Figure 1 for ScaleDiff: Higher-Resolution Image Synthesis via Efficient and Model-Agnostic Diffusion

Figure 2 for ScaleDiff: Higher-Resolution Image Synthesis via Efficient and Model-Agnostic Diffusion

Figure 3 for ScaleDiff: Higher-Resolution Image Synthesis via Efficient and Model-Agnostic Diffusion

Figure 4 for ScaleDiff: Higher-Resolution Image Synthesis via Efficient and Model-Agnostic Diffusion

Abstract:Text-to-image diffusion models often exhibit degraded performance when generating images beyond their training resolution. Recent training-free methods can mitigate this limitation, but they often require substantial computation or are incompatible with recent Diffusion Transformer models. In this paper, we propose ScaleDiff, a model-agnostic and highly efficient framework for extending the resolution of pretrained diffusion models without any additional training. A core component of our framework is Neighborhood Patch Attention (NPA), an efficient mechanism that reduces computational redundancy in the self-attention layer with non-overlapping patches. We integrate NPA into an SDEdit pipeline and introduce Latent Frequency Mixing (LFM) to better generate fine details. Furthermore, we apply Structure Guidance to enhance global structure during the denoising process. Experimental results demonstrate that ScaleDiff achieves state-of-the-art performance among training-free methods in terms of both image quality and inference speed on both U-Net and Diffusion Transformer architectures.

* NeurIPS 2025. Code: https://github.com/KSH00906/ScaleDiff

Via

Access Paper or Ask Questions

VerbDiff: Text-Only Diffusion Models with Enhanced Interaction Awareness

Mar 20, 2025

SeungJu Cha, Kwanyoung Lee, Ye-Chan Kim, Hyunwoo Oh, Dong-Jin Kim

Figure 1 for VerbDiff: Text-Only Diffusion Models with Enhanced Interaction Awareness

Figure 2 for VerbDiff: Text-Only Diffusion Models with Enhanced Interaction Awareness

Figure 3 for VerbDiff: Text-Only Diffusion Models with Enhanced Interaction Awareness

Figure 4 for VerbDiff: Text-Only Diffusion Models with Enhanced Interaction Awareness

Abstract:Recent large-scale text-to-image diffusion models generate photorealistic images but often struggle to accurately depict interactions between humans and objects due to their limited ability to differentiate various interaction words. In this work, we propose VerbDiff to address the challenge of capturing nuanced interactions within text-to-image diffusion models. VerbDiff is a novel text-to-image generation model that weakens the bias between interaction words and objects, enhancing the understanding of interactions. Specifically, we disentangle various interaction words from frequency-based anchor words and leverage localized interaction regions from generated images to help the model better capture semantics in distinctive words without extra conditions. Our approach enables the model to accurately understand the intended interaction between humans and objects, producing high-quality images with accurate interactions aligned with specified verbs. Extensive experiments on the HICO-DET dataset demonstrate the effectiveness of our method compared to previous approaches.

* Accepted at CVPR 2025, code : https://github.com/SeungJuCha/VerbDiff.git

Via

Access Paper or Ask Questions