Abstract:Text-to-image diffusion models excel at generating single-instance scenes but struggle with multi-instance scenarios, often merging or omitting objects. Unlike previous training-free approaches that rely solely on semantic-level guidance without addressing instance individuation, our training-free method, Instance-to-Semantic Attention Control (ISAC), explicitly resolves incomplete instance formation and semantic entanglement through an instance-first modeling approach. This enables ISAC to effectively leverage a hierarchical, tree-structured prompt mechanism, disentangling multiple object instances and individually aligning them with their corresponding semantic labels. Without employing any external models, ISAC achieves up to 52% average multi-class accuracy and 83% average multi-instance accuracy by effectively forming disentangled instances. The code will be made available upon publication.
Abstract:Despite recent advances in diffusion models, achieving reliable image generation and editing remains challenging due to the inherent diversity induced by stochastic noise in the sampling process. Instruction-guided image editing with diffusion models offers user-friendly capabilities, yet editing failures, such as background distortion, frequently occur. Users often resort to trial and error, adjusting seeds or prompts to achieve satisfactory results, which is inefficient. While seed selection methods exist for Text-to-Image (T2I) generation, they depend on external verifiers, limiting applicability, and evaluating multiple seeds increases computational complexity. To address this, we first establish a multiple-seed-based image editing baseline using background consistency scores, achieving Best-of-N performance without supervision. Building on this, we introduce ELECT (Early-timestep Latent Evaluation for Candidate Selection), a zero-shot framework that selects reliable seeds by estimating background mismatches at early diffusion timesteps, identifying the seed that retains the background while modifying only the foreground. ELECT ranks seed candidates by a background inconsistency score, filtering unsuitable samples early based on background consistency while preserving editability. Beyond standalone seed selection, ELECT integrates into instruction-guided editing pipelines and extends to Multimodal Large-Language Models (MLLMs) for joint seed and prompt selection, further improving results when seed selection alone is insufficient. Experiments show that ELECT reduces computational costs (by 41 percent on average and up to 61 percent) while improving background consistency and instruction adherence, achieving around 40 percent success rates in previously failed cases - without any external supervision or training.
Abstract:Achieving precise panoptic segmentation relies on pixel-wise instance annotations, but obtaining such datasets is costly. Unsupervised instance segmentation (UIS) eliminates annotation requirements but struggles with adjacent instance merging and single-instance fragmentation, largely due to the limitations of DINO-based backbones which lack strong instance separation cues. Weakly-supervised panoptic segmentation (WPS) reduces annotation costs using sparse labels (e.g., points, boxes), yet these annotations remain expensive and introduce human bias and boundary errors. To address these challenges, we propose DiffEGG (Diffusion-Driven EdGe Generation), a fully annotation-free method that extracts instance-aware features from pretrained diffusion models to generate precise instance edge maps. Unlike DINO-based UIS methods, diffusion models inherently capture fine-grained, instance-aware features, enabling more precise boundary delineation. For WPS, DiffEGG eliminates annotation costs and human bias by operating without any form of manual supervision, addressing the key limitations of prior best methods. Additionally, we introduce RIP, a post-processing technique that fuses DiffEGG's edge maps with segmentation masks in a task-agnostic manner. RIP allows DiffEGG to be seamlessly integrated into various segmentation frameworks. When applied to UIS, DiffEGG and RIP achieve an average $+4.4\text{ AP}$ improvement over prior best UIS methods. When combined with weakly-supervised semantic segmentation (WSS), DiffEGG enables WPS without instance annotations, outperforming prior best point-supervised WPS methods by $+1.7\text{ PQ}$. These results demonstrate that DiffEGG's edge maps serve as a cost-effective, annotation-free alternative to instance annotations, significantly improving segmentation without human intervention. Code is available at https://github.com/shjo-april/DiffEGG.
Abstract:Linker generation is critical in drug discovery applications such as lead optimization and PROTAC design, where molecular fragments are assembled into diverse drug candidates. Existing methods fall into PC-Free and PC-Aware categories based on their use of 3D point clouds (PC). PC-Free models prioritize diversity but suffer from lower validity due to overlooking PC constraints, while PC-Aware models ensure higher validity but restrict diversity by enforcing strict PC constraints. To overcome these trade-offs without additional training, we propose HybridLinker, a framework that enhances PC-Aware inference by providing diverse bonding topologies from a pretrained PC-Free model as guidance. At its core, we propose LinkerDPS, the first diffusion posterior sampling (DPS) method operating across PC-Free and PC-Aware spaces, bridging molecular topology with 3D point clouds via an energy-inspired function. By transferring the diverse sampling distribution of PC-Free models into the PC-Aware distribution, HybridLinker significantly and consistently surpasses baselines, improving both validity and diversity in foundational molecular design and applied property optimization tasks, establishing a new DPS framework in the molecular and graph domains beyond imaging.