Mingrui Zhu

Beyond Quality: Unlocking Diversity in Ad Headline Generation with Large Language Models

Aug 26, 2025

Chang Wang, Siyu Yan, Depeng Yuan, Yuqi Chen, Yanhua Huang, Yuanhang Zheng, Shuhao Li, Yinqi Zhang, Kedi Chen, Mingrui Zhu(+1 more)

Abstract: The generation of ad headlines plays a vital role in modern advertising, where both quality and diversity are essential to engage a broad range of audience segments. Current approaches primarily optimize language models for headline quality or click-through rates (CTR), often overlooking the need for diversity and resulting in homogeneous outputs. To address this limitation, we propose DIVER, a novel framework based on large language models (LLMs) that are jointly optimized for both diversity and quality. We first design a semantic- and stylistic-aware data generation pipeline that automatically produces high-quality training pairs with ad content and multiple diverse headlines. To achieve the goal of generating high-quality and diversified ad headlines within a single forward pass, we propose a multi-stage multi-objective optimization framework with supervised fine-tuning (SFT) and reinforcement learning (RL). Experiments on real-world industrial datasets demonstrate that DIVER effectively balances quality and diversity. Deployed on a large-scale content-sharing platform serving hundreds of millions of users, our framework improves advertiser value (ADVV) and CTR by 4.0% and 1.4%, respectively.

Effective Diffusion Transformer Architecture for Image Super-Resolution

Sep 29, 2024

Kun Cheng, Lei Yu, Zhijun Tu, Xiao He, Liyu Chen, Yong Guo, Mingrui Zhu, Nannan Wang, Xinbo Gao, Jie Hu

Abstract: Recent advances indicate that diffusion models hold great promise in image super-resolution. While the latest methods are primarily based on latent diffusion models with convolutional neural networks, there are few attempts to explore transformers, which have demonstrated remarkable performance in image generation. In this work, we design an effective diffusion transformer for image super-resolution (DiT-SR) that achieves the visual quality of prior-based methods while being trained from scratch. In practice, DiT-SR leverages an overall U-shaped architecture, and adopts a uniform isotropic design for all the transformer blocks across different stages. The former facilitates multi-scale hierarchical feature extraction, while the latter reallocates the computational resources to critical layers to further enhance performance. Moreover, we thoroughly analyze the limitation of the widely used AdaLN, and present a frequency-adaptive time-step conditioning module, enhancing the model's capacity to process distinct frequency information at different time steps. Extensive experiments demonstrate that DiT-SR outperforms the existing training-from-scratch diffusion-based SR methods significantly, and even beats some of the prior-based methods on pretrained Stable Diffusion, proving the superiority of the diffusion transformer in image super-resolution.

* Code is available at https://github.com/kunncheng/DiT-SR
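
For readers unfamiliar with the AdaLN conditioning that the abstract refers to, the following is a minimal sketch of plain adaptive layer normalization, assuming the common formulation in which a per-channel scale and shift modulate a normalized feature vector. The function names and the hard-coded `scale` and `shift` values are illustrative assumptions; in practice they would be predicted from a time-step embedding, and the frequency-adaptive module proposed in the paper is not reproduced here.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def adaln(x, scale, shift):
    """Plain AdaLN: modulate the normalized features with a per-channel scale and shift.

    Assumption: scale and shift stand in for values a small network would
    predict from the time-step embedding.
    """
    return [n * (1.0 + s) + b for n, s, b in zip(layer_norm(x), scale, shift)]


x = [0.5, -1.0, 2.0, 0.0]
# Zero scale and shift leave the normalized features unchanged.
identity = adaln(x, scale=[0.0] * 4, shift=[0.0] * 4)
# A non-zero condition rescales and offsets every channel.
modulated = adaln(x, scale=[1.0] * 4, shift=[0.5] * 4)
```

Because the same scale and shift are applied at every spatial position, this form of conditioning has no direct handle on spatial frequency content, which may be the kind of limitation the abstract alludes to.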

Foodfusion: A Novel Approach for Food Image Composition via Diffusion Models

Aug 26, 2024

Chaohua Shi, Xuan Wang, Si Shi, Xule Wang, Mingrui Zhu, Nannan Wang, Xinbo Gao

Abstract: Food image composition requires the use of existing dish images and background images to synthesize a natural new image. Diffusion models have made significant advancements in image generation, enabling the construction of end-to-end architectures that yield promising results. However, existing diffusion models face challenges in processing and fusing information from multiple images and lack access to high-quality publicly available datasets, which prevents the application of diffusion models in food image composition. In this paper, we introduce a large-scale, high-quality food image composite dataset, FC22k, which comprises 22,000 image triplets, each consisting of a foreground, a background, and a ground truth. Additionally, we propose a novel food image composition method, Foodfusion, which leverages the capabilities of the pre-trained diffusion models and incorporates a Fusion Module for processing and integrating foreground and background information. This fused information aligns the foreground features with the background structure by merging the global structural information at the cross-attention layer of the denoising UNet. To further enhance the content and structure of the background, we also integrate a Content-Structure Control Module. Extensive experiments demonstrate the effectiveness and scalability of our proposed method.

* 14 pages

One Step Diffusion-based Super-Resolution with Time-Aware Distillation

Aug 14, 2024

Xiao He, Huaao Tang, Zhijun Tu, Junchao Zhang, Kun Cheng, Hanting Chen, Yong Guo, Mingrui Zhu, Nannan Wang, Xinbo Gao(+1 more)

Abstract: Diffusion-based image super-resolution (SR) methods have shown promise in reconstructing high-resolution images with fine details from low-resolution counterparts. However, these approaches typically require tens or even hundreds of iterative samplings, resulting in significant latency. Recently, techniques have been devised to enhance the sampling efficiency of diffusion-based SR models via knowledge distillation. Nonetheless, when aligning the knowledge of student and teacher models, these solutions either solely rely on pixel-level loss constraints or neglect the fact that diffusion models prioritize varying levels of information at different time steps. To accomplish effective and efficient image super-resolution, we propose a time-aware diffusion distillation method, named TAD-SR. Specifically, we introduce a novel score distillation strategy to align the data distribution between the outputs of the student and teacher models after minor noise perturbation. This distillation strategy enables the student network to concentrate more on the high-frequency details. Furthermore, to mitigate performance limitations stemming from distillation, we integrate a latent adversarial loss and devise a time-aware discriminator that leverages diffusion priors to effectively distinguish between real images and generated images. Extensive experiments conducted on synthetic and real-world datasets demonstrate that the proposed method achieves comparable or even superior performance compared to both previous state-of-the-art (SOTA) methods and the teacher model in just one sampling step. Codes are available at https://github.com/LearningHx/TAD-SR.

* 18 pages

InstructBrush: Learning Attention-based Instruction Optimization for Image Editing

Mar 27, 2024

Ruoyu Zhao, Qingnan Fan, Fei Kou, Shuai Qin, Hong Gu, Wei Wu, Pengcheng Xu, Mingrui Zhu, Nannan Wang, Xinbo Gao

Abstract: In recent years, instruction-based image editing methods have garnered significant attention in image editing. However, despite encompassing a wide range of editing priors, these methods struggle with editing tasks that are challenging to describe accurately through language. We propose InstructBrush, an inversion method for instruction-based image editing methods to bridge this gap. It extracts editing effects from exemplar image pairs as editing instructions, which are further applied for image editing. Two key techniques are introduced into InstructBrush, Attention-based Instruction Optimization and Transformation-oriented Instruction Initialization, to address the limitations of the previous method in terms of inversion effects and instruction generalization. To explore the ability of instruction inversion methods to guide image editing in open scenarios, we establish a Transformation-Oriented Paired Benchmark (TOP-Bench), which contains a rich set of scenes and editing types. The creation of this benchmark paves the way for further exploration of instruction inversion. Quantitatively and qualitatively, our approach achieves superior performance in editing and is more semantically consistent with the target editing effects.

* Project Page: https://royzhao926.github.io/InstructBrush/

Bridging Generative and Discriminative Models for Unified Visual Perception with Diffusion Priors

Jan 29, 2024

Shiyin Dong, Mingrui Zhu, Kun Cheng, Nannan Wang, Xinbo Gao

Abstract: The remarkable prowess of diffusion models in image generation has spurred efforts to extend their application beyond generative tasks. However, a persistent challenge is the lack of a unified approach for applying diffusion models to visual perception tasks with diverse semantic granularity requirements. Our purpose is to establish a unified visual perception framework, capitalizing on the potential synergies between generative and discriminative models. In this paper, we propose Vermouth, a simple yet effective framework comprising a pre-trained Stable Diffusion (SD) model containing rich generative priors, a unified head (U-head) capable of integrating hierarchical representations, and an adapted expert providing discriminative priors. Comprehensive investigations unveil potential characteristics of Vermouth, such as varying granularity of perception concealed in latent variables at distinct time steps and various U-net stages. We emphasize that there is no necessity for incorporating a heavyweight or intricate decoder to transform diffusion models into potent representation learners. Extensive comparative evaluations against tailored discriminative models showcase the efficacy of our approach on zero-shot sketch-based image retrieval (ZS-SBIR), few-shot classification, and open-vocabulary semantic segmentation tasks. The promising results demonstrate the potential of diffusion models as formidable learners, establishing their significance in furnishing informative and robust visual representations.

* 18 pages, 11 figures

CatVersion: Concatenating Embeddings for Diffusion-Based Text-to-Image Personalization

Nov 30, 2023

Ruoyu Zhao, Mingrui Zhu, Shiyin Dong, Nannan Wang, Xinbo Gao

Abstract: We propose CatVersion, an inversion-based method that learns the personalized concept through a handful of examples. Subsequently, users can utilize text prompts to generate images that embody the personalized concept, thereby achieving text-to-image personalization. In contrast to existing approaches that emphasize word embedding learning or parameter fine-tuning for the diffusion model, which potentially causes concept dilution or overfitting, our method concatenates embeddings on the feature-dense space of the text encoder in the diffusion model to learn the gap between the personalized concept and its base class, aiming to maximize the preservation of prior knowledge in diffusion models while restoring the personalized concepts. To this end, we first dissect the text encoder's integration in the image generation process to identify the feature-dense space of the encoder. Afterward, we concatenate embeddings on the Keys and Values in this space to learn the gap between the personalized concept and its base class. In this way, the concatenated embeddings ultimately manifest as a residual on the original attention output. To more accurately and unbiasedly quantify the results of personalized image generation, we improve the CLIP image alignment score based on masks. Qualitatively and quantitatively, CatVersion helps to restore personalization concepts more faithfully and enables more robust editing.

* For the project page, please visit https://royzhao926.github.io/CatVersion-page/
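
The claim that embeddings concatenated onto the Keys and Values show up as a residual on the original attention output can be checked numerically. The sketch below assumes ordinary single-query scaled dot-product attention; the toy vectors and the names `extra_keys` and `extra_values` are hypothetical stand-ins for learned embeddings and are not taken from the CatVersion code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-query scaled dot-product attention; returns (output, weights)."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
    return out, weights


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
extra_keys = [[0.5, 0.5]]     # hypothetical learned embedding appended to the keys
extra_values = [[-1.0, 0.0]]  # hypothetical learned embedding appended to the values

base_out, _ = attend(query, keys, values)
cat_out, cat_weights = attend(query, keys + extra_keys, values + extra_values)
extra_out, _ = attend(query, extra_keys, extra_values)

# Attention mass that lands on the appended tokens.
mass = sum(cat_weights[len(keys):])
# Residual form: cat_out == base_out + mass * (extra_out - base_out).
residual = [mass * (e - b) for e, b in zip(extra_out, base_out)]
reconstructed = [b + r for b, r in zip(base_out, residual)]
```

Here `reconstructed` matches `cat_out` up to floating-point error, so the appended tokens act as an additive correction whose size is governed by the attention mass they receive.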

HFORD: High-Fidelity and Occlusion-Robust De-identification for Face Privacy Protection

Nov 15, 2023

Dongxin Chen, Mingrui Zhu, Nannan Wang, Xinbo Gao

Abstract: With the popularity of smart devices and the development of computer vision technology, concerns about face privacy protection are growing. The face de-identification technique is a practical way to solve the identity protection problem. The existing facial de-identification methods have revealed several problems, including the impact on the realism of anonymized results when faced with occlusions and the inability to maintain identity-irrelevant details in anonymized results. We present a High-Fidelity and Occlusion-Robust De-identification (HFORD) method to deal with these issues. This approach can disentangle identities and attributes while preserving image-specific details such as background, facial features (e.g., wrinkles), and lighting, even in occluded scenes. To disentangle the latent codes in the GAN inversion space, we introduce an Identity Disentanglement Module (IDM). This module selects the latent codes that are closely related to the identity. It further separates the latent codes into identity-related codes and attribute-related codes, enabling the network to preserve attributes while only modifying the identity. To ensure the preservation of image details and enhance the network's robustness to occlusions, we propose an Attribute Retention Module (ARM). This module adaptively preserves identity-irrelevant details and facial occlusions and blends them into the generated results in a modulated manner. Extensive experiments show that our method has higher quality, better detail fidelity, and stronger occlusion robustness than other face de-identification methods.

Diff-Privacy: Diffusion-based Face Privacy Protection

Sep 11, 2023

Xiao He, Mingrui Zhu, Dongxin Chen, Nannan Wang, Xinbo Gao

Abstract: Privacy protection has become a top priority as the proliferation of AI techniques has led to widespread collection and misuse of personal data. Anonymization and visual identity information hiding are two important facial privacy protection tasks that aim to remove identification characteristics from facial images at the human perception level. However, they have a significant difference in that the former aims to prevent the machine from recognizing correctly, while the latter needs to ensure the accuracy of machine recognition. Therefore, it is difficult to train a model to complete these two tasks simultaneously. In this paper, we unify the task of anonymization and visual identity information hiding and propose a novel face privacy protection method based on diffusion models, dubbed Diff-Privacy. Specifically, we train our proposed multi-scale image inversion module (MSI) to obtain a set of SDM format conditional embeddings of the original image. Based on the conditional embeddings, we design corresponding embedding scheduling strategies and construct different energy functions during the denoising process to achieve anonymization and visual identity information hiding. Extensive experiments have been conducted to validate the effectiveness of our proposed framework in protecting facial privacy.

* 17 pages

Adapt and Align to Improve Zero-Shot Sketch-Based Image Retrieval

May 18, 2023

Shiyin Dong, Mingrui Zhu, Nannan Wang, Heng Yang, Xinbo Gao

Abstract: Zero-shot sketch-based image retrieval (ZS-SBIR) is challenging due to the cross-domain nature of sketches and photos, as well as the semantic gap between seen and unseen image distributions. Previous methods fine-tune pre-trained models with various side information and learning strategies to learn a compact feature space that is shared between the sketch and photo domains and bridges seen and unseen classes. However, these efforts are inadequate in adapting domains and transferring knowledge from seen to unseen classes. In this paper, we present an effective "Adapt and Align" approach to address the key challenges. Specifically, we insert simple and lightweight domain adapters to learn new abstract concepts of the sketch domain and improve cross-domain representation capabilities. Inspired by recent advances in image-text foundation models (e.g., CLIP) on zero-shot scenarios, we explicitly align the learned image embedding with a more semantic text embedding to achieve the desired knowledge transfer from seen to unseen classes. Extensive experiments on three benchmark datasets and two popular backbones demonstrate the superiority of our method in terms of retrieval accuracy and flexibility.

* 13 pages, 8 figures, 6 tables
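
One common way to align an image embedding with class text embeddings, in the spirit of CLIP, is a cross-entropy loss over temperature-scaled cosine similarities. The sketch below is an illustration under that assumption only; the abstract does not specify the exact loss used, and the function name `alignment_loss`, the toy vectors, and the temperature of 0.07 are hypothetical choices.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def alignment_loss(image_emb, text_embs, target, temperature=0.07):
    """Cross-entropy over cosine similarities between one image embedding
    and one text embedding per class (assumed CLIP-style alignment)."""
    img = normalize(image_emb)
    logits = [
        sum(a * b for a, b in zip(img, normalize(t))) / temperature
        for t in text_embs
    ]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]


# Toy class text embeddings and a sketch or photo embedding close to class 0.
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
image_emb = [0.9, 0.1, 0.0]

loss_match = alignment_loss(image_emb, text_embs, target=0)
loss_mismatch = alignment_loss(image_emb, text_embs, target=1)
```

The loss is small when the image embedding points toward the text embedding of its own class and large otherwise. Since unseen classes also have text embeddings, the same similarity scores can be computed for them at test time.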
