Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Chang Xu

Investigating Memorization in Video Diffusion Models

Oct 29, 2024

Chen Chen, Enhuai Liu, Daochang Liu, Mubarak Shah, Chang Xu

Figure 1 for Investigating Memorization in Video Diffusion Models

Figure 2 for Investigating Memorization in Video Diffusion Models

Figure 3 for Investigating Memorization in Video Diffusion Models

Figure 4 for Investigating Memorization in Video Diffusion Models

Abstract:Diffusion models, widely used for image and video generation, face a significant limitation: the risk of memorizing and reproducing training data during inference, potentially generating unauthorized copyrighted content. While prior research has focused on image diffusion models (IDMs), video diffusion models (VDMs) remain underexplored. To address this gap, we first formally define the two types of memorization in VDMs (content memorization and motion memorization) in a practical way that focuses on privacy preservation and applies to all generation types. We then introduce new metrics specifically designed to separately assess content and motion memorization in VDMs. Additionally, we curate a dataset of text prompts that are most prone to triggering memorization when used as conditioning in VDMs. By leveraging these prompts, we generate diverse videos from various open-source VDMs, successfully extracting numerous training videos from each tested model. Through the application of our proposed metrics, we systematically analyze memorization across various pretrained VDMs, including text-conditional and unconditional models, on a variety of datasets. Our comprehensive study reveals that memorization is widespread across all tested VDMs, indicating that VDMs can also memorize image training data in addition to video datasets. Finally, we propose efficient and effective detection strategies for both content and motion memorization, offering a foundational approach for improving privacy in VDMs.

* Preprint

Via

Access Paper or Ask Questions

Exploring Local Memorization in Diffusion Models via Bright Ending Attention

Oct 29, 2024

Chen Chen, Daochang Liu, Mubarak Shah, Chang Xu

Abstract:In this paper, we identify and leverage a novel `bright ending' (BE) anomaly in diffusion models prone to memorizing training images to address a new task: locating localized memorization regions within these models. BE refers to a distinct cross-attention pattern observed in text-to-image generations using diffusion models. Specifically, memorized image patches exhibit significantly greater attention to the end token during the final inference step compared to non-memorized patches. This attention map effectively highlights regions where the generated image replicates training data. Furthermore, driven by our observation that local memorization significantly underperforms in existing tasks of measuring, detecting, and mitigating memorization in diffusion models compared to global memorization, we propose a simple yet effective method to integrate BE and the results of the new localization task into these existing frameworks. This integration effectively improves their performances by narrowing the performance gap caused by local memorization. Our results not only demonstrate the successful execution of the new localization task but also establish new state-of-the-art performance across all existing tasks, underscoring the significance of the BE phenomenon.

* Preprint

Via

Access Paper or Ask Questions

Diffusion Attribution Score: Evaluating Training Data Influence in Diffusion Model

Oct 24, 2024

Jinxu Lin, Linwei Tao, Minjing Dong, Chang Xu

Figure 1 for Diffusion Attribution Score: Evaluating Training Data Influence in Diffusion Model

Figure 2 for Diffusion Attribution Score: Evaluating Training Data Influence in Diffusion Model

Figure 3 for Diffusion Attribution Score: Evaluating Training Data Influence in Diffusion Model

Figure 4 for Diffusion Attribution Score: Evaluating Training Data Influence in Diffusion Model

Abstract:As diffusion models become increasingly popular, the misuse of copyrighted and private images has emerged as a major concern. One promising solution to mitigate this issue is identifying the contribution of specific training samples in generative models, a process known as data attribution. Existing data attribution methods for diffusion models typically quantify the contribution of a training sample by evaluating the change in diffusion loss when the sample is included or excluded from the training process. However, we argue that the direct usage of diffusion loss cannot represent such a contribution accurately due to the calculation of diffusion loss. Specifically, these approaches measure the divergence between predicted and ground truth distributions, which leads to an indirect comparison between the predicted distributions and cannot represent the variances between model behaviors. To address these issues, we aim to measure the direct comparison between predicted distributions with an attribution score to analyse the training sample importance, which is achieved by Diffusion Attribution Score (DAS). Underpinned by rigorous theoretical analysis, we elucidate the effectiveness of DAS. Additionally, we explore strategies to accelerate DAS calculations, facilitating its application to large-scale diffusion models. Our extensive experiments across various datasets and diffusion models demonstrate that DAS significantly surpasses previous benchmarks in terms of the linear data-modelling score, establishing new state-of-the-art performance.

Via

Access Paper or Ask Questions

Consistency Calibration: Improving Uncertainty Calibration via Consistency among Perturbed Neighbors

Oct 16, 2024

Linwei Tao, Haolan Guo, Minjing Dong, Chang Xu

Abstract:Calibration is crucial in deep learning applications, especially in fields like healthcare and autonomous driving, where accurate confidence estimates are vital for decision-making. However, deep neural networks often suffer from miscalibration, with reliability diagrams and Expected Calibration Error (ECE) being the only standard perspective for evaluating calibration performance. In this paper, we introduce the concept of consistency as an alternative perspective on model calibration, inspired by uncertainty estimation literature in large language models (LLMs). We highlight its advantages over the traditional reliability-based view. Building on this concept, we propose a post-hoc calibration method called Consistency Calibration (CC), which adjusts confidence based on the model's consistency across perturbed inputs. CC is particularly effective in locally uncertainty estimation, as it requires no additional data samples or label information, instead generating input perturbations directly from the source data. Moreover, we show that performing perturbations at the logit level significantly improves computational efficiency. We validate the effectiveness of CC through extensive comparisons with various post-hoc and training-time calibration methods, demonstrating state-of-the-art performance on standard datasets such as CIFAR-10, CIFAR-100, and ImageNet, as well as on long-tailed datasets like ImageNet-LT.

Via

Access Paper or Ask Questions

Training-free Long Video Generation with Chain of Diffusion Model Experts

Aug 27, 2024

Wenhao Li, Yichao Cao, Xiu Su, Xi Lin, Shan You, Mingkai Zheng, Yi Chen, Chang Xu

Abstract:Video generation models hold substantial potential in areas such as filmmaking. However, current video diffusion models need high computational costs and produce suboptimal results due to high complexity of video generation task. In this paper, we propose \textbf{ConFiner}, an efficient high-quality video generation framework that decouples video generation into easier subtasks: structure \textbf{con}trol and spatial-temporal re\textbf{fine}ment. It can generate high-quality videos with chain of off-the-shelf diffusion model experts, each expert responsible for a decoupled subtask. During the refinement, we introduce coordinated denoising, which can merge multiple diffusion experts' capabilities into a single sampling. Furthermore, we design ConFiner-Long framework, which can generate long coherent video with three constraint strategies on ConFiner. Experimental results indicate that with only 10\% of the inference cost, our ConFiner surpasses representative models like Lavie and Modelscope across all objective and subjective metrics. And ConFiner-Long can generate high-quality and coherent videos with up to 600 frames.

Via

Access Paper or Ask Questions

FusionSAM: Latent Space driven Segment Anything Model for Multimodal Fusion and Segmentation

Aug 26, 2024

Daixun Li, Weiying Xie, Mingxiang Cao, Yunke Wang, Jiaqing Zhang, Yunsong Li, Leyuan Fang, Chang Xu

Figure 1 for FusionSAM: Latent Space driven Segment Anything Model for Multimodal Fusion and Segmentation

Figure 2 for FusionSAM: Latent Space driven Segment Anything Model for Multimodal Fusion and Segmentation

Figure 3 for FusionSAM: Latent Space driven Segment Anything Model for Multimodal Fusion and Segmentation

Figure 4 for FusionSAM: Latent Space driven Segment Anything Model for Multimodal Fusion and Segmentation

Abstract:Multimodal image fusion and segmentation enhance scene understanding in autonomous driving by integrating data from various sensors. However, current models struggle to efficiently segment densely packed elements in such scenes, due to the absence of comprehensive fusion features that can guide mid-process fine-tuning and focus attention on relevant areas. The Segment Anything Model (SAM) has emerged as a transformative segmentation method. It provides more effective prompts through its flexible prompt encoder, compared to transformers lacking fine-tuned control. Nevertheless, SAM has not been extensively studied in the domain of multimodal fusion for natural images. In this paper, we introduce SAM into multimodal image segmentation for the first time, proposing a novel framework that combines Latent Space Token Generation (LSTG) and Fusion Mask Prompting (FMP) modules to enhance SAM's multimodal fusion and segmentation capabilities. Specifically, we first obtain latent space features of the two modalities through vector quantization and embed them into a cross-attention-based inter-domain fusion module to establish long-range dependencies between modalities. Then, we use these comprehensive fusion features as prompts to guide precise pixel-level segmentation. Extensive experiments on several public datasets demonstrate that the proposed method significantly outperforms SAM and SAM2 in multimodal autonomous driving scenarios, achieving at least 3.9$\%$ higher segmentation mIoU than the state-of-the-art approaches.

Via

Access Paper or Ask Questions

Compress Guidance in Conditional Diffusion Sampling

Aug 20, 2024

Anh-Dung Dinh, Daochang Liu, Chang Xu

Abstract:Enforcing guidance throughout the entire sampling process often proves counterproductive due to the model-fitting issue., where samples are generated to match the classifier's parameters rather than generalizing the expected condition. This work identifies and quantifies the problem, demonstrating that reducing or excluding guidance at numerous timesteps can mitigate this issue. By distributing the guidance densely in the early stages of the process, we observe a significant improvement in image quality and diversity while also reducing the required guidance timesteps by nearly 40%. This approach addresses a major challenge in applying guidance effectively to generative tasks. Consequently, our proposed method, termed Compress Guidance, allows for the exclusion of a substantial number of guidance timesteps while still surpassing baseline models in image quality. We validate our approach through benchmarks on label conditional and text-to-image generative tasks across various datasets and models.

* 10 pages, 5 figures, Computer Vision and Machine Learning

Via

Access Paper or Ask Questions

Efficient Image-to-Image Diffusion Classifier for Adversarial Robustness

Aug 16, 2024

Hefei Mei, Minjing Dong, Chang Xu

Abstract:Diffusion models (DMs) have demonstrated great potential in the field of adversarial robustness, where DM-based defense methods can achieve superior defense capability without adversarial training. However, they all require huge computational costs due to the usage of large-scale pre-trained DMs, making it difficult to conduct full evaluation under strong attacks and compare with traditional CNN-based methods. Simply reducing the network size and timesteps in DMs could significantly harm the image generation quality, which invalidates previous frameworks. To alleviate this issue, we redesign the diffusion framework from generating high-quality images to predicting distinguishable image labels. Specifically, we employ an image translation framework to learn many-to-one mapping from input samples to designed orthogonal image labels. Based on this framework, we introduce an efficient Image-to-Image diffusion classifier with a pruned U-Net structure and reduced diffusion timesteps. Besides the framework, we redesign the optimization objective of DMs to fit the target of image classification, where a new classification loss is incorporated in the DM-based image translation framework to distinguish the generated label from those of other classes. We conduct sufficient evaluations of the proposed classifier under various attacks on popular benchmarks. Extensive experiments show that our method achieves better adversarial robustness with fewer computational costs than DM-based and CNN-based methods. The code is available at https://github.com/hfmei/IDC.

Via

Access Paper or Ask Questions

VSSD: Vision Mamba with Non-Causal State Space Duality

Aug 04, 2024

Yuheng Shi, Minjing Dong, Mingjia Li, Chang Xu

Figure 1 for VSSD: Vision Mamba with Non-Causal State Space Duality

Figure 2 for VSSD: Vision Mamba with Non-Causal State Space Duality

Figure 3 for VSSD: Vision Mamba with Non-Causal State Space Duality

Figure 4 for VSSD: Vision Mamba with Non-Causal State Space Duality

Abstract:Vision transformers have significantly advanced the field of computer vision, offering robust modeling capabilities and global receptive field. However, their high computational demands limit their applicability in processing long sequences. To tackle this issue, State Space Models (SSMs) have gained prominence in vision tasks as they offer linear computational complexity. Recently, State Space Duality (SSD), an improved variant of SSMs, was introduced in Mamba2 to enhance model performance and efficiency. However, the inherent causal nature of SSD/SSMs restricts their applications in non-causal vision tasks. To address this limitation, we introduce Visual State Space Duality (VSSD) model, which has a non-causal format of SSD. Specifically, we propose to discard the magnitude of interactions between the hidden state and tokens while preserving their relative weights, which relieves the dependencies of token contribution on previous tokens. Together with the involvement of multi-scan strategies, we show that the scanning results can be integrated to achieve non-causality, which not only improves the performance of SSD in vision tasks but also enhances its efficiency. We conduct extensive experiments on various benchmarks including image classification, detection, and segmentation, where VSSD surpasses existing state-of-the-art SSM-based models. Code and weights are available at \url{https://github.com/YuHengsss/VSSD}.

* 16 pages, 5 figures, 7 tables

Via

Access Paper or Ask Questions

VSSD: Vision Mamba with Non-Casual State Space Duality

Jul 26, 2024

Yuheng Shi, Minjing Dong, Mingjia Li, Chang Xu

Figure 1 for VSSD: Vision Mamba with Non-Casual State Space Duality

Figure 2 for VSSD: Vision Mamba with Non-Casual State Space Duality

Figure 3 for VSSD: Vision Mamba with Non-Casual State Space Duality

Figure 4 for VSSD: Vision Mamba with Non-Casual State Space Duality

* 16 pages, 5 figures, 7 tables

Via

Access Paper or Ask Questions