Ning Yu

Infinite-Resolution Integral Noise Warping for Diffusion Models

Nov 02, 2024

Yitong Deng, Winnie Lin, Lingxiao Li, Dmitriy Smirnov, Ryan Burgert, Ning Yu, Vincent Dedun, Mohammad H. Taghavi

Abstract: Adapting pretrained image-based diffusion models to generate temporally consistent videos has become an impactful generative modeling research direction. Training-free noise-space manipulation has proven to be an effective technique, where the challenge is to preserve the Gaussian white noise distribution while adding in temporal consistency. Recently, Chang et al. (2024) formulated this problem using an integral noise representation with distribution-preserving guarantees, and proposed an upsampling-based algorithm to compute it. However, while their mathematical formulation is advantageous, the algorithm incurs a high computational cost. Through analyzing the limiting-case behavior of their algorithm as the upsampling resolution goes to infinity, we develop an alternative algorithm that, by gathering increments of multiple Brownian bridges, achieves their infinite-resolution accuracy while simultaneously reducing the computational cost by orders of magnitude. We prove and experimentally validate our theoretical claims, and demonstrate our method's effectiveness in real-world applications. We further show that our method readily extends to the 3-dimensional space.

DifFRelight: Diffusion-Based Facial Performance Relighting

Oct 10, 2024

Mingming He, Pascal Clausen, Ahmet Levent Taşel, Li Ma, Oliver Pilarski, Wenqi Xian, Laszlo Rikker, Xueming Yu, Ryan Burgert, Ning Yu(+1 more)

Abstract: We present a novel framework for free-viewpoint facial performance relighting using diffusion-based image-to-image translation. Leveraging a subject-specific dataset containing diverse facial expressions captured under various lighting conditions, including flat-lit and one-light-at-a-time (OLAT) scenarios, we train a diffusion model for precise lighting control, enabling high-fidelity relit facial images from flat-lit inputs. Our framework includes spatially-aligned conditioning of flat-lit captures and random noise, along with integrated lighting information for global control, utilizing prior knowledge from the pre-trained Stable Diffusion model. This model is then applied to dynamic facial performances captured in a consistent flat-lit environment and reconstructed for novel-view synthesis using a scalable dynamic 3D Gaussian Splatting method to maintain quality and consistency in the relit results. In addition, we introduce unified lighting control by integrating a novel area lighting representation with directional lighting, allowing for joint adjustments in light size and direction. We also enable high dynamic range imaging (HDRI) composition using multiple directional lights to produce dynamic sequences under complex lighting conditions. Our evaluations demonstrate the model's efficiency in achieving precise lighting control and generalizing across various facial expressions while preserving detailed features such as skin texture and hair. The model accurately reproduces complex lighting effects like eye reflections, subsurface scattering, self-shadowing, and translucency, advancing photorealism within our framework.

* 18 pages, SIGGRAPH Asia 2024 Conference Papers (SA Conference Papers '24), December 3–6, 2024, Tokyo, Japan. Project page: https://www.eyelinestudios.com/research/diffrelight.html

T2Vs Meet VLMs: A Scalable Multimodal Dataset for Visual Harmfulness Recognition

Sep 29, 2024

Chen Yeh, You-Ming Chang, Wei-Chen Chiu, Ning Yu

Abstract: To address the risks of encountering inappropriate or harmful content, researchers have combined several harmful-content datasets with machine learning methods to detect harmful concepts. However, existing harmful-content datasets are curated around the presence of a narrow range of harmful objects, and only cover real harmful content sources. This hinders the generalizability of methods based on such datasets, potentially leading to misjudgments. Therefore, we propose a comprehensive harmful dataset, Visual Harmful Dataset 11K (VHD11K), consisting of 10,000 images and 1,000 videos, crawled from the Internet and generated by 4 generative models, across a total of 10 harmful categories covering a full spectrum of harmful concepts with nontrivial definitions. We also propose a novel annotation framework by formulating the annotation process as a multi-agent Visual Question Answering (VQA) task, having 3 different VLMs "debate" about whether the given image/video is harmful, and incorporating the in-context learning strategy in the debating process. Therefore, we can ensure that the VLMs consider the context of the given image/video and both sides of the arguments thoroughly before making decisions, further reducing the likelihood of misjudgments in edge cases. Evaluation and experimental results demonstrate that (1) the annotations from our novel annotation framework align closely with those from humans, ensuring the reliability of VHD11K; (2) our full-spectrum harmful dataset exposes the inability of existing harmful content detection methods to detect a wide range of harmful content and improves the performance of existing harmfulness recognition methods; (3) VHD11K outperforms the baseline dataset, SMID, as evidenced by the larger improvement in harmfulness recognition methods. The complete dataset and code can be found at https://github.com/nctu-eva-lab/VHD11K.

Inside the Black Box: Detecting Data Leakage in Pre-trained Language Encoders

Aug 20, 2024

Yuan Xin, Zheng Li, Ning Yu, Dingfan Chen, Mario Fritz, Michael Backes, Yang Zhang

Abstract: Despite being prevalent in the general field of Natural Language Processing (NLP), pre-trained language models inherently carry privacy and copyright concerns due to their nature of training on large-scale web-scraped data. In this paper, we pioneer a systematic exploration of such risks associated with pre-trained language encoders, specifically focusing on the membership leakage of pre-training data exposed through downstream models adapted from pre-trained language encoders, an aspect largely overlooked in existing literature. Our study encompasses comprehensive experiments across four types of pre-trained encoder architectures, three representative downstream tasks, and five benchmark datasets. Intriguingly, our evaluations reveal, for the first time, the existence of membership leakage even when only the black-box output of the downstream model is exposed, highlighting a privacy risk far greater than previously assumed. Alongside, we present in-depth analysis and insights toward guiding future researchers and practitioners in addressing the privacy considerations in developing pre-trained language models.

* ECAI24

xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

Aug 16, 2024

Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S Ryoo(+17 more)

Abstract: This report introduces xGen-MM (also known as BLIP-3), a framework for developing Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs. xGen-MM, short for xGen-MultiModal, expands the Salesforce xGen initiative on foundation AI models. Our models undergo rigorous evaluation across a range of tasks, including both single and multi-image benchmarks. Our pre-trained base model exhibits strong in-context learning capabilities and the instruction-tuned model demonstrates competitive performance among open-source LMMs with similar model sizes. In addition, we introduce a safety-tuned model with DPO, aiming to mitigate harmful behaviors such as hallucinations and improve safety. We open-source our models, curated large-scale datasets, and our fine-tuning codebase to facilitate further advancements in LMM research. Associated resources will be available on our project page above.

Membership Inference Attack Against Masked Image Modeling

Aug 13, 2024

Zheng Li, Xinlei He, Ning Yu, Yang Zhang

Abstract: Masked Image Modeling (MIM) has achieved significant success in the realm of self-supervised learning (SSL) for visual recognition. The image encoder pre-trained through MIM, involving the masking and subsequent reconstruction of input images, attains state-of-the-art performance in various downstream vision tasks. However, most existing works focus on improving the performance of MIM. In this work, we take a different angle by studying the pre-training data privacy of MIM. Specifically, we propose the first membership inference attack against image encoders pre-trained by MIM, which aims to determine whether an image is part of the MIM pre-training dataset. The key design is to simulate the pre-training paradigm of MIM, i.e., image masking and subsequent reconstruction, and then obtain reconstruction errors. These reconstruction errors can serve as membership signals for achieving attack goals, as the encoder is more capable of reconstructing the input image in its training set with lower errors. Extensive evaluations are conducted on three model architectures and three benchmark datasets. Empirical results show that our attack outperforms baseline methods. Additionally, we undertake intricate ablation studies to analyze multiple factors that could influence the performance of the attack.
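
The idea of using reconstruction error as a membership signal can be illustrated with a toy sketch. This is not the paper's attack: the function names, the flat list-of-floats "image", the `reconstruct` callable standing in for an MIM encoder-decoder, and the fixed threshold are all assumptions made for illustration.

```python
import random

def masked_reconstruction_error(image, reconstruct, mask_ratio=0.5, trials=8, seed=0):
    """Average squared error on masked positions over several random masks.

    `image` is a flat list of floats; `reconstruct` is a hypothetical callable
    that receives the image with masked entries set to None and returns a
    full-length list of floats.
    """
    rng = random.Random(seed)
    n = len(image)
    n_masked = max(1, int(mask_ratio * n))
    total = 0.0
    for _ in range(trials):
        masked_idx = rng.sample(range(n), n_masked)
        visible = list(image)
        for i in masked_idx:
            visible[i] = None  # hide the masked entries from the model
        recon = reconstruct(visible)
        total += sum((recon[i] - image[i]) ** 2 for i in masked_idx) / n_masked
    return total / trials

def is_member(image, reconstruct, threshold):
    """Predict 'member' when the reconstruction error falls below the threshold."""
    return masked_reconstruction_error(image, reconstruct) < threshold

# Toy stand-in for a model that has memorized one training image.
memorized = [0.1, 0.9, 0.4, 0.7]

def toy_reconstruct(visible):
    return [memorized[i] if v is None else v for i, v in enumerate(visible)]
```

In this toy setup the memorized image reconstructs perfectly and is flagged as a member, while an unseen image yields a larger error. A real attack would need a principled way to choose the threshold.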

Jailbreaking Text-to-Image Models with LLM-Based Agents

Aug 01, 2024

Yingkai Dong, Zheng Li, Xiangtao Meng, Ning Yu, Shanqing Guo

Abstract: Recent advancements have significantly improved automated task-solving capabilities using autonomous agents powered by large language models (LLMs). However, most LLM-based agents focus on dialogue, programming, or specialized domains, leaving gaps in addressing generative AI safety tasks. These gaps are primarily due to the challenges posed by LLM hallucinations and the lack of clear guidelines. In this paper, we propose Atlas, an advanced LLM-based multi-agent framework that integrates an efficient fuzzing workflow to target generative AI models, specifically focusing on jailbreak attacks against text-to-image (T2I) models with safety filters. Atlas utilizes a vision-language model (VLM) to assess whether a prompt triggers the T2I model's safety filter. It then iteratively collaborates with both LLM and VLM to generate an alternative prompt that bypasses the filter. Atlas also enhances the reasoning abilities of LLMs in attack scenarios by leveraging multi-agent communication, in-context learning (ICL) memory mechanisms, and the chain-of-thought (COT) approach. Our evaluation demonstrates that Atlas successfully jailbreaks several state-of-the-art T2I models equipped with multi-modal safety filters in a black-box setting. In addition, Atlas outperforms existing methods in both query efficiency and the quality of the generated images.

JIGMARK: A Black-Box Approach for Enhancing Image Watermarks against Diffusion Model Edits

Jun 06, 2024

Minzhou Pan, Yi Zeng, Xue Lin, Ning Yu, Cho-Jui Hsieh, Peter Henderson, Ruoxi Jia

Abstract: In this study, we investigate the vulnerability of image watermarks to diffusion-model-based image editing, a challenge exacerbated by the computational cost of accessing gradient information and the closed-source nature of many diffusion models. To address this issue, we introduce JIGMARK. This first-of-its-kind watermarking technique enhances robustness through contrastive learning with pairs of images, processed and unprocessed by diffusion models, without needing a direct backpropagation of the diffusion process. Our evaluation reveals that JIGMARK significantly surpasses existing watermarking solutions in resilience to diffusion-model edits, demonstrating a True Positive Rate more than triple that of leading baselines at a 1% False Positive Rate while preserving image quality. At the same time, it consistently improves the robustness against other conventional perturbations (like JPEG, blurring, etc.) and malicious watermark attacks over the state-of-the-art, often by a large margin. Furthermore, we propose the Human Aligned Variation (HAV) score, a new metric that surpasses traditional similarity measures in quantifying the number of image derivatives from image editing.
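
For readers unfamiliar with the "True Positive Rate at a 1% False Positive Rate" metric, the sketch below shows one common way to compute it from detector scores. It is an illustrative helper, not JIGMARK's evaluation code: the name `tpr_at_fpr` and the convention that higher scores mean "watermark detected" are assumptions.

```python
def tpr_at_fpr(neg_scores, pos_scores, target_fpr=0.01):
    """TPR at the strictest threshold whose FPR on negatives stays <= target_fpr.

    neg_scores: detector scores for unwatermarked images (negatives).
    pos_scores: detector scores for watermarked images (positives).
    A sample is flagged when its score is strictly above the threshold.
    """
    neg_sorted = sorted(neg_scores, reverse=True)
    # Maximum number of negatives we are allowed to flag.
    allowed = int(target_fpr * len(neg_sorted))
    if allowed >= len(neg_sorted):
        threshold = float("-inf")  # every negative may be flagged
    else:
        # Flagging scores strictly above this value flags at most `allowed` negatives.
        threshold = neg_sorted[allowed]
    flagged = sum(1 for s in pos_scores if s > threshold)
    return flagged / len(pos_scores)
```

For example, with 100 negative scores the threshold is set so that at most one negative is flagged, and the returned value is the fraction of positives that still score above it.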

Detecting Adversarial Data via Perturbation Forgery

May 25, 2024

Qian Wang, Chen Li, Yuchen Luo, Hefei Ling, Ping Li, Jiazhong Chen, Shijuan Huang, Ning Yu

Abstract: As a defense strategy against adversarial attacks, adversarial detection aims to identify and filter out adversarial data from the data flow based on discrepancies in distribution and noise patterns between natural and adversarial data. Although previous detection methods achieve high performance in detecting gradient-based adversarial attacks, new attacks based on generative models with imbalanced and anisotropic noise patterns evade detection. Even worse, existing techniques either necessitate access to attack data before deploying a defense or incur a significant time cost for inference, rendering them impractical for defending against newly emerging attacks that are unseen by defenders. In this paper, we explore the proximity relationship between adversarial noise distributions and demonstrate the existence of an open covering for them. By learning to distinguish this open covering from the distribution of natural data, we can develop a detector with strong generalization capabilities against all types of adversarial attacks. Based on this insight, we heuristically propose Perturbation Forgery, which includes noise distribution perturbation, sparse mask generation, and pseudo-adversarial data production, to train an adversarial detector capable of detecting unseen gradient-based, generative-model-based, and physical adversarial attacks, while remaining agnostic to any specific models. Comprehensive experiments conducted on multiple general and facial datasets, with a wide spectrum of attacks, validate the strong generalization of our method.

FASTTRACK: Fast and Accurate Fact Tracing for LLMs

Apr 22, 2024

Si Chen, Feiyang Kang, Ning Yu, Ruoxi Jia

Abstract: Fact tracing seeks to identify specific training examples that serve as the knowledge source for a given query. Existing approaches to fact tracing rely on assessing the similarity between each training sample and the query along a certain dimension, such as lexical similarity, gradient, or embedding space. However, these methods fall short of effectively distinguishing between samples that are merely relevant and those that actually provide supportive evidence for the information sought by the query. This limitation often results in suboptimal effectiveness. Moreover, these approaches necessitate the examination of the similarity of individual training points for each query, imposing significant computational demands and creating a substantial barrier for practical applications. This paper introduces FASTTRACK, a novel approach that harnesses the capabilities of Large Language Models (LLMs) to validate supportive evidence for queries and at the same time clusters the training database towards a reduced extent for LLMs to trace facts. Our experiments show that FASTTRACK substantially outperforms existing methods in both accuracy and efficiency, achieving more than 100% improvement in F1 score over the state-of-the-art methods while being 33x faster than TracIn.
