Richard Zhang

Customizing Text-to-Image Diffusion with Camera Viewpoint Control

Apr 18, 2024

Nupur Kumari, Grace Su, Richard Zhang, Taesung Park, Eli Shechtman, Jun-Yan Zhu

Abstract: Model customization introduces new concepts to existing text-to-image models, enabling the generation of the new concept in novel contexts. However, such methods lack accurate camera view control w.r.t. the object, and users must resort to prompt engineering (e.g., adding "top-view") to achieve coarse view control. In this work, we introduce a new task -- enabling explicit control of camera viewpoint for model customization. This allows us to modify object properties amongst various background scenes via text prompts, all while incorporating the target camera pose as additional control. This new task presents significant challenges in merging a 3D representation from the multi-view images of the new concept with a general, 2D text-to-image model. To bridge this gap, we propose to condition the 2D diffusion process on rendered, view-dependent features of the new object. During training, we jointly adapt the 2D diffusion modules and 3D feature predictions to reconstruct the object's appearance and geometry while reducing overfitting to the input multi-view images. Our method outperforms existing image editing and model personalization baselines in preserving the custom object's identity while following the input text prompt and the object's camera pose.

* project page: https://customdiffusion360.github.io

VideoGigaGAN: Towards Detail-rich Video Super-Resolution

Apr 18, 2024

Yiran Xu, Taesung Park, Richard Zhang, Yang Zhou, Eli Shechtman, Feng Liu, Jia-Bin Huang, Difan Liu

Abstract: Video super-resolution (VSR) approaches have shown impressive temporal consistency in upsampled videos. However, these approaches tend to generate blurrier results than their image counterparts as they are limited in their generative capability. This raises a fundamental question: can we extend the success of a generative image upsampler to the VSR task while preserving the temporal consistency? We introduce VideoGigaGAN, a new generative VSR model that can produce videos with high-frequency details and temporal consistency. VideoGigaGAN builds upon a large-scale image upsampler -- GigaGAN. Simply inflating GigaGAN to a video model by adding temporal modules produces severe temporal flickering. We identify several key issues and propose techniques that significantly improve the temporal consistency of upsampled videos. Our experiments show that, unlike previous VSR methods, VideoGigaGAN generates temporally consistent videos with more fine-grained appearance details. We validate the effectiveness of VideoGigaGAN by comparing it with state-of-the-art VSR models on public datasets and showcasing video results with $8\times$ super-resolution.

* project page: https://videogigagan.github.io/

Jump Cut Smoothing for Talking Heads

Jan 11, 2024

Xiaojuan Wang, Taesung Park, Yang Zhou, Eli Shechtman, Richard Zhang

Abstract: A jump cut offers an abrupt, sometimes unwanted change in the viewing experience. We present a novel framework for smoothing these jump cuts in the context of talking head videos. We leverage the appearance of the subject from the other source frames in the video, fusing it with a mid-level representation driven by DensePose keypoints and face landmarks. To achieve motion, we interpolate the keypoints and landmarks between the end frames around the cut. We then use an image translation network to synthesize pixels from the keypoints and source frames. Because keypoints can contain errors, we propose a cross-modal attention scheme to select the most appropriate source amongst multiple options for each keypoint. By leveraging this mid-level representation, our method can achieve stronger results than a strong video interpolation baseline. We demonstrate our method on various jump cuts in talking head videos, such as cutting filler words, pauses, and even random cuts. Our experiments show that we can achieve seamless transitions, even in the challenging cases where the talking head rotates or moves drastically in the jump cut.

* Project page: https://jeanne-wang.github.io/jumpcutsmoothing/
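
The keypoint interpolation step mentioned in the abstract above can be illustrated with a minimal sketch. The abstract only says that keypoints and landmarks are interpolated between the end frames around the cut; linear interpolation, the function name `interpolate_keypoints`, and the `(x, y)` tuple layout are assumptions made here for illustration, not the authors' implementation.

```python
def interpolate_keypoints(start, end, num_frames):
    """Linearly interpolate 2D keypoints between two end frames.

    start, end: lists of (x, y) tuples, one per keypoint, in the same order.
    num_frames: number of intermediate frames to create between the end frames.
    Returns a list of num_frames keypoint lists (end frames excluded).
    Linear blending is an assumption; the abstract does not name the scheme.
    """
    if len(start) != len(end):
        raise ValueError("start and end must contain the same number of keypoints")
    frames = []
    for i in range(1, num_frames + 1):
        # t runs strictly between 0 and 1 so the end frames are not duplicated.
        t = i / (num_frames + 1)
        frames.append([
            ((1 - t) * x0 + t * x1, (1 - t) * y0 + t * y1)
            for (x0, y0), (x1, y1) in zip(start, end)
        ])
    return frames


if __name__ == "__main__":
    # Hypothetical single keypoint moving across a jump cut.
    for frame in interpolate_keypoints([(0.0, 0.0)], [(4.0, 8.0)], 3):
        print(frame)
```

In the described pipeline, keypoints like these would then drive the image translation network together with the source frames; that synthesis step is not sketched here.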

Customizing Motion in Text-to-Video Diffusion Models

Dec 07, 2023

Joanna Materzynska, Josef Sivic, Eli Shechtman, Antonio Torralba, Richard Zhang, Bryan Russell

Abstract: We introduce an approach for augmenting text-to-video generation models with customized motions, extending their capabilities beyond the motions depicted in the original training data. By leveraging a few video samples demonstrating specific movements as input, our method learns and generalizes the input motion patterns for diverse, text-specified scenarios. Our contributions are threefold. First, to achieve our results, we finetune an existing text-to-video model to learn a novel mapping from the depicted motion in the input examples to a new unique token. To avoid overfitting to the new custom motion, we introduce an approach for regularization over videos. Second, by leveraging the motion priors in a pretrained model, our method can produce novel videos featuring multiple people doing the custom motion, and can invoke the motion in combination with other motions. Furthermore, our approach extends to the multimodal customization of motion and appearance of individualized subjects, enabling the generation of videos featuring unique characters and distinct motions. Third, to validate our method, we introduce an approach for quantitatively evaluating the learned custom motion and perform a systematic ablation study. We show that our method significantly outperforms prior appearance-based customization approaches when extended to the motion customization task.

* Project page: https://joaanna.github.io/customizing_motion/

One-step Diffusion with Distribution Matching Distillation

Dec 05, 2023

Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T. Freeman, Taesung Park

Abstract: Diffusion models generate high-quality images but require dozens of forward passes. We introduce Distribution Matching Distillation (DMD), a procedure to transform a diffusion model into a one-step image generator with minimal impact on image quality. We enforce that the one-step image generator matches the diffusion model at the distribution level, by minimizing an approximate KL divergence whose gradient can be expressed as the difference between two score functions, one of the target distribution and the other of the synthetic distribution being produced by our one-step generator. The score functions are parameterized as two diffusion models trained separately on each distribution. Combined with a simple regression loss matching the large-scale structure of the multi-step diffusion outputs, our method outperforms all published few-step diffusion approaches, reaching 2.62 FID on ImageNet 64x64 and 11.49 FID on zero-shot COCO-30k, comparable to Stable Diffusion but orders of magnitude faster. Utilizing FP16 inference, our model generates images at 20 FPS on modern hardware.

* Project page: https://tianweiy.github.io/dmd/
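
The statement in the DMD abstract that the KL gradient is a difference between two score functions can be made concrete with a toy sketch. Everything below is an illustrative assumption rather than the paper's implementation: the "generator" is a 1D shift `x = theta + z`, both distributions are unit-variance Gaussians, and the two scores are written analytically instead of being parameterized by diffusion models. Under those assumptions, the gradient of KL(fake || real) with respect to `theta` is the expectation of `(fake_score(x) - real_score(x)) * dx/dtheta`.

```python
import random


def real_score(x, mu_real):
    # Score (gradient of the log-density) of the target N(mu_real, 1) at x.
    return -(x - mu_real)


def fake_score(x, theta):
    # Score of the generator's output distribution N(theta, 1) at x.
    return -(x - theta)


def generator(theta, z):
    # Toy one-step generator: shifts noise z by a learnable offset theta.
    return theta + z


def distribution_matching_grad(theta, mu_real, batch_size=256, rng=None):
    """Monte Carlo estimate of d KL(fake || real) / d theta."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(batch_size):
        x = generator(theta, rng.gauss(0.0, 1.0))
        dx_dtheta = 1.0  # derivative of theta + z with respect to theta
        total += (fake_score(x, theta) - real_score(x, mu_real)) * dx_dtheta
    return total / batch_size


def train(theta=-3.0, mu_real=2.0, lr=0.1, steps=200):
    # Plain gradient descent on theta using the score-difference gradient.
    rng = random.Random(0)
    for _ in range(steps):
        theta -= lr * distribution_matching_grad(theta, mu_real, rng=rng)
    return theta


if __name__ == "__main__":
    print(train())  # theta should approach mu_real
```

In this toy case the score difference reduces to `theta - mu_real` for every sample, so descent pulls the generator's mean onto the target mean. The regression loss mentioned in the abstract is not modeled here.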

Online Detection of AI-Generated Images

Oct 23, 2023

David C. Epstein, Ishan Jain, Oliver Wang, Richard Zhang

Abstract: With advancements in AI-generated images coming on a continuous basis, it is increasingly difficult to distinguish traditionally-sourced images (e.g., photos, artwork) from AI-generated ones. Previous detection methods study the generalization from a single generator to another in isolation. However, in reality, new generators are released on a streaming basis. We study generalization in this setting, training on N models and testing on the next (N+k), following the historical release dates of well-known generation methods. Furthermore, images increasingly consist of both real and generated components, for example through image inpainting. Thus, we extend this approach to pixel prediction, demonstrating strong performance using automatically-generated inpainted data. In addition, for settings where commercial models are not publicly available for automatic data generation, we evaluate if pixel detectors can be trained solely on whole synthetic images.

* ICCV DeepFake Analysis and Detection Workshop, 2023
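
The streaming evaluation protocol in the abstract above (train on the first N generators, test on later ones, ordered by release date) can be sketched as a split generator. This is one possible reading of "testing on the next (N+k)": here the test set is taken to be the `k` generators released immediately after the N training ones. The function name, the generator names, and the dates are all hypothetical placeholders, not data from the paper.

```python
from datetime import date


def streaming_splits(generators, k=1):
    """Build (train, test) splits that respect release order.

    generators: list of (name, release_date) tuples in any order.
    k: how many subsequently released generators form each test set
       (an assumed interpretation of the protocol).
    Returns a list of (train_names, test_names) tuples.
    """
    ordered = sorted(generators, key=lambda item: item[1])
    names = [name for name, _ in ordered]
    splits = []
    for n in range(1, len(names) - k + 1):
        # Train on the first n generators, test on the next k.
        splits.append((names[:n], names[n:n + k]))
    return splits


if __name__ == "__main__":
    # Placeholder generators with made-up release dates.
    demo = [
        ("gen_c", date(2022, 8, 1)),
        ("gen_a", date(2019, 1, 1)),
        ("gen_b", date(2021, 5, 1)),
        ("gen_d", date(2023, 3, 1)),
    ]
    for train_names, test_names in streaming_splits(demo, k=1):
        print(train_names, "->", test_names)
```

Sorting by date before splitting keeps any later-released generator out of the training set, which is the property the streaming setting depends on.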

DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data

Jun 26, 2023

Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, Phillip Isola

Abstract: Current perceptual similarity metrics operate at the level of pixels and patches. These metrics compare images in terms of their low-level colors and textures, but fail to capture mid-level similarities and differences in image layout, object pose, and semantic content. In this paper, we develop a perceptual metric that assesses images holistically. Our first step is to collect a new dataset of human similarity judgments over image pairs that are alike in diverse ways. Critical to this dataset is that judgments are nearly automatic and shared by all observers. To achieve this, we use recent text-to-image models to create synthetic pairs that are perturbed along various dimensions. We observe that popular perceptual metrics fall short of explaining our new data, and we introduce a new metric, DreamSim, tuned to better align with human perception. We analyze how our metric is affected by different visual attributes, and find that it focuses heavily on foreground objects and semantic content while also being sensitive to color and layout. Notably, despite being trained on synthetic data, our metric generalizes to real images, giving strong results on retrieval and reconstruction tasks. Furthermore, our metric outperforms both prior learned metrics and recent large vision models on these tasks.

* Website: https://dreamsim-nights.github.io/ Code: https://github.com/ssundaram21/dreamsim

Evaluating Data Attribution for Text-to-Image Models

Jun 15, 2023

Sheng-Yu Wang, Alexei A. Efros, Jun-Yan Zhu, Richard Zhang

Abstract: While large text-to-image models are able to synthesize "novel" images, these images are necessarily a reflection of the training data. The problem of data attribution in such models -- which of the images in the training set are most responsible for the appearance of a given generated image -- is a difficult yet important one. As an initial step toward this problem, we evaluate attribution through "customization" methods, which tune an existing large-scale model toward a given exemplar object or style. Our key insight is that this allows us to efficiently create synthetic images that are computationally influenced by the exemplar by construction. With our new dataset of such exemplar-influenced images, we are able to evaluate various data attribution algorithms and different possible feature spaces. Furthermore, by training on our dataset, we can tune standard models, such as DINO, CLIP, and ViT, toward the attribution problem. Even though the procedure is tuned towards small exemplar sets, we show generalization to larger sets. Finally, by taking into account the inherent uncertainty of the problem, we can assign soft attribution scores over a set of training images.

* Project page: https://peterwang512.github.io/GenDataAttribution
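
As a rough illustration of what "soft attribution scores over a set of training images" could look like, the sketch below turns feature similarities into a probability distribution. The abstract does not specify the scoring rule: using cosine similarity followed by a temperature-scaled softmax, the function names, and the default temperature are assumptions for illustration only. Feature vectors would come from some encoder (the abstract mentions DINO, CLIP, and ViT), which is not modeled here.

```python
import math


def cosine_similarity(u, v):
    # Cosine of the angle between two equal-length feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("feature vectors must be non-zero")
    return dot / (norm_u * norm_v)


def soft_attribution(query_feat, train_feats, temperature=0.1):
    """Return one score per training image; scores are positive and sum to 1.

    query_feat: feature vector of the generated image.
    train_feats: list of feature vectors, one per candidate training image.
    temperature: assumed softmax temperature; lower values sharpen the scores.
    """
    sims = [cosine_similarity(query_feat, feat) for feat in train_feats]
    peak = max(sims)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp((s - peak) / temperature) for s in sims]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Hypothetical 2D features for one generated image and three training images.
    print(soft_attribution([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]))
```

A distribution of this kind spreads credit across several plausible training images instead of committing to a single nearest neighbor, which is one way to express the uncertainty the abstract refers to.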

Ablating Concepts in Text-to-Image Diffusion Models

Mar 23, 2023

Nupur Kumari, Bingliang Zhang, Sheng-Yu Wang, Eli Shechtman, Richard Zhang, Jun-Yan Zhu

Abstract: Large-scale text-to-image diffusion models can generate high-fidelity images with powerful compositional ability. However, these models are typically trained on an enormous amount of Internet data, often containing copyrighted material, licensed images, and personal photos. Furthermore, they have been found to replicate the style of various living artists or memorize exact training samples. How can we remove such copyrighted concepts or images without retraining the model from scratch? To achieve this goal, we propose an efficient method of ablating concepts in the pretrained model, i.e., preventing the generation of a target concept. Our algorithm learns to match the image distribution for a target style, instance, or text prompt we wish to ablate to the distribution corresponding to an anchor concept. This prevents the model from generating target concepts given its text condition. Extensive experiments show that our method can successfully prevent the generation of the ablated concept while preserving closely related concepts in the model.

* project website: https://www.cs.cmu.edu/~concept-ablation/

Scaling up GANs for Text-to-Image Synthesis

Mar 09, 2023

Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, Taesung Park

Abstract: The recent success of text-to-image synthesis has taken the world by storm and captured the general public's imagination. From a technical standpoint, it also marked a drastic change in the favored architecture to design generative image models. GANs used to be the de facto choice, with techniques like StyleGAN. With DALL-E 2, auto-regressive and diffusion models became the new standard for large-scale generative models overnight. This rapid shift raises a fundamental question: can we scale up GANs to benefit from large datasets like LAION? We find that naïvely increasing the capacity of the StyleGAN architecture quickly becomes unstable. We introduce GigaGAN, a new GAN architecture that far exceeds this limit, demonstrating GANs as a viable option for text-to-image synthesis. GigaGAN offers three major advantages. First, it is orders of magnitude faster at inference time, taking only 0.13 seconds to synthesize a 512px image. Second, it can synthesize high-resolution images, for example, 16-megapixel images in 3.66 seconds. Finally, GigaGAN supports various latent space editing applications such as latent interpolation, style mixing, and vector arithmetic operations.

* CVPR 2023. Project webpage at https://mingukkang.github.io/GigaGAN/
