Taesung Park

Expressive Text-to-Image Generation with Rich Text

Apr 13, 2023
Songwei Ge, Taesung Park, Jun-Yan Zhu, Jia-Bin Huang

Plain text has become a prevalent interface for text-to-image synthesis. However, its limited customization options hinder users from accurately describing desired outputs. For example, plain text makes it hard to specify continuous quantities, such as a precise RGB color value or the importance of each word. Furthermore, detailed text prompts for complex scenes are tedious for humans to write and challenging for text encoders to interpret. To address these challenges, we propose using a rich-text editor supporting formats such as font style, size, color, and footnote. We extract each word's attributes from the rich text to enable local style control, explicit token reweighting, precise color rendering, and detailed region synthesis. We achieve these capabilities through a region-based diffusion process. We first obtain each word's region based on the cross-attention maps of a vanilla diffusion process using plain text. For each region, we enforce its text attributes by creating region-specific detailed prompts and applying region-specific guidance. We present various examples of image generation from rich text and demonstrate that our method outperforms strong baselines in quantitative evaluations.

* Project webpage: https://rich-text-to-image.github.io/ 
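To make the region-based diffusion process above concrete, here is a minimal sketch, assuming a hypothetical `denoise(x_t, t, prompt)` backbone and precomputed cross-attention maps; it illustrates how per-word rich-text attributes could drive region masks and region-specific prompts, and it is not the authors' released implementation.

```python
# Hypothetical sketch of region-based guidance driven by rich-text attributes.
# `denoise(x_t, t, prompt)` and the precomputed `attn_maps` stand in for a
# diffusion backbone and its cross-attention maps; they are assumptions, not
# the authors' API.
import torch

def region_masks_from_attention(attn_maps, tokens, threshold=0.3):
    """Binarize each formatted token's cross-attention map into a region mask."""
    masks = {}
    for tok in tokens:
        attn = attn_maps[tok]                                   # (H, W)
        attn = (attn - attn.min()) / (attn.max() - attn.min() + 1e-8)
        masks[tok] = (attn > threshold).float()
    return masks

def composed_noise_estimate(x_t, t, plain_prompt, rich_spans, denoise, attn_maps):
    """Blend region-specific predictions: each span's detailed prompt only
    affects the pixels its word attends to; the plain prompt fills the rest."""
    eps = denoise(x_t, t, plain_prompt)                         # vanilla plain-text pass
    masks = region_masks_from_attention(attn_maps, [s["token"] for s in rich_spans])
    for span in rich_spans:   # e.g. {"token": 7, "prompt": "a church, stained glass style"}
        m = masks[span["token"]][None, None]                    # (1, 1, H, W)
        eps_region = denoise(x_t, t, span["prompt"])            # region-specific guidance
        eps = eps * (1 - m) + eps_region * m
    return eps
```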

Scaling up GANs for Text-to-Image Synthesis

Mar 09, 2023
Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, Taesung Park

The recent success of text-to-image synthesis has taken the world by storm and captured the general public's imagination. From a technical standpoint, it also marked a drastic change in the favored architecture for designing generative image models. GANs used to be the de facto choice, with techniques like StyleGAN. With DALL-E 2, auto-regressive and diffusion models became the new standard for large-scale generative models overnight. This rapid shift raises a fundamental question: can we scale up GANs to benefit from large datasets like LAION? We find that naïvely increasing the capacity of the StyleGAN architecture quickly becomes unstable. We introduce GigaGAN, a new GAN architecture that far exceeds this limit, demonstrating GANs as a viable option for text-to-image synthesis. GigaGAN offers three major advantages. First, it is orders of magnitude faster at inference time, taking only 0.13 seconds to synthesize a 512px image. Second, it can synthesize high-resolution images, for example, 16-megapixel images in 3.66 seconds. Finally, GigaGAN supports various latent space editing applications such as latent interpolation, style mixing, and vector arithmetic operations.

* CVPR 2023. Project webpage at https://mingukkang.github.io/GigaGAN/ 
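The latent-space editing applications mentioned at the end of the abstract (interpolation, style mixing, vector arithmetic) are generic GAN operations; the sketch below illustrates them on a hypothetical text-conditional generator `G` and is not GigaGAN's actual interface.

```python
# Generic latent-space edits on a hypothetical text-conditional generator G;
# GigaGAN's real interface differs, and `fine_style` / `attr_direction` below
# are illustrative assumptions.
import torch

def lerp(z0, z1, alpha):
    """Linear interpolation between two latent codes."""
    return (1 - alpha) * z0 + alpha * z1

@torch.no_grad()
def latent_edits(G, text_emb, dim=128, device="cpu"):
    z_a = torch.randn(1, dim, device=device)
    z_b = torch.randn(1, dim, device=device)

    # 1) Latent interpolation: a smooth morph between two samples.
    morph = [G(lerp(z_a, z_b, a), text_emb) for a in torch.linspace(0, 1, 8)]

    # 2) Style mixing: coarse structure from z_a, fine appearance from z_b
    #    (assumes G exposes a separate fine-style input, which is hypothetical).
    mixed = G(z_a, text_emb, fine_style=z_b)

    # 3) Vector arithmetic: shift a sample along an attribute direction,
    #    e.g. one estimated from examples with and without the attribute.
    attr_direction = torch.randn(1, dim, device=device)         # placeholder direction
    edited = G(z_a + 2.0 * attr_direction, text_emb)
    return morph, mixed, edited
```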

Domain Expansion of Image Generators

Jan 12, 2023
Yotam Nitzan, Michaël Gharbi, Richard Zhang, Taesung Park, Jun-Yan Zhu, Daniel Cohen-Or, Eli Shechtman

Can one inject new concepts into an already trained generative model, while respecting its existing structure and knowledge? We propose a new task - domain expansion - to address this. Given a pretrained generator and novel (but related) domains, we expand the generator to jointly model all domains, old and new, harmoniously. First, we note the generator contains a meaningful, pretrained latent space. Is it possible to minimally perturb this hard-earned representation, while maximally representing the new domains? Interestingly, we find that the latent space offers unused, "dormant" directions, which do not affect the output. This provides an opportunity: By "repurposing" these directions, we can represent new domains without perturbing the original representation. In fact, we find that pretrained generators have the capacity to add several - even hundreds - of new domains! Using our expansion method, one "expanded" model can supersede numerous domain-specific models, without expanding the model size. Additionally, a single expanded generator natively supports smooth transitions between domains, as well as composition of domains. Code and project page available at https://yotamnitzan.github.io/domain-expansion/.

* Project Page and code are available at https://yotamnitzan.github.io/domain-expansion/ 
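As a rough illustration of the "dormant directions" observation, the probe below ranks random latent directions by how little a perturbation along them changes the generator's output; the least sensitive ones are candidates to repurpose for new domains. This is a naive sketch, not the paper's procedure.

```python
# Naive probe for "dormant" latent directions: directions along which
# perturbing z barely changes the output. A sketch only; the paper's actual
# method for identifying and repurposing directions is different.
import torch

@torch.no_grad()
def find_dormant_directions(G, dim=512, n_probes=64, n_samples=8,
                            step=3.0, keep=10, device="cpu"):
    directions = torch.randn(n_probes, dim, device=device)
    directions = directions / directions.norm(dim=1, keepdim=True)
    z = torch.randn(n_samples, dim, device=device)
    base = G(z)                                     # (n_samples, C, H, W)

    sensitivity = []
    for d in directions:
        perturbed = G(z + step * d)                 # move every sample along d
        sensitivity.append((perturbed - base).abs().mean().item())

    # The least sensitive directions are candidates to "repurpose":
    # a new domain can be assigned its own offset along one of them.
    order = sorted(range(n_probes), key=lambda i: sensitivity[i])
    return directions[order[:keep]]
```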

ASSET: Autoregressive Semantic Scene Editing with Transformers at High Resolutions

May 24, 2022
Difan Liu, Sandesh Shetty, Tobias Hinz, Matthew Fisher, Richard Zhang, Taesung Park, Evangelos Kalogerakis

We present ASSET, a neural architecture for automatically modifying an input high-resolution image according to a user's edits on its semantic segmentation map. Our architecture is based on a transformer with a novel attention mechanism. Our key idea is to sparsify the transformer's attention matrix at high resolutions, guided by dense attention extracted at lower image resolutions. While previous attention mechanisms are either computationally too expensive to handle high-resolution images or overly constrained within specific image regions, hampering long-range interactions, our novel attention mechanism is both computationally efficient and effective. Our sparsified attention mechanism is able to capture long-range interactions and context, leading to the synthesis of interesting phenomena in scenes, such as reflections of landscapes onto water or flora consistent with the rest of the landscape, that were not possible to generate reliably with previous convnet and transformer approaches. We present qualitative and quantitative results, along with user studies, demonstrating the effectiveness of our method.

* SIGGRAPH 2022 - Journal Track 
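To give a feel for the guided sparsification, the simplified sketch below computes dense attention over pooled low-resolution tokens, keeps the top-k coarse blocks per query block, and restricts high-resolution attention to those blocks. It is a stand-in under simplifying assumptions (1D blocking, single head), not ASSET's exact mechanism.

```python
# Simplified sketch: sparsify high-resolution attention using a dense
# low-resolution attention map as guidance (1D blocking, single head).
# This is not ASSET's exact mechanism.
import torch

def guided_sparse_attention(q_hi, k_hi, v_hi, q_lo, k_lo, top_k=8, block=16):
    """q_hi, k_hi, v_hi: (B, N_hi, D) high-res tokens, with N_hi = N_lo * block.
    q_lo, k_lo: (B, N_lo, D) pooled low-res tokens."""
    B, N_hi, D = q_hi.shape
    N_lo = k_lo.shape[1]

    # Dense attention at low resolution decides which coarse blocks matter.
    attn_lo = torch.softmax(q_lo @ k_lo.transpose(1, 2) / D ** 0.5, dim=-1)  # (B, N_lo, N_lo)
    top_blocks = attn_lo.topk(top_k, dim=-1).indices                         # (B, N_lo, top_k)

    out = torch.zeros_like(v_hi)
    for b in range(B):
        for i in range(N_lo):                               # i-th block of high-res queries
            q_idx = slice(i * block, (i + 1) * block)
            # High-res keys/values belonging to the selected coarse blocks only.
            k_idx = torch.cat([torch.arange(j * block, (j + 1) * block, device=k_hi.device)
                               for j in top_blocks[b, i].tolist()])
            attn = torch.softmax(q_hi[b, q_idx] @ k_hi[b, k_idx].T / D ** 0.5, dim=-1)
            out[b, q_idx] = attn @ v_hi[b, k_idx]
    return out
```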

BlobGAN: Spatially Disentangled Scene Representations

May 05, 2022
Dave Epstein, Taesung Park, Richard Zhang, Eli Shechtman, Alexei A. Efros

We propose an unsupervised, mid-level representation for a generative model of scenes. The representation is mid-level in that it is neither per-pixel nor per-image; rather, scenes are modeled as a collection of spatial, depth-ordered "blobs" of features. Blobs are differentiably placed onto a feature grid that is decoded into an image by a generative adversarial network. Due to the spatial uniformity of blobs and the locality inherent to convolution, our network learns to associate different blobs with different entities in a scene and to arrange these blobs to capture scene layout. We demonstrate this emergent behavior by showing that, despite training without any supervision, our method enables applications such as easy manipulation of objects within a scene (e.g., moving, removing, and restyling furniture), creation of feasible scenes given constraints (e.g., plausible rooms with drawers at a particular location), and parsing of real-world images into constituent parts. On a challenging multi-category dataset of indoor scenes, BlobGAN outperforms StyleGAN2 in image quality as measured by FID. See our project page for video results and interactive demo: http://www.dave.ml/blobgan

* Project webpage available at http://www.dave.ml/blobgan 
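A minimal sketch of "blobs splatted onto a feature grid", assuming isotropic Gaussian blobs with per-blob feature vectors; BlobGAN's actual parameterization (elliptical blobs, depth ordering, a style-based decoder) is richer than this toy version.

```python
# Minimal sketch: differentiably splat Gaussian "blobs" with per-blob features
# onto a 2D feature grid. BlobGAN's real parameterization is richer than this.
import torch

def splat_blobs(centers, scales, features, grid_size=64):
    """
    centers:  (K, 2) blob centers in [0, 1] x [0, 1]
    scales:   (K,)   blob sizes
    features: (K, D) per-blob feature vectors
    Returns a (D, grid_size, grid_size) feature grid.
    """
    ys, xs = torch.meshgrid(
        torch.linspace(0, 1, grid_size),
        torch.linspace(0, 1, grid_size),
        indexing="ij",
    )
    coords = torch.stack([xs, ys], dim=-1)                       # (H, W, 2)
    # Gaussian opacity of each blob at each grid location.
    d2 = ((coords[None] - centers[:, None, None]) ** 2).sum(-1)  # (K, H, W)
    alpha = torch.exp(-d2 / (2 * scales[:, None, None] ** 2))    # (K, H, W)
    # Normalized soft assignment so overlapping blobs compete for each pixel.
    weights = alpha / (alpha.sum(0, keepdim=True) + 1e-8)
    grid = torch.einsum("khw,kd->dhw", weights, features)
    return grid

# Example: three random blobs decoded into a 64x64, 32-channel feature grid.
grid = splat_blobs(torch.rand(3, 2), torch.full((3,), 0.1), torch.randn(3, 32))
```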

Contrastive Feature Loss for Image Prediction

Nov 12, 2021
Alex Andonian, Taesung Park, Bryan Russell, Phillip Isola, Jun-Yan Zhu, Richard Zhang

Training supervised image synthesis models requires a critic to compare two images: the ground truth and the result. Yet, this basic functionality remains an open problem. A popular line of approaches uses the L1 (mean absolute error) loss, either in the pixel space or in the feature space of pretrained deep networks. However, we observe that these losses tend to produce overly blurry and grey images, and other techniques such as GANs need to be employed to fight these artifacts. In this work, we introduce an information-theoretic approach to measuring the similarity between two images. We argue that a good reconstruction should have high mutual information with the ground truth. This view enables learning a lightweight critic to "calibrate" a feature space in a contrastive manner, such that reconstructions of corresponding spatial patches are brought together, while other patches are repulsed. We show that our formulation immediately boosts the perceptual realism of output images when used as a drop-in replacement for the L1 loss, with or without an additional GAN loss.

* Appeared in Advances in Image Manipulation Workshop at ICCV 2021. GitHub: https://github.com/alexandonian/contrastive-feature-loss 
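A minimal sketch of the patch-level contrastive idea, with a 1x1 convolution standing in for the learned lightweight critic that "calibrates" the feature space; it is a simplification of the loss described above, not the released implementation.

```python
# Sketch of a patch-wise contrastive (InfoNCE-style) reconstruction loss:
# corresponding spatial patches of the prediction and ground truth are pulled
# together, while other patches in the same feature map act as negatives.
# A simplification, not the paper's released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchContrastiveLoss(nn.Module):
    def __init__(self, feat_dim, proj_dim=256, temperature=0.07):
        super().__init__()
        self.proj = nn.Conv2d(feat_dim, proj_dim, kernel_size=1)  # lightweight critic
        self.temperature = temperature

    def forward(self, feat_pred, feat_gt, n_patches=256):
        """feat_pred, feat_gt: (B, C, H, W) features of prediction and ground truth."""
        z_p = F.normalize(self.proj(feat_pred).flatten(2), dim=1)  # (B, D, H*W)
        z_g = F.normalize(self.proj(feat_gt).flatten(2), dim=1)
        B, D, N = z_p.shape
        idx = torch.randperm(N, device=z_p.device)[:min(n_patches, N)]
        z_p, z_g = z_p[:, :, idx], z_g[:, :, idx]                  # sampled patches

        # Similarity of every predicted patch to every ground-truth patch;
        # the matching (diagonal) patch is the positive, the rest are negatives.
        logits = torch.bmm(z_p.transpose(1, 2), z_g) / self.temperature  # (B, P, P)
        target = torch.arange(logits.shape[1], device=logits.device)
        return F.cross_entropy(logits.flatten(0, 1), target.repeat(B))
```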

A Customizable Dynamic Scenario Modeling and Data Generation Platform for Autonomous Driving

Nov 30, 2020
Jay Shenoy, Edward Kim, Xiangyu Yue, Taesung Park, Daniel Fremont, Alberto Sangiovanni-Vincentelli, Sanjit Seshia

Safely interacting with humans is a significant challenge for autonomous driving. The performance of this interaction depends on machine learning-based modules of an autopilot, such as perception, behavior prediction, and planning. These modules require training datasets with high-quality labels and a diverse range of realistic dynamic behaviors. Consequently, training such modules to handle rare scenarios is difficult because they are, by definition, rarely represented in real-world datasets. Hence, there is a practical need to augment datasets with synthetic data covering these rare scenarios. In this paper, we present a platform to model dynamic and interactive scenarios, generate the scenarios in simulation with different modalities of labeled sensor data, and collect this information for data augmentation. To our knowledge, this is the first integrated platform for these tasks specialized to the autonomous driving domain.

Contrastive Learning for Unpaired Image-to-Image Translation

Aug 20, 2020
Taesung Park, Alexei A. Efros, Richard Zhang, Jun-Yan Zhu

In image-to-image translation, each patch in the output should reflect the content of the corresponding patch in the input, independent of domain. We propose a straightforward method for doing so -- maximizing mutual information between the two, using a framework based on contrastive learning. The method encourages two elements (corresponding patches) to map to similar points in a learned feature space, relative to other elements (other patches) in the dataset, referred to as negatives. We explore several critical design choices for making contrastive learning effective in the image synthesis setting. Notably, we use a multilayer, patch-based approach, rather than operating on entire images. Furthermore, we draw negatives from within the input image itself, rather than from the rest of the dataset. We demonstrate that our framework enables one-sided translation in the unpaired image-to-image translation setting, while improving quality and reducing training time. In addition, our method can even be extended to the training setting where each "domain" is only a single image.

* ECCV 2020. Please visit https://taesungp.github.io/ContrastiveUnpairedTranslation/ for introduction videos and more. v3 contains typo fixes and citation update 
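Here is a minimal sketch of the multilayer, patch-based contrastive objective with negatives drawn from within the same image; the layer widths and MLP heads are illustrative assumptions, not the official implementation linked above.

```python
# Sketch of the multilayer, patch-based contrastive objective: for each spatial
# location, the output patch's feature should match the input patch's feature
# at the same location, with other locations of the *same* input image serving
# as negatives. Layer widths and MLP heads are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def patch_nce(feats_in, feats_out, heads, n_patches=256, tau=0.07):
    """feats_in / feats_out: lists of (B, C_l, H_l, W_l) features taken from
    several encoder layers of the input and of the translated output."""
    total = 0.0
    for f_in, f_out, head in zip(feats_in, feats_out, heads):
        B, C, H, W = f_in.shape
        idx = torch.randperm(H * W, device=f_in.device)[:min(n_patches, H * W)]
        q = F.normalize(head(f_out.flatten(2)[:, :, idx].transpose(1, 2)), dim=-1)  # (B, P, D)
        k = F.normalize(head(f_in.flatten(2)[:, :, idx].transpose(1, 2)), dim=-1)
        logits = torch.bmm(q, k.transpose(1, 2)) / tau            # internal negatives only
        target = torch.arange(logits.shape[1], device=logits.device).repeat(B)
        total = total + F.cross_entropy(logits.flatten(0, 1), target)
    return total / len(heads)

# One small MLP head per encoder layer, e.g. for channel widths 128 and 256.
heads = nn.ModuleList([nn.Sequential(nn.Linear(c, 256), nn.ReLU(), nn.Linear(256, 256))
                       for c in (128, 256)])
```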