
Zhihong Pan


Effective Real Image Editing with Accelerated Iterative Diffusion Inversion

Sep 10, 2023
Zhihong Pan, Riccardo Gherardi, Xiufeng Xie, Stephen Huang


Despite all recent progress, it is still challenging to edit and manipulate natural images with modern generative models. When using a Generative Adversarial Network (GAN), one major hurdle is the inversion process that maps a real image to its corresponding noise vector in the latent space, since it is necessary to be able to reconstruct an image in order to edit its contents. Likewise for Denoising Diffusion Implicit Models (DDIM), the linearization assumption in each inversion step makes the whole deterministic inversion process unreliable. Existing approaches that have tackled the problem of inversion stability often incur significant trade-offs in computational efficiency. In this work we propose an Accelerated Iterative Diffusion Inversion method, dubbed AIDI, that significantly improves reconstruction accuracy with minimal additional overhead in space and time complexity. By using a novel blended guidance technique, we show that effective results can be obtained on a large range of image editing tasks without large classifier-free guidance in inversion. Furthermore, when compared with other diffusion inversion based works, our proposed process is shown to be more robust for fast image editing in the regimes of 10 and 20 diffusion steps.
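To make the iterative-inversion idea concrete, the sketch below solves a single DDIM inversion step by plain fixed-point iteration instead of the usual linearization. It is an illustration only, not the paper's exact algorithm (which also covers accelerating the iteration); `eps_model` is assumed to be a noise-prediction network and `alphas_cumprod` a 1-D tensor of cumulative alphas.

```python
import torch

@torch.no_grad()
def fixed_point_ddim_inversion_step(eps_model, x_t, t, t_next, alphas_cumprod, n_iter=5):
    """Solve one DDIM inversion step x_t -> x_{t_next} (t_next > t) iteratively.

    The implicit inversion equation is
        x_next = c_x * x_t + c_eps * eps_model(x_next, t_next),
    which plain DDIM inversion approximates by evaluating eps at x_t instead.
    """
    a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
    c_x = (a_next / a_t).sqrt()
    c_eps = (1 - a_next).sqrt() - (a_next * (1 - a_t) / a_t).sqrt()

    x_next = x_t  # initial guess; more iterations tighten the reconstruction
    for _ in range(n_iter):
        x_next = c_x * x_t + c_eps * eps_model(x_next, t_next)
    return x_next
```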

* Accepted to ICCV 2023 (Oral) 

HollowNeRF: Pruning Hashgrid-Based NeRFs with Trainable Collision Mitigation

Aug 19, 2023
Xiufeng Xie, Riccardo Gherardi, Zhihong Pan, Stephen Huang


Neural radiance fields (NeRF) have garnered significant attention, with recent works such as Instant-NGP accelerating NeRF training and evaluation through a combination of hashgrid-based positional encoding and neural networks. However, effectively leveraging the spatial sparsity of 3D scenes remains a challenge. To cull away unnecessary regions of the feature grid, existing solutions rely on prior knowledge of object shape or periodically estimate object shape during training by repeated model evaluations, which are costly and wasteful. To address this issue, we propose HollowNeRF, a novel compression solution for hashgrid-based NeRF which automatically sparsifies the feature grid during the training phase. Instead of directly compressing dense features, HollowNeRF trains a coarse 3D saliency mask that guides efficient feature pruning, and employs an alternating direction method of multipliers (ADMM) pruner to sparsify the 3D saliency mask during training. By exploiting the sparsity in the 3D scene to redistribute hash collisions, HollowNeRF improves rendering quality while using a fraction of the parameters of comparable state-of-the-art solutions, leading to a better cost-accuracy trade-off. Our method delivers comparable rendering quality to Instant-NGP, while utilizing just 31% of the parameters. In addition, our solution can achieve a PSNR accuracy gain of up to 1 dB using only 56% of the parameters.
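As a concrete picture of what an ADMM pruner over a trainable saliency mask can look like, here is a generic alternating-direction sketch, not HollowNeRF's actual code: `mask` is assumed to be a `torch.nn.Parameter` holding the coarse 3D saliency grid, the `penalty()` term is added to the NeRF training loss, and `dual_update()` is called periodically to push the mask toward sparsity.

```python
import torch

def soft_threshold(x, thresh):
    """Proximal operator of the L1 norm (element-wise soft-thresholding)."""
    return torch.sign(x) * torch.clamp(x.abs() - thresh, min=0.0)

class ADMMSparsifier:
    """Illustrative ADMM-style sparsifier for a trainable 3D saliency mask."""

    def __init__(self, mask, rho=1e-2, l1_weight=1e-3):
        self.mask, self.rho, self.l1_weight = mask, rho, l1_weight
        self.z = mask.detach().clone()    # auxiliary (sparse) copy of the mask
        self.u = torch.zeros_like(mask)   # scaled dual variable

    def penalty(self):
        """Augmented-Lagrangian term added to the rendering loss."""
        return 0.5 * self.rho * ((self.mask - self.z + self.u) ** 2).sum()

    @torch.no_grad()
    def dual_update(self):
        """Periodic z / u updates (e.g. once every few hundred iterations)."""
        self.z = soft_threshold(self.mask + self.u, self.l1_weight / self.rho)
        self.u += self.mask - self.z
```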

* Accepted to ICCV 2023 

GBSD: Generative Bokeh with Stage Diffusion

Jun 14, 2023
Jieren Deng, Xin Zhou, Hao Tian, Zhihong Pan, Derek Aguiar


The bokeh effect is an artistic technique that blurs out-of-focus areas in a photograph and has gained interest due to recent developments in text-to-image synthesis and the ubiquity of smartphone cameras and photo-sharing apps. Prior work on rendering bokeh effects has focused on post hoc image manipulation to produce similar blurring effects in existing photographs using classical computer graphics or neural rendering techniques, but these methods either suffer from depth discontinuity artifacts or are restricted to reproducing bokeh effects that are present in the training data. More recent diffusion-based models can synthesize images with an artistic style, but either require the generation of high-dimensional masks, expensive fine-tuning, or affect global image characteristics. In this paper, we present GBSD, the first generative text-to-image model that synthesizes photorealistic images with a bokeh style. Motivated by how image synthesis occurs progressively in diffusion models, our approach combines latent diffusion models with a 2-stage conditioning algorithm to render bokeh effects on semantically defined objects. Since we can focus the effect on objects, this semantic bokeh effect is more versatile than classical rendering techniques. We evaluate GBSD both quantitatively and qualitatively and demonstrate its ability to be applied in both text-to-image and image-to-image settings.
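The staging idea can be pictured as switching the conditioning partway through the reverse process. The sketch below is a hypothetical simplification of such 2-stage conditioning and does not reproduce the paper's per-object, semantically defined staging; `denoise_step(x, t, cond)` is an assumed single reverse-diffusion step.

```python
import torch

@torch.no_grad()
def two_stage_sampling(denoise_step, x_T, cond_content, cond_style, timesteps, switch_frac=0.5):
    """Run a sampling loop where early steps use the content prompt and
    later steps use a style (e.g. bokeh) prompt."""
    x = x_T
    switch_at = int(len(timesteps) * switch_frac)
    for i, t in enumerate(timesteps):
        cond = cond_content if i < switch_at else cond_style
        x = denoise_step(x, t, cond)
    return x
```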


Fast Diffusion Probabilistic Model Sampling through the lens of Backward Error Analysis

Apr 22, 2023
Yansong Gao, Zhihong Pan, Xin Zhou, Le Kang, Pratik Chaudhari


Denoising diffusion probabilistic models (DDPMs) are a class of powerful generative models. The past few years have witnessed the great success of DDPMs in generating high-fidelity samples. A significant limitation of DDPMs is the slow sampling procedure: they generally need hundreds or thousands of sequential function evaluations (steps) of neural networks to generate a sample. This paper aims to develop a fast sampling method for DDPMs requiring far fewer steps while retaining high sample quality. The inference process of DDPMs approximates solving the corresponding diffusion ordinary differential equations (diffusion ODEs) in the continuous limit. This work analyzes how the backward error affects the diffusion ODEs and the sample quality in DDPMs. We propose fast sampling through the Restricting Backward Error (RBE) schedule, based on dynamically moderating the long-time backward error. Our method accelerates DDPMs without any further training. Our experiments show that sampling with an RBE schedule generates high-quality samples within only 8 to 20 function evaluations on various benchmark datasets. We achieved 12.01 FID in 8 function evaluations on ImageNet $128\times128$, and a $20\times$ speedup compared with previous baseline samplers.
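To ground the idea of choosing sampling steps from an error criterion, the sketch below uses a standard step-doubling estimate of the local error of an Euler step on a generic probability-flow drift `f(x, t)` and keeps it below a tolerance. This only illustrates error-controlled step selection in general; it is not the RBE schedule derived in the paper.

```python
import torch

@torch.no_grad()
def error_bounded_schedule(f, x, t_start, t_end, tol=1e-2, h_init=0.05):
    """Integrate a diffusion ODE from t_start down to t_end with adaptive steps.

    `f(x, t)` is an assumed drift (dx/dt); the per-step error is estimated by
    comparing one Euler step of size h against two steps of size h/2.
    """
    t, h, schedule = t_start, h_init, [t_start]
    while t > t_end:
        h = min(h, t - t_end)
        full = x - h * f(x, t)                                # one step of size h
        half = x - 0.5 * h * f(x, t)
        two_half = half - 0.5 * h * f(half, t - 0.5 * h)      # two steps of size h/2
        err = (full - two_half).abs().max().item()
        if err > tol:
            h *= 0.5                                          # refine and retry
            continue
        x, t = two_half, t - h
        schedule.append(t)
        if err < 0.25 * tol:
            h *= 2.0                                          # allow larger steps
    return x, schedule
```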

* arXiv admin note: text overlap with arXiv:2101.12176 by other authors 

Raising The Limit Of Image Rescaling Using Auxiliary Encoding

Mar 12, 2023
Chenzhong Yin, Zhihong Pan, Xin Zhou, Le Kang, Paul Bogdan


Normalizing flow models using invertible neural networks (INN) have been widely investigated for successful generative image super-resolution (SR) by learning the transformation between the normal distribution of latent variable $z$ and the conditional distribution of high-resolution (HR) images given a low-resolution (LR) input. Recently, image rescaling models like IRN utilize the bidirectional nature of INN to push the performance limit of image upscaling by optimizing the downscaling and upscaling steps jointly. While the random sampling of latent variable $z$ is useful in generating diverse photo-realistic images, it is not desirable for image rescaling when accurate restoration of the HR image is more important. Hence, in place of random sampling of $z$, we propose auxiliary encoding modules to further push the limit of image rescaling performance. Two options to store the encoded latent variables in downscaled LR images, both readily supported by existing image file formats, are proposed. One saves them as the alpha channel, the other saves them as metadata in the image header, and the corresponding modules are denoted with suffixes -A and -M respectively. Optimal network architectural changes are investigated for both options to demonstrate their effectiveness in raising the rescaling performance limit on different baseline models including IRN and DLV-IRN.
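As a minimal illustration of the "-A" storage option, the snippet below packs an 8-bit quantized latent map into the alpha channel of the LR PNG and reads it back. It is a sketch only: in the paper the latent is produced and consumed by learned encoding modules, whereas here `latent` is simply assumed to be any array in [0, 1] with the same spatial size as the LR image.

```python
import numpy as np
from PIL import Image

def save_lr_with_latent_alpha(lr_rgb, latent, path):
    """Store an 8-bit quantized latent map in the alpha channel of an LR PNG.

    lr_rgb: float array (H, W, 3) in [0, 1]; latent: float array (H, W) in [0, 1].
    """
    lr_u8 = (np.clip(lr_rgb, 0.0, 1.0) * 255).round().astype(np.uint8)
    z_u8 = (np.clip(latent, 0.0, 1.0) * 255).round().astype(np.uint8)
    rgba = np.dstack([lr_u8, z_u8])                     # (H, W, 4)
    Image.fromarray(rgba, mode="RGBA").save(path)

def load_lr_and_latent(path):
    """Recover the LR image and the stored latent from the RGBA PNG."""
    rgba = np.asarray(Image.open(path).convert("RGBA"), dtype=np.float32) / 255.0
    return rgba[..., :3], rgba[..., 3]
```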


Smooth and Stepwise Self-Distillation for Object Detection

Mar 09, 2023
Jieren Deng, Xin Zhou, Hao Tian, Zhihong Pan, Derek Aguiar


Distilling the structured information captured in feature maps has contributed to improved results for object detection tasks, but requires careful selection of baseline architectures and substantial pre-training. Self-distillation addresses these limitations and has recently achieved state-of-the-art performance for object detection despite making several simplifying architectural assumptions. Building on this work, we propose Smooth and Stepwise Self-Distillation (SSSD) for object detection. Our SSSD architecture forms an implicit teacher from object labels and a feature pyramid network backbone to distill label-annotated feature maps using the Jensen-Shannon distance, which is smoother than distillation losses used in prior work. We also introduce a distillation coefficient that is adaptively configured based on the learning rate. We extensively benchmark SSSD against a baseline and two state-of-the-art object detector architectures on the COCO dataset by varying the coefficients and the backbone and detector networks. We demonstrate that SSSD achieves higher average precision in most experimental settings, is robust to a wide range of coefficients, and benefits from our stepwise distillation procedure.
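For reference, here is a generic Jensen-Shannon distillation term between teacher and student feature maps. Flattening the maps and turning them into distributions with a temperature-scaled softmax is an assumption of this sketch; the paper's label-based masking and the adaptive, learning-rate-dependent coefficient are omitted.

```python
import math
import torch
import torch.nn.functional as F

def jensen_shannon_distillation(student_feat, teacher_feat, tau=1.0):
    """JS divergence between student and teacher feature maps of shape (N, C, H, W)."""
    s = F.log_softmax(student_feat.flatten(1) / tau, dim=1)   # log P (student)
    t = F.log_softmax(teacher_feat.flatten(1) / tau, dim=1)   # log Q (teacher)
    m = torch.logsumexp(torch.stack([s, t]), dim=0) - math.log(2.0)  # log of the mixture M
    kl_sm = F.kl_div(m, s, log_target=True, reduction="batchmean")   # KL(P || M)
    kl_tm = F.kl_div(m, t, log_target=True, reduction="batchmean")   # KL(Q || M)
    return 0.5 * (kl_sm + kl_tm)
```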


Extreme Generative Image Compression by Learning Text Embedding from Diffusion Models

Nov 14, 2022
Zhihong Pan, Xin Zhou, Hao Tian


Transferring large amounts of high-resolution images over limited bandwidth is an important but very challenging task. Compressing images at extremely low bitrates (<0.1 bpp) has been studied, but it often results in low-quality images with heavy artifacts due to the strong constraint on the number of bits available for the compressed data. It is often said that a picture is worth a thousand words, but on the other hand, language is very powerful in capturing the essence of an image using short descriptions. With the recent success of diffusion models for text-to-image generation, we propose a generative image compression method that demonstrates the potential of saving an image as a short text embedding, which in turn can be used to generate high-fidelity images that are perceptually equivalent to the original. For a given image, its corresponding text embedding is learned using the same optimization process as the text-to-image diffusion model itself, using a learnable text embedding as input after bypassing the original transformer. The optimization is applied together with a learned compression model to achieve extreme compression at low bitrates (<0.1 bpp). Based on our experiments, measured by a comprehensive set of image quality metrics, our method outperforms other state-of-the-art deep learning methods in terms of both perceptual quality and diversity.
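The core optimization resembles textual inversion: freeze the diffusion model and fit a small embedding so that the denoising loss reconstructs the target image. The sketch below is only an illustration under that reading; `unet(x_t, t, emb)` and `vae_encode(image)` are assumed interfaces to a frozen latent-diffusion model, and the learned compression of the embedding itself is omitted.

```python
import torch

def learn_text_embedding(unet, vae_encode, image, alphas_cumprod,
                         emb_dim=768, n_tokens=8, steps=1000, lr=1e-3, device="cuda"):
    """Optimize a compact text-embedding sequence so a frozen diffusion model
    reconstructs `image`; `alphas_cumprod` is a 1-D tensor of cumulative alphas."""
    latent = vae_encode(image).detach()
    emb = torch.randn(1, n_tokens, emb_dim, device=device, requires_grad=True)
    opt = torch.optim.Adam([emb], lr=lr)
    for _ in range(steps):
        t = torch.randint(0, len(alphas_cumprod), (1,), device=device)
        a_t = alphas_cumprod[t].view(1, 1, 1, 1)
        noise = torch.randn_like(latent)
        x_t = a_t.sqrt() * latent + (1 - a_t).sqrt() * noise   # forward diffusion
        loss = ((unet(x_t, t, emb) - noise) ** 2).mean()       # standard eps-prediction loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return emb.detach()
```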


Arbitrary Style Guidance for Enhanced Diffusion-Based Text-to-Image Generation

Nov 14, 2022
Zhihong Pan, Xin Zhou, Hao Tian


Diffusion-based text-to-image generation models like GLIDE and DALLE-2 have gained wide success recently for their superior performance in turning complex text inputs into images of high quality and wide diversity. In particular, they are proven to be very powerful in creating graphic arts of various formats and styles. Although current models support specifying style formats like oil painting or pencil drawing, fine-grained style features like color distributions and brush strokes are hard to specify, as they are randomly picked from a conditional distribution based on the given text input. Here we propose a novel style guidance method to support generating images in an arbitrary style guided by a reference image. The generation method does not require a separate style transfer model to generate the desired styles while maintaining image quality in generated content as controlled by the text input. Additionally, the guidance method can be applied without a style reference, denoted as self style guidance, to generate images of more diverse styles. Comprehensive experiments show that the proposed method remains robust and effective in a wide range of conditions, including diverse graphic art forms, image content types and diffusion models.
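One way to picture guidance from a reference style is as an extra direction added to the classifier-free-guidance update on the predicted noise. The composition below is hypothetical and not the paper's exact formulation; each `emb_*` is an assumed conditioning embedding (null, text, and style-reference) and `eps_model` the noise-prediction network.

```python
import torch

@torch.no_grad()
def guided_eps(eps_model, x_t, t, emb_null, emb_text, emb_style, w_text=7.5, w_style=3.0):
    """Combine a text-guidance term and a style-guidance term on the predicted noise."""
    e_null = eps_model(x_t, t, emb_null)
    e_text = eps_model(x_t, t, emb_text)
    e_style = eps_model(x_t, t, emb_style)
    return e_null + w_text * (e_text - e_null) + w_style * (e_style - e_null)
```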

* To appear at WACV 2023 

Effective Invertible Arbitrary Image Rescaling

Sep 26, 2022
Zhihong Pan, Baopu Li, Dongliang He, Wenhao Wu, Errui Ding


Great successes have been achieved using deep learning techniques for image super-resolution (SR) with fixed scales. To increase its real-world applicability, numerous models have also been proposed to restore SR images with arbitrary scale factors, including asymmetric ones where images are resized to different scales along the horizontal and vertical directions. Though most models are only optimized for the unidirectional upscaling task while assuming a predefined downscaling kernel for low-resolution (LR) inputs, recent models based on Invertible Neural Networks (INN) are able to increase upscaling accuracy significantly by optimizing the downscaling and upscaling cycle jointly. However, limited by the INN architecture, they are constrained to fixed integer scale factors and require one model for each scale. In this work, a simple and effective invertible arbitrary rescaling network (IARN) is proposed to achieve arbitrary image rescaling by training only one model, without increasing model complexity. Using innovative components like position-aware scale encoding and preemptive channel splitting, the network is optimized to convert the non-invertible rescaling cycle into an effectively invertible process. It is shown to achieve state-of-the-art (SOTA) performance in bidirectional arbitrary rescaling without compromising perceptual quality in LR outputs. It is also demonstrated to perform well on tests with asymmetric scales using the same network architecture.
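To give "position-aware scale encoding" a concrete shape, the sketch below builds, under one hypothetical reading of the idea, a per-pixel feature map carrying each output pixel's fractional offset on the source grid together with the (possibly asymmetric) scale factors; the names and exact encoding are assumptions, not the paper's implementation.

```python
import torch

def position_aware_scale_encoding(h_out, w_out, scale_h, scale_w):
    """Return a (4, H, W) conditioning map of fractional offsets and scale factors."""
    ys = (torch.arange(h_out) + 0.5) * scale_h          # continuous source-row coordinates
    xs = (torch.arange(w_out) + 0.5) * scale_w          # continuous source-column coordinates
    off_y = (ys - ys.floor()).view(-1, 1).expand(h_out, w_out)
    off_x = (xs - xs.floor()).view(1, -1).expand(h_out, w_out)
    s_y = torch.full((h_out, w_out), float(scale_h))
    s_x = torch.full((h_out, w_out), float(scale_w))
    return torch.stack([off_y, off_x, s_y, s_x], dim=0)
```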


Enhancing Image Rescaling using Dual Latent Variables in Invertible Neural Network

Jul 24, 2022
Min Zhang, Zhihong Pan, Xin Zhou, C. -C. Jay Kuo


Normalizing flow models have been used successfully for generative image super-resolution (SR) by mapping the complex distribution of natural images to a simple tractable distribution in latent space through Invertible Neural Networks (INN). These models can generate multiple realistic SR images from one low-resolution (LR) input using randomly sampled points in the latent space, simulating the ill-posed nature of image upscaling where multiple high-resolution (HR) images correspond to the same LR input. Lately, the invertible process in INN has also been used successfully by bidirectional image rescaling models like IRN and HCFlow for joint optimization of downscaling and inverse upscaling, resulting in significant improvements in upscaled image quality. While they are optimized for image downscaling too, the ill-posed nature of image downscaling, where one HR image could be downsized to multiple LR images depending on different interpolation kernels and resampling methods, is not considered. A new downscaling latent variable, in addition to the original one representing uncertainties in image upscaling, is introduced to model variations in the image downscaling process. This dual latent variable enhancement is applicable to different image rescaling models, and extensive experiments show that it can improve image upscaling accuracy consistently without sacrificing image quality in downscaled LR images. It is also shown to be effective in enhancing other INN-based models for image restoration applications like image hiding.
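A rough way to picture the dual-latent cycle is an invertible network whose forward pass emits the LR image plus two latents, and whose inverse upscales from latents resampled from the prior, as at test time. The training step below is a hypothetical sketch under that interface: `inn.forward` / `inn.inverse` and the simple Gaussian penalty on the latents are assumptions, not the paper's exact losses.

```python
import torch

def dual_latent_training_step(inn, hr, opt):
    """One illustrative rescaling training step with dual latent variables.

    Assumed interface: inn.forward(hr) -> (lr, z_up, z_down);
    inn.inverse(lr, z_up, z_down) -> reconstructed HR. z_down models variation
    in the downscaling (kernel/resampling), z_up the usual upscaling uncertainty.
    """
    lr, z_up, z_down = inn.forward(hr)
    # Upscale with latents resampled from the prior, mirroring test-time use.
    hr_rec = inn.inverse(lr, torch.randn_like(z_up), torch.randn_like(z_down))
    loss = (hr_rec - hr).abs().mean() \
        + 0.5 * (z_up.pow(2).mean() + z_down.pow(2).mean())   # push latents toward N(0, I)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```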

* Accepted by ACM Multimedia 2022 