Xuaner Zhang

Learning to Refocus with Video Diffusion Models

Dec 24, 2025

SaiKiran Tedla, Zhoutong Zhang, Xuaner Zhang, Shumian Xin

Abstract: Focus is a cornerstone of photography, yet autofocus systems often fail to capture the intended subject, and users frequently wish to adjust focus after capture. We introduce a novel method for realistic post-capture refocusing using video diffusion models. From a single defocused image, our approach generates a perceptually accurate focal stack, represented as a video sequence, enabling interactive refocusing and unlocking a range of downstream applications. We release a large-scale focal stack dataset acquired under diverse real-world smartphone conditions to support this work and future research. Our method consistently outperforms existing approaches in both perceptual quality and robustness across challenging scenarios, paving the way for more advanced focus-editing capabilities in everyday photography. Code and data are available at https://learn2refocus.github.io.

* Code and data are available at https://learn2refocus.github.io. SIGGRAPH Asia 2025, Dec. 2025

LEDiff: Latent Exposure Diffusion for HDR Generation

Dec 19, 2024

Chao Wang, Zhihao Xia, Thomas Leimkuehler, Karol Myszkowski, Xuaner Zhang

Abstract: While consumer displays increasingly support more than 10 stops of dynamic range, most image assets such as internet photographs and generative AI content remain limited to 8-bit low dynamic range (LDR), constraining their utility across high dynamic range (HDR) applications. Currently, no generative model can produce high-bit, high-dynamic range content in a generalizable way. Existing LDR-to-HDR conversion methods often struggle to produce photorealistic details and physically plausible dynamic range in the clipped areas. We introduce LEDiff, a method that equips a generative model with HDR content generation through latent space fusion inspired by image-space exposure fusion techniques. It also functions as an LDR-to-HDR converter, expanding the dynamic range of existing low-dynamic range images. Our approach uses a small HDR dataset to enable a pretrained diffusion model to recover detail and dynamic range in clipped highlights and shadows. LEDiff brings HDR capabilities to existing generative models and converts any LDR image to HDR, creating photorealistic HDR outputs for image generation, image-based lighting (HDR environment map generation), and photographic effects such as depth of field simulation, where linear HDR data is essential for realistic quality.
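
The image-space exposure fusion that the abstract cites as inspiration can be sketched roughly as below. This is a minimal illustration of classic per-pixel "well-exposedness" weighting on grayscale values, not LEDiff's latent-space fusion. The function names, the Gaussian weight, and `sigma=0.2` are assumptions made for this sketch.

```python
import math

def well_exposedness(value, sigma=0.2):
    """Weight a pixel in [0, 1] by how close it is to mid-gray (0.5)."""
    return math.exp(-((value - 0.5) ** 2) / (2 * sigma ** 2))

def fuse_exposures(exposures, sigma=0.2):
    """Blend same-sized grayscale images (lists of rows) pixel by pixel.

    Each output pixel is a weighted average over the input exposures,
    so clipped (very dark or very bright) pixels contribute less.
    """
    height, width = len(exposures[0]), len(exposures[0][0])
    fused = []
    for y in range(height):
        row = []
        for x in range(width):
            pixels = [img[y][x] for img in exposures]
            weights = [well_exposedness(p, sigma) for p in pixels]
            total = sum(weights)  # strictly positive for inputs in [0, 1]
            row.append(sum(w * p for w, p in zip(weights, pixels)) / total)
        fused.append(row)
    return fused
```

Calling `fuse_exposures([under, over])` on an under-exposed and an over-exposed image of the same size returns one image in which each pixel leans toward whichever exposure is closer to mid-gray at that location.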

Instruction-based Image Manipulation by Watching How Things Move

Dec 16, 2024

Mingdeng Cao, Xuaner Zhang, Yinqiang Zheng, Zhihao Xia

Abstract: This paper introduces a novel dataset construction pipeline that samples pairs of frames from videos and uses multimodal large language models (MLLMs) to generate editing instructions for training instruction-based image manipulation models. Video frames inherently preserve the identity of subjects and scenes, ensuring consistent content preservation during editing. Additionally, video data captures diverse, natural dynamics, such as non-rigid subject motion and complex camera movements, that are difficult to model otherwise, making it an ideal source for scalable dataset construction. Using this approach, we create a new dataset to train InstructMove, a model capable of instruction-based complex manipulations that are difficult to achieve with synthetically generated datasets. Our model demonstrates state-of-the-art performance in tasks such as adjusting subject poses, rearranging elements, and altering camera perspectives.

* Project page: https://ljzycmd.github.io/projects/InstructMove/

Generative Portrait Shadow Removal

Oct 07, 2024

Jae Shin Yoon, Zhixin Shu, Mengwei Ren, Xuaner Zhang, Yannick Hold-Geoffroy, Krishna Kumar Singh, He Zhang

Abstract: We introduce a high-fidelity portrait shadow removal model that can effectively enhance the image of a portrait by predicting its appearance under disturbing shadows and highlights. Portrait shadow removal is a highly ill-posed problem where multiple plausible solutions can be found based on a single image. While existing works have solved this problem by predicting the appearance residuals that can propagate local shadow distribution, such methods are often incomplete and lead to unnatural predictions, especially for portraits with hard shadows. We overcome the limitations of existing local propagation methods by formulating the removal problem as a generation task where a diffusion model learns to globally rebuild the human appearance from scratch, conditioned on an input portrait image. For robust and natural shadow removal, we propose to train the diffusion model with a compositional repurposing framework: a pre-trained text-guided image generation model is first fine-tuned to harmonize the lighting and color of the foreground with a background scene by using a background harmonization dataset; and then the model is further fine-tuned to generate a shadow-free portrait image via a shadow-paired dataset. To overcome the limitation of losing fine details in the latent diffusion model, we propose a guided-upsampling network to restore the original high-frequency details (wrinkles and dots) from the input image. To enable our compositional training framework, we construct a high-fidelity and large-scale dataset using a light stage capturing system and synthetic graphics simulation. Our generative framework effectively removes shadows caused by both self and external occlusions while maintaining original lighting distribution and high-frequency details. Our method also demonstrates robustness to diverse subjects captured in real environments.

* 17 pages, SIGGRAPH Asia, TOG

COMPOSE: Comprehensive Portrait Shadow Editing

Aug 25, 2024

Andrew Hou, Zhixin Shu, Xuaner Zhang, He Zhang, Yannick Hold-Geoffroy, Jae Shin Yoon, Xiaoming Liu

Abstract: Existing portrait relighting methods struggle with precise control over facial shadows, particularly when faced with challenges such as handling hard shadows from directional light sources or adjusting shadows while remaining in harmony with existing lighting conditions. In many situations, completely altering input lighting is undesirable for portrait retouching applications: one may want to preserve some authenticity in the captured environment. Existing shadow editing methods typically restrict their application to just the facial region and often offer limited lighting control options, such as shadow softening or rotation. In this paper, we introduce COMPOSE: a novel shadow editing pipeline for human portraits, offering precise control over shadow attributes such as shape, intensity, and position, all while preserving the original environmental illumination of the portrait. This level of disentanglement and controllability is obtained thanks to a novel decomposition of the environment map representation into ambient light and an editable Gaussian dominant light source. COMPOSE is a four-stage pipeline that consists of light estimation and editing, light diffusion, shadow synthesis, and finally shadow editing. We define facial shadows as the result of a dominant light source, encoded using our novel Gaussian environment map representation. Utilizing an OLAT dataset, we have trained models to: (1) predict this light source representation from images, and (2) generate realistic shadows using this representation. We also demonstrate comprehensive and intuitive shadow editing with our pipeline. Through extensive quantitative and qualitative evaluations, we have demonstrated the robust capability of our system in shadow editing.

* Accepted at ECCV 2024

Explorative Inbetweening of Time and Space

Mar 21, 2024

Haiwen Feng, Zheng Ding, Zhihao Xia, Simon Niklaus, Victoria Abrevaya, Michael J. Black, Xuaner Zhang

Abstract: We introduce bounded generation as a generalized task to control video generation to synthesize arbitrary camera and subject motion based only on a given start and end frame. Our objective is to fully leverage the inherent generalization capability of an image-to-video model without additional training or fine-tuning of the original model. This is achieved through the proposed new sampling strategy, which we call Time Reversal Fusion, that fuses the temporally forward and backward denoising paths conditioned on the start and end frame, respectively. The fused path results in a video that smoothly connects the two frames, generating inbetweening of faithful subject motion, novel views of static scenes, and seamless video looping when the two bounding frames are identical. We curate a diverse evaluation dataset of image pairs and compare against the closest existing methods. We find that Time Reversal Fusion outperforms related work on all subtasks, exhibiting the ability to generate complex motions and 3D-consistent views guided by bounded frames. See project page at https://time-reversal.github.io.
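
As a toy illustration of fusing a forward and a backward path, the sketch below blends two per-frame predictions from a single denoising step. It is not the paper's Time Reversal Fusion implementation: the function name `fuse_paths`, the representation of frames as flat lists of floats, and the linear blending schedule are all assumptions chosen for readability.

```python
def fuse_paths(forward, backward):
    """Fuse one denoising step's forward and backward frame predictions.

    forward[i]  : frame i predicted from the start frame (forward time).
    backward[i] : frame i predicted from the end frame, so backward[0]
                  is the frame nearest the end frame (reversed time).
    """
    n = len(forward)
    if n != len(backward):
        raise ValueError("paths must have the same number of frames")
    reversed_backward = backward[::-1]  # re-align to forward time order
    fused = []
    for i in range(n):
        # Assumed linear schedule: trust the forward path near the start
        # frame and the backward path near the end frame.
        w = i / (n - 1) if n > 1 else 0.5
        fused.append([(1 - w) * f + w * b
                      for f, b in zip(forward[i], reversed_backward[i])])
    return fused
```

In a real sampler a fusion step of this kind would sit inside the denoising loop, so that each step continues from the fused frames rather than from either path alone.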

* Project page: https://time-reversal.github.io

Magic Fixup: Streamlining Photo Editing by Watching Dynamic Videos

Mar 19, 2024

Hadi Alzayer, Zhihao Xia, Xuaner Zhang, Eli Shechtman, Jia-Bin Huang, Michael Gharbi

Abstract: We propose a generative model that, given a coarsely edited image, synthesizes a photorealistic output that follows the prescribed layout. Our method transfers fine details from the original image and preserves the identity of its parts, yet adapts them to the lighting and context defined by the new layout. Our key insight is that videos are a powerful source of supervision for this task: objects and camera motions provide many observations of how the world changes with viewpoint, lighting, and physical interactions. We construct an image dataset in which each sample is a pair of source and target frames extracted from the same video at randomly chosen time intervals. We warp the source frame toward the target using two motion models that mimic the expected test-time user edits. We supervise our model to translate the warped image into the ground truth, starting from a pretrained diffusion model. Our model design explicitly enables fine detail transfer from the source frame to the generated image, while closely following the user-specified layout. We show that by using simple segmentations and coarse 2D manipulations, we can synthesize a photorealistic edit faithful to the user's input while addressing second-order effects like harmonizing the lighting and physical interactions between edited objects.
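
The frame-pair sampling described above might look something like the following sketch. The function name `sample_frame_pair`, the `max_gap` parameter, and the uniform choice of gap are assumptions for illustration. The abstract only states that source and target frames come from the same video at randomly chosen time intervals.

```python
import random

def sample_frame_pair(num_frames, max_gap, rng=random):
    """Pick (source, target) frame indices from one video.

    The gap between the two indices is drawn uniformly from 1..max_gap,
    clipped so that both indices stay inside the video.
    """
    if num_frames < 2:
        raise ValueError("need at least two frames")
    if max_gap < 1:
        raise ValueError("max_gap must be at least 1")
    gap = rng.randint(1, min(max_gap, num_frames - 1))
    source = rng.randint(0, num_frames - 1 - gap)
    return source, source + gap
```

Passing a seeded `random.Random` instance as `rng` makes the sampled pairs reproducible, which helps when rebuilding a dataset.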

* Project page: https://magic-fixup.github.io/

Holo-Relighting: Controllable Volumetric Portrait Relighting from a Single Image

Mar 14, 2024

Yiqun Mei, Yu Zeng, He Zhang, Zhixin Shu, Xuaner Zhang, Sai Bi, Jianming Zhang, HyunJoon Jung, Vishal M. Patel

Abstract: At the core of portrait photography is the search for ideal lighting and viewpoint. The process often requires advanced knowledge in photography and an elaborate studio setup. In this work, we propose Holo-Relighting, a volumetric relighting method that is capable of synthesizing novel viewpoints and novel lighting from a single image. Holo-Relighting leverages the pretrained 3D GAN (EG3D) to reconstruct geometry and appearance from an input portrait as a set of 3D-aware features. We design a relighting module conditioned on a given lighting to process these features, and predict a relit 3D representation in the form of a tri-plane, which can render to an arbitrary viewpoint through volume rendering. Besides viewpoint and lighting control, Holo-Relighting also takes the head pose as a condition to enable head-pose-dependent lighting effects. With these novel designs, Holo-Relighting can generate complex non-Lambertian lighting effects (e.g., specular highlights and cast shadows) without using any explicit physical lighting priors. We train Holo-Relighting with data captured with a light stage, and propose two data-rendering techniques to improve the data quality for training the volumetric relighting system. Through quantitative and qualitative experiments, we demonstrate Holo-Relighting can achieve state-of-the-art relighting quality with better photorealism, 3D consistency and controllability.

* CVPR 2024

Restoration by Generation with Constrained Priors

Dec 28, 2023

Zheng Ding, Xuaner Zhang, Zhuowen Tu, Zhihao Xia

Abstract: The inherent generative power of denoising diffusion models makes them well-suited for image restoration tasks where the objective is to find the optimal high-quality image within the generative space that closely resembles the input image. We propose a method to adapt a pretrained diffusion model for image restoration by simply adding noise to the input image to be restored and then denoising it. Our method is based on the observation that the space of a generative model needs to be constrained. We impose this constraint by finetuning the generative model with a set of anchor images that capture the characteristics of the input image. With the constrained space, we can then leverage the sampling strategy used for generation to do image restoration. We evaluate against previous methods and show superior performance on multiple real-world restoration datasets in preserving identity and image quality. We also demonstrate an important and practical application on personalized restoration, where we use a personal album as the anchor images to constrain the generative space. This approach allows us to produce results that accurately preserve high-frequency details, which previous works are unable to do. Project webpage: https://gen2res.github.io.
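
The "adding noise" half of that recipe can be illustrated with the standard forward-diffusion formula, in which a cumulative signal fraction (written `alpha_bar` here) sets how much of the input survives. This is a generic sketch, not the paper's code: the name `add_noise` and the flat-list image representation are assumptions, and the denoising half is omitted because it needs a trained diffusion model.

```python
import math
import random

def add_noise(pixels, alpha_bar, rng=random):
    """Noise a flat list of pixel values to a chosen diffusion level.

    alpha_bar in [0, 1] is the cumulative signal fraction: 1.0 keeps the
    input unchanged, 0.0 replaces it with pure Gaussian noise.
    """
    if not 0.0 <= alpha_bar <= 1.0:
        raise ValueError("alpha_bar must lie in [0, 1]")
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [signal * p + noise * rng.gauss(0.0, 1.0) for p in pixels]
```

A smaller `alpha_bar` hides more of the degraded input and leaves more for the generative prior to fill in, while a larger one keeps the result closer to the input.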

DiffusionRig: Learning Personalized Priors for Facial Appearance Editing

Apr 13, 2023

Zheng Ding, Xuaner Zhang, Zhihao Xia, Lars Jebe, Zhuowen Tu, Xiuming Zhang

Abstract: We address the problem of learning person-specific facial priors from a small number (e.g., 20) of portrait photos of the same person. This enables us to edit this specific person's facial appearance, such as expression and lighting, while preserving their identity and high-frequency facial details. Key to our approach, which we dub DiffusionRig, is a diffusion model conditioned on, or "rigged by," crude 3D face models estimated from single in-the-wild images by an off-the-shelf estimator. At a high level, DiffusionRig learns to map simplistic renderings of 3D face models to realistic photos of a given person. Specifically, DiffusionRig is trained in two stages: it first learns generic facial priors from a large-scale face dataset and then person-specific priors from a small portrait photo collection of the person of interest. By learning the CGI-to-photo mapping with such personalized priors, DiffusionRig can "rig" the lighting, facial expression, head pose, etc. of a portrait photo, conditioned only on coarse 3D models while preserving this person's identity and other high-frequency characteristics. Qualitative and quantitative experiments show that DiffusionRig outperforms existing approaches in both identity preservation and photorealism. Please see the project website: https://diffusionrig.github.io for the supplemental material, video, code, and data.

* CVPR 2023. Project website: https://diffusionrig.github.io
