Kwang Moo Yi

INVE: Interactive Neural Video Editing

Jul 15, 2023
Jiahui Huang, Leonid Sigal, Kwang Moo Yi, Oliver Wang, Joon-Young Lee

We present Interactive Neural Video Editing (INVE), a real-time video editing solution that assists the video editing process by consistently propagating sparse frame edits to the entire video clip. Our method is inspired by the recent work on Layered Neural Atlases (LNA). LNA, however, suffers from two major drawbacks: (1) the method is too slow for interactive editing, and (2) it offers insufficient support for some editing use cases, including direct frame editing and rigid texture tracking. To address these challenges, we adopt highly efficient network architectures powered by hash-grid encoding to substantially improve processing speed. In addition, we learn bi-directional functions between the image and the atlas and introduce vectorized editing, which together enable a much greater variety of edits in both the atlas and the frames directly. Compared to LNA, our INVE reduces the learning and inference time by a factor of 5 and supports various video editing operations that LNA cannot. We showcase the superiority of INVE over LNA in interactive video editing through a comprehensive quantitative and qualitative analysis, highlighting its numerous advantages and improved performance. For video results, please see https://gabriel-huang.github.io/inve/
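
To make the bi-directional mapping concrete, here is a minimal, hypothetical PyTorch sketch (layer sizes, names, and the training signal are illustrative assumptions, not the INVE implementation): one MLP maps frame coordinates and time to atlas coordinates, a second maps atlas coordinates back to the frame, and a cycle-consistency loss ties the two together so that edits made in either space can be propagated through the other.

import torch
import torch.nn as nn
def mlp(in_dim, out_dim, hidden=64):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))
frame_to_atlas = mlp(3, 2)   # (x, y, t) -> atlas coordinates (u, v)
atlas_to_frame = mlp(3, 2)   # (u, v, t) -> frame coordinates (x, y)
xyt = torch.rand(1024, 3)                            # sampled frame points and times
uv = frame_to_atlas(xyt)                             # forward map into the shared atlas
xy_back = atlas_to_frame(torch.cat([uv, xyt[:, 2:3]], dim=-1))
# Cycle consistency pushes the two mappings to be inverses of each other, which
# is what lets an edit placed on one frame be carried into the atlas and then
# re-rendered consistently in every other frame.
cycle_loss = (xy_back - xyt[:, :2]).pow(2).mean()
cycle_loss.backward()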

Unsupervised Semantic Correspondence Using Stable Diffusion

May 24, 2023
Eric Hedlin, Gopal Sharma, Shweta Mahajan, Hossam Isack, Abhishek Kar, Andrea Tagliasacchi, Kwang Moo Yi

Text-to-image diffusion models are now capable of generating images that are often indistinguishable from real images. To generate such images, these models must understand the semantics of the objects they are asked to generate. In this work we show that, without any training, one can leverage this semantic knowledge within diffusion models to find semantic correspondences -- locations in multiple images that have the same semantic meaning. Specifically, given an image, we optimize the prompt embeddings of these models for maximum attention on the regions of interest. These optimized embeddings capture semantic information about the location, which can then be transferred to another image. By doing so we obtain results on par with the strongly supervised state of the art on the PF-Willow dataset and significantly outperform (by 20.9% relative on SPair-71k) any existing weakly supervised or unsupervised method on the PF-Willow, CUB-200 and SPair-71k datasets.
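
As a rough, self-contained illustration of the embedding optimization (random tensors stand in for the diffusion model's internal features, so everything below is an assumption rather than the paper's pipeline), one can optimize an embedding so that its attention map concentrates on a chosen region, then query a second feature map with the optimized embedding:

import torch
d, h, w = 64, 32, 32
feats_a = torch.randn(h * w, d)              # stand-in for per-pixel features of image A
mask = torch.zeros(h, w)
mask[10:20, 10:20] = 1.0                     # region of interest in image A
mask = mask.flatten()
emb = torch.randn(d, requires_grad=True)     # the "prompt embedding" being optimized
opt = torch.optim.Adam([emb], lr=1e-2)
for _ in range(200):
    attn = torch.softmax(feats_a @ emb / d ** 0.5, dim=0)   # attention over pixels
    loss = -(attn * mask).sum()              # maximize attention mass inside the region
    opt.zero_grad(); loss.backward(); opt.step()
feats_b = torch.randn(h * w, d)              # stand-in features of image B
match = torch.argmax(feats_b @ emb.detach()) # most-attended location in image B
print("corresponding pixel:", divmod(int(match), w))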

PPDONet: Deep Operator Networks for Fast Prediction of Steady-State Solutions in Disk-Planet Systems

May 18, 2023
Shunyuan Mao, Ruobing Dong, Lu Lu, Kwang Moo Yi, Sifan Wang, Paris Perdikaris

We develop a tool, which we name Protoplanetary Disk Operator Network (PPDONet), that can predict the solution of disk-planet interactions in protoplanetary disks in real-time. We base our tool on Deep Operator Networks (DeepONets), a class of neural networks capable of learning non-linear operators to represent deterministic and stochastic differential equations. With PPDONet we map three scalar parameters in a disk-planet system -- the Shakura & Sunyaev viscosity $\alpha$, the disk aspect ratio $h_0$, and the planet-star mass ratio $q$ -- to steady-state solutions of the disk surface density, radial velocity, and azimuthal velocity. We demonstrate the accuracy of the PPDONet solutions using a comprehensive set of tests. Our tool is able to predict the outcome of disk-planet interaction for one system in less than a second on a laptop. A public implementation of PPDONet is available at https://github.com/smao-astro/PPDONet.

* 10 pages, 6 figures, 2 tables; ApJL accepted 
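
For readers unfamiliar with DeepONets, the sketch below shows the branch/trunk structure in PyTorch (layer sizes and example values are illustrative assumptions, not PPDONet's configuration): the branch net encodes the three scalar parameters, the trunk net encodes a query location in the disk, and their inner product gives the predicted field value at that location.

import torch
import torch.nn as nn
class DeepONet(nn.Module):
    def __init__(self, p=64):
        super().__init__()
        # Branch net: encodes the system parameters (alpha, h0, q).
        self.branch = nn.Sequential(nn.Linear(3, 128), nn.Tanh(), nn.Linear(128, p))
        # Trunk net: encodes a query location (r, phi) in the disk.
        self.trunk = nn.Sequential(nn.Linear(2, 128), nn.Tanh(), nn.Linear(128, p))
    def forward(self, params, coords):
        b = self.branch(params)              # (B, p)
        t = self.trunk(coords)               # (N, p)
        return b @ t.T                       # (B, N): field value at each query point
model = DeepONet()
params = torch.tensor([[1e-3, 0.05, 1e-4]])  # example (alpha, h0, q)
coords = torch.rand(1000, 2)                 # query points (r, phi)
sigma = model(params, coords)                # predicted steady-state field, e.g. surface density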

BlendFields: Few-Shot Example-Driven Facial Modeling

May 12, 2023
Kacper Kania, Stephan J. Garbin, Andrea Tagliasacchi, Virginia Estellers, Kwang Moo Yi, Julien Valentin, Tomasz Trzciński, Marek Kowalski

Generating faithful visualizations of human faces requires capturing both coarse and fine-level details of the face geometry and appearance. Existing methods are either data-driven, requiring an extensive corpus of data that is not publicly accessible to the research community, or fail to capture fine details because they rely on geometric face models whose mesh discretization and linear deformation, designed to model only coarse face geometry, cannot represent fine-grained texture detail. We introduce a method that bridges this gap by drawing inspiration from traditional computer graphics techniques. Unseen expressions are modeled by blending appearance from a sparse set of extreme poses. This blending is performed by measuring local volumetric changes in those expressions and locally reproducing their appearance whenever a similar expression is performed at test time. We show that our method generalizes to unseen expressions, adding fine-grained effects on top of smooth volumetric deformations of a face, and demonstrate how it generalizes beyond faces.

* Accepted to CVPR 2023. Project page: https://blendfields.github.io/ 
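
The blending idea can be pictured with a toy sketch (dimensions and the similarity kernel are assumptions, not the BlendFields implementation): per-region descriptors of local volumetric change for the current expression are compared against those of the captured extreme expressions, and the resulting weights mix the per-expression appearance codes.

import torch
K, R, D = 5, 16, 32                    # K extreme expressions, R local regions, D-dim appearance
extreme_vol = torch.randn(K, R)        # local volumetric-change descriptors per extreme expression
extreme_app = torch.randn(K, R, D)     # appearance code of each expression in each region
current_vol = torch.randn(R)           # descriptor of the expression observed at test time
# Regions whose local volume change resembles an extreme expression borrow more
# of that expression's appearance in the blend.
sim = -(extreme_vol - current_vol).pow(2)                  # (K, R)
w = torch.softmax(sim / 0.1, dim=0)                        # (K, R), sums to 1 over expressions
blended_app = (w.unsqueeze(-1) * extreme_app).sum(dim=0)   # (R, D) blended appearance codes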

Pointersect: Neural Rendering with Cloud-Ray Intersection

Apr 24, 2023
Jen-Hao Rick Chang, Wei-Yu Chen, Anurag Ranjan, Kwang Moo Yi, Oncel Tuzel

We propose a novel method that renders point clouds as if they are surfaces. The proposed method is differentiable and requires no scene-specific optimization. This unique capability enables, out-of-the-box, surface normal estimation, rendering of room-scale point clouds, inverse rendering, and ray tracing with global illumination. Unlike existing work that focuses on converting point clouds to other representations -- e.g., surfaces or implicit functions -- our key idea is to directly infer the intersection of a light ray with the underlying surface represented by the given point cloud. Specifically, we train a set transformer that, given a small number of local neighbor points along a light ray, provides the intersection point, the surface normal, and the material blending weights, which are used to render the outcome of this light ray. Localizing the problem into small neighborhoods enables us to train a model with only 48 meshes and apply it to unseen point clouds. Our model achieves higher estimation accuracy than state-of-the-art surface reconstruction and point-cloud rendering methods on three test sets. When applied to room-scale point clouds, without any scene-specific optimization, the model achieves competitive quality with state-of-the-art novel-view rendering methods. Moreover, we demonstrate the ability to render and manipulate Lidar-scanned point clouds, for example with lighting control and object insertion.

* CVPR 2023 
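
A rough sketch of the cloud-ray intersection idea follows (a generic transformer encoder and assumed shapes; material blending weights are omitted, and this is not the released Pointersect model): points near the ray are gathered and a set network predicts the intersection depth and surface normal.

import torch
import torch.nn as nn
def ray_neighbors(points, origin, direction, k=16):
    """Pick the k points closest to the ray origin + t * direction."""
    v = points - origin                              # (N, 3)
    t = (v @ direction).clamp(min=0)                 # projection onto the ray
    closest = origin + t[:, None] * direction        # nearest point on the ray to each point
    dist = (points - closest).norm(dim=-1)
    idx = dist.topk(k, largest=False).indices
    return points[idx]
class PointersectSketch(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.embed = nn.Linear(3, d)
        enc_layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.head = nn.Linear(d, 1 + 3)              # depth along the ray + surface normal
    def forward(self, neighbors):                    # (B, k, 3), in ray coordinates
        h = self.encoder(self.embed(neighbors)).mean(dim=1)
        out = self.head(h)
        return out[:, :1], out[:, 1:]
points = torch.randn(2048, 3)
origin, direction = torch.zeros(3), torch.tensor([0.0, 0.0, 1.0])
neigh = ray_neighbors(points, origin, direction).unsqueeze(0)   # (1, k, 3)
depth, normal = PointersectSketch()(neigh)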

FaceLit: Neural 3D Relightable Faces

Mar 27, 2023
Anurag Ranjan, Kwang Moo Yi, Jen-Hao Rick Chang, Oncel Tuzel

We propose a generative framework, FaceLit, capable of generating a 3D face that can be rendered under various user-defined lighting conditions and views, learned purely from 2D images in the wild without any manual annotation. Unlike existing works that require a careful capture setup or human labor, we rely on off-the-shelf pose and illumination estimators. With these estimates, we incorporate the Phong reflectance model into the neural volume rendering framework. Our model learns to generate shape and material properties of a face such that, when rendered according to the natural statistics of pose and illumination, it produces photorealistic face images with multiview 3D and illumination consistency. Our method enables photorealistic generation of faces with explicit illumination and view control on multiple datasets -- FFHQ, MetFaces and CelebA-HQ. We show state-of-the-art photorealism among 3D-aware GANs on the FFHQ dataset, achieving an FID score of 3.5.

* CVPR 2023 
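
The Phong reflectance term itself is straightforward; the standalone sketch below (constants and shapes are illustrative, and the integration into volume rendering is omitted) shows the ambient, diffuse, and specular components that would shade each sample given its normal, the light direction, and the view direction.

import torch
import torch.nn.functional as F
def phong_shading(normal, light_dir, view_dir, albedo,
                  ambient=0.1, k_d=0.7, k_s=0.2, shininess=32.0):
    """All direction tensors are (..., 3); they are normalized inside."""
    n = F.normalize(normal, dim=-1)
    l = F.normalize(light_dir, dim=-1)
    v = F.normalize(view_dir, dim=-1)
    diffuse = (n * l).sum(-1, keepdim=True).clamp(min=0.0)
    # Reflection of the light direction about the surface normal.
    r = 2.0 * (n * l).sum(-1, keepdim=True) * n - l
    specular = (r * v).sum(-1, keepdim=True).clamp(min=0.0) ** shininess
    return albedo * (ambient + k_d * diffuse) + k_s * specular
normal = torch.randn(1024, 3)
light = torch.tensor([0.0, 0.0, 1.0]).expand(1024, 3)
view = torch.tensor([0.0, 0.3, 1.0]).expand(1024, 3)
albedo = torch.rand(1024, 3)
rgb = phong_shading(normal, light, view, albedo)   # shaded radiance per sample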

Neural Fourier Filter Bank

Dec 04, 2022
Zhijie Wu, Yuhe Jin, Kwang Moo Yi

We present a novel method to provide efficient and highly detailed reconstructions. Inspired by wavelets, our main idea is to learn a neural field that decomposes the signal both spatially and frequency-wise. We follow the recent grid-based paradigm for spatial decomposition, but unlike existing work, encourage specific frequencies to be stored in each grid via Fourier feature encodings. We then apply a multi-layer perceptron with sine activations, feeding these Fourier-encoded features in at the appropriate layers so that higher-frequency components are accumulated on top of lower-frequency components sequentially; these are summed to form the final output. We demonstrate that our method outperforms the state of the art in terms of model compactness and efficiency on multiple tasks: 2D image fitting, 3D shape reconstruction, and neural radiance fields.
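
A simplified sketch of this spatial/frequency decomposition is given below (grid resolutions, frequencies, and layer sizes are toy assumptions rather than the paper's architecture): each level stores a learnable feature grid, its bilinearly sampled features are injected into a sine-activated layer with a level-specific frequency, and the per-level outputs are summed.

import torch
import torch.nn as nn
import torch.nn.functional as F
class FourierFilterBankSketch(nn.Module):
    def __init__(self, levels=(8, 16, 32), feat=16, hidden=64, out_dim=3):
        super().__init__()
        # One learnable feature grid per spatial resolution level.
        self.grids = nn.ParameterList(
            [nn.Parameter(0.01 * torch.randn(1, feat, r, r)) for r in levels])
        self.freqs = [2.0 ** i for i in range(len(levels))]    # level-specific frequency scales
        self.inject = nn.ModuleList([nn.Linear(feat, hidden) for _ in levels])
        self.backbone = nn.ModuleList([nn.Linear(hidden, hidden) for _ in levels])
        self.heads = nn.ModuleList([nn.Linear(hidden, out_dim) for _ in levels])
        self.hidden = hidden
    def forward(self, xy):                                     # xy in [-1, 1], shape (N, 2)
        grid = xy.view(1, -1, 1, 2)
        h = torch.zeros(xy.shape[0], self.hidden, device=xy.device)
        out = 0.0
        for g, inj, lin, head, freq in zip(self.grids, self.inject,
                                           self.backbone, self.heads, self.freqs):
            f = F.grid_sample(g, grid, align_corners=True)     # (1, feat, N, 1)
            f = f.squeeze(0).squeeze(-1).T                     # (N, feat)
            # Sine activation at a level-specific frequency: each level adds
            # progressively higher-frequency detail on top of the running signal.
            h = torch.sin(freq * (lin(h) + inj(f)))
            out = out + head(h)                                # sum the per-level outputs
        return out
model = FourierFilterBankSketch()
coords = torch.rand(4096, 2) * 2 - 1
rgb = model(coords)                                            # (4096, 3) predicted values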

Bootstrapping Human Optical Flow and Pose

Oct 28, 2022
Aritro Roy Arko, James J. Little, Kwang Moo Yi

We propose a bootstrapping framework to enhance human optical flow and pose. We show that, for videos involving humans in scenes, we can improve both the optical flow and the pose estimation quality of humans by considering the two tasks at the same time. We enhance optical flow estimates by fine-tuning them to fit the human pose estimates and vice versa. In more detail, we optimize the pose and optical flow networks to, at inference time, agree with each other. We show that this yields state-of-the-art results on the Human3.6M and 3D Poses in the Wild datasets, as well as on a human-related subset of the Sintel dataset, in terms of both pose estimation accuracy and optical flow accuracy at human joint locations. Code is available at https://github.com/ubc-vision/bootstrapping-human-optical-flow-and-pose

* Accepted at BMVC 2022. Supplementary qualitative results - https://aritro30.github.io/results/. Code at https://github.com/ubc-vision/bootstrapping-human-optical-flow-and-pose 
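
The agreement objective can be illustrated with a toy example (stand-in tensors replace the flow and pose networks, so this is only a sketch of the idea): the flow sampled at the frame-t joint locations should carry those joints onto the frame-(t+1) pose, and both quantities are refined by gradient descent on that discrepancy.

import torch
import torch.nn.functional as F
H, W, J = 64, 64, 17
flow = torch.zeros(1, 2, H, W, requires_grad=True)                   # stand-in flow field (pixels)
pose_t = torch.rand(1, J, 2) * torch.tensor([W - 1.0, H - 1.0])      # joints (x, y) at frame t
pose_t1 = (pose_t + torch.randn(1, J, 2)).detach().requires_grad_(True)  # noisy joints at frame t+1
opt = torch.optim.Adam([flow, pose_t1], lr=0.1)
for _ in range(100):
    grid = (pose_t / torch.tensor([W - 1.0, H - 1.0])) * 2 - 1       # joint locations in [-1, 1]
    f = F.grid_sample(flow, grid.view(1, J, 1, 2), align_corners=True)
    f = f.squeeze(-1).permute(0, 2, 1)                               # (1, J, 2): flow at the joints
    # Agreement: joints at frame t, displaced by the flow, should land on the frame t+1 pose.
    loss = (pose_t + f - pose_t1).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()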

Attention Beats Concatenation for Conditioning Neural Fields

Sep 21, 2022
Daniel Rebain, Mark J. Matthews, Kwang Moo Yi, Gopal Sharma, Dmitry Lagun, Andrea Tagliasacchi

Neural fields model signals by mapping coordinate inputs to sampled values. They are becoming an increasingly important backbone architecture across many fields from vision and graphics to biology and astronomy. In this paper, we explore the differences between common conditioning mechanisms within these networks, an essential ingredient in shifting neural fields from memorization of signals to generalization, where the set of signals lying on a manifold is modelled jointly. In particular, we are interested in the scaling behaviour of these mechanisms to increasingly high-dimensional conditioning variables. As we show in our experiments, high-dimensional conditioning is key to modelling complex data distributions, thus it is important to determine what architecture choices best enable this when working on such problems. To this end, we run experiments modelling 2D, 3D, and 4D signals with neural fields, employing concatenation, hyper-network, and attention-based conditioning strategies -- a necessary but laborious effort that has not been performed in the literature. We find that attention-based conditioning outperforms other approaches in a variety of settings.
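
The two ends of the design space compared in the paper can be sketched as follows (toy dimensions; the exact architectures in the study differ): a field conditioned by concatenating a latent code to every coordinate, versus one that cross-attends from each coordinate to a set of latent tokens.

import torch
import torch.nn as nn
class ConcatField(nn.Module):
    """Condition by concatenating the latent code to every input coordinate."""
    def __init__(self, z_dim=256, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3 + z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))
    def forward(self, x, z):                       # x: (N, 3), z: (z_dim,)
        return self.net(torch.cat([x, z.expand(x.shape[0], -1)], dim=-1))
class AttentionField(nn.Module):
    """Condition by cross-attending from each coordinate to a set of latent tokens."""
    def __init__(self, d=128):
        super().__init__()
        self.to_q = nn.Linear(3, d)
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.out = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 1))
    def forward(self, x, tokens):                  # x: (N, 3), tokens: (n_tokens, d)
        q = self.to_q(x).unsqueeze(0)              # (1, N, d) queries from coordinates
        kv = tokens.unsqueeze(0)                   # (1, n_tokens, d) conditioning tokens
        h, _ = self.attn(q, kv, kv)
        return self.out(h.squeeze(0))
x = torch.rand(1024, 3)
y_cat = ConcatField()(x, torch.randn(256))
y_att = AttentionField()(x, torch.randn(16, 128))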

Estimating Visual Information From Audio Through Manifold Learning

Aug 03, 2022
Fabrizio Pedersoli, Dryden Wiebe, Amin Banitalebi, Yong Zhang, Kwang Moo Yi

We propose a new framework for extracting visual information about a scene only using audio signals. Audio-based methods can overcome some of the limitations of vision-based methods: they do not require "line-of-sight", are robust to occlusions and changes in illumination, and can function as a backup in case vision/lidar sensors fail. Therefore, audio-based methods can be useful even for applications in which only visual information is of interest. Our framework is based on Manifold Learning and consists of two steps. First, we train a Vector-Quantized Variational Auto-Encoder to learn the data manifold of the particular visual modality we are interested in. Second, we train an Audio Transformation network to map multi-channel audio signals to the latent representation of the corresponding visual sample. We show that our method is able to produce meaningful images from audio using a publicly available audio/visual dataset. In particular, we consider the prediction of the following visual modalities from audio: depth and semantic segmentation. We hope the findings of our work can facilitate further research in visual information extraction from audio. Code is available at: https://github.com/ubc-vision/audio_manifold.
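
The two-step recipe can be sketched as follows (the codebook and the simple MLP below are stand-ins for the trained VQ-VAE and the Audio Transformation network, so treat every shape and name as an assumption): a frozen codebook defines the visual latent space, and an audio network is trained to predict the code indices of the paired visual sample.

import torch
import torch.nn as nn
import torch.nn.functional as F
n_codes, code_dim, latent_hw = 512, 64, 8 * 8
codebook = nn.Embedding(n_codes, code_dim)          # stand-in for a trained VQ-VAE codebook
for p in codebook.parameters():
    p.requires_grad_(False)                         # step 1: the visual manifold is already learned
audio_net = nn.Sequential(                          # step 2: audio -> visual latent code indices
    nn.Linear(4 * 1024, 1024), nn.ReLU(),
    nn.Linear(1024, latent_hw * n_codes))
audio = torch.randn(8, 4 * 1024)                    # e.g. flattened multi-channel spectrograms
target_idx = torch.randint(0, n_codes, (8, latent_hw))   # codes of the paired visual sample
logits = audio_net(audio).view(8, latent_hw, n_codes)
loss = F.cross_entropy(logits.permute(0, 2, 1), target_idx)
loss.backward()
# At test time, predicted indices would be looked up in the codebook and passed to the
# VQ-VAE decoder to produce, e.g., a depth or segmentation map.
pred_latents = codebook(logits.argmax(dim=-1))      # (8, latent_hw, code_dim)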
