Kyle Olszewski

Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning

Mar 04, 2022

Ligong Han, Jian Ren, Hsin-Ying Lee, Francesco Barbieri, Kyle Olszewski, Shervin Minaee, Dimitris Metaxas, Sergey Tulyakov

Abstract: Most methods for conditional video synthesis use a single modality as the condition. This comes with major limitations. For example, it is problematic for a model conditioned on an image to generate a specific motion trajectory desired by the user since there is no means to provide motion information. Conversely, language information can describe the desired motion, while not precisely defining the content of the video. This work presents a multimodal video generation framework that benefits from text and images provided jointly or separately. We leverage the recent progress in quantized representations for videos and apply a bidirectional transformer with multiple modalities as inputs to predict a discrete video representation. To improve video quality and consistency, we propose a new video token trained with self-learning and an improved mask-prediction algorithm for sampling video tokens. We introduce text augmentation to improve the robustness of the textual representation and diversity of generated videos. Our framework can incorporate various visual modalities, such as segmentation masks, drawings, and partially occluded images. It can generate much longer sequences than the one used for training. In addition, our model can extract visual information as suggested by the text prompt, e.g., "an object in image one is moving northeast", and generate corresponding videos. We run evaluations on three public datasets and a newly collected dataset labeled with facial attributes, achieving state-of-the-art generation results on all four.

* Accepted to CVPR 2022
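
The abstract above mentions sampling video tokens with a mask-prediction algorithm. The sketch below illustrates the generic iterative mask-predict idea only, not the paper's improved variant, whose details the abstract does not give. The predictor `stub_predict`, the vocabulary size, and the linear re-masking schedule are all assumptions made for illustration.

```python
import random

MASK = None  # placeholder for a masked token position


def stub_predict(tokens, rng, vocab_size=8):
    """Hypothetical stand-in for the bidirectional transformer.

    Returns a (token, confidence) pair for every position.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def mask_predict(num_tokens, num_iters, rng):
    """Fill all positions, then repeatedly re-mask the least confident ones."""
    tokens = [MASK] * num_tokens
    confidence = [0.0] * num_tokens
    for step in range(1, num_iters + 1):
        predictions = stub_predict(tokens, rng)
        for i, (token, conf) in enumerate(predictions):
            if tokens[i] is MASK:  # only masked positions are overwritten
                tokens[i], confidence[i] = token, conf
        # Assumed schedule: the number of re-masked positions shrinks linearly to 0.
        num_to_mask = num_tokens * (num_iters - step) // num_iters
        worst = sorted(range(num_tokens), key=lambda i: confidence[i])[:num_to_mask]
        for i in worst:
            tokens[i] = MASK
    return tokens


video_tokens = mask_predict(num_tokens=16, num_iters=4, rng=random.Random(0))
```

Because the schedule reaches zero on the last iteration, every position holds a token when the loop ends.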

NeROIC: Neural Rendering of Objects from Online Image Collections

Jan 07, 2022

Zhengfei Kuang, Kyle Olszewski, Menglei Chai, Zeng Huang, Panos Achlioptas, Sergey Tulyakov

Abstract: We present a novel method to acquire object representations from online image collections, capturing high-quality geometry and material properties of arbitrary objects from photographs with varying cameras, illumination, and backgrounds. This enables various object-centric rendering applications such as novel-view synthesis, relighting, and harmonized background composition from challenging in-the-wild input. Using a multi-stage approach extending neural radiance fields, we first infer the surface geometry and refine the coarsely estimated initial camera parameters, while leveraging coarse foreground object masks to improve the training efficiency and geometry quality. We also introduce a robust normal estimation technique which eliminates the effect of geometric noise while retaining crucial details. Lastly, we extract surface material properties and ambient illumination, represented in spherical harmonics with extensions that handle transient elements, e.g. sharp shadows. The union of these components results in a highly modular and efficient object acquisition framework. Extensive evaluations and comparisons demonstrate the advantages of our approach in capturing high-quality geometry and appearance properties useful for rendering applications.

* Project page including code can be found at: https://formyfamily.github.io/NeROIC/

Flow Guided Transformable Bottleneck Networks for Motion Retargeting

Jun 14, 2021

Jian Ren, Menglei Chai, Oliver J. Woodford, Kyle Olszewski, Sergey Tulyakov

Abstract: Human motion retargeting aims to transfer the motion of one person in a "driving" video or set of images to another person. Existing efforts leverage a long training video from each target person to train a subject-specific motion transfer model. However, the scalability of such methods is limited, as each model can only generate videos for the given target subject, and such training videos are labor-intensive to acquire and process. Few-shot motion transfer techniques, which only require one or a few images from a target, have recently drawn considerable attention. Methods addressing this task generally use either 2D or explicit 3D representations to transfer motion, and in doing so, sacrifice either accurate geometric modeling or the flexibility of an end-to-end learned representation. Inspired by the Transformable Bottleneck Network, which renders novel views and manipulations of rigid objects, we propose an approach based on an implicit volumetric representation of the image content, which can then be spatially manipulated using volumetric flow fields. We address the challenging question of how to aggregate information across different body poses, learning flow fields that allow for combining content from the appropriate regions of input images of highly non-rigid human subjects performing complex motions into a single implicit volumetric representation. This allows us to learn our 3D representation solely from videos of moving people. Armed with both 3D object understanding and end-to-end learned rendering, this categorically novel representation delivers state-of-the-art image generation quality, as shown by our quantitative and qualitative evaluations.

* CVPR 2021

A Good Image Generator Is What You Need for High-Resolution Video Synthesis

Apr 30, 2021

Yu Tian, Jian Ren, Menglei Chai, Kyle Olszewski, Xi Peng, Dimitris N. Metaxas, Sergey Tulyakov

Abstract: Image and video synthesis are closely related areas aiming at generating content from noise. While rapid progress has been demonstrated in improving image-based models to handle large resolutions, high-quality renderings, and wide variations in image content, achieving comparable video generation results remains problematic. We present a framework that leverages contemporary image generators to render high-resolution videos. We frame the video synthesis problem as discovering a trajectory in the latent space of a pre-trained and fixed image generator. Not only does such a framework render high-resolution videos, but it also is an order of magnitude more computationally efficient. We introduce a motion generator that discovers the desired trajectory, in which content and motion are disentangled. With such a representation, our framework allows for a broad range of applications, including content and motion manipulation. Furthermore, we introduce a new task, which we call cross-domain video synthesis, in which the image and motion generators are trained on disjoint datasets belonging to different domains. This allows for generating moving objects for which the desired video data is not available. Extensive experiments on various datasets demonstrate the advantages of our methods over existing video generation techniques. Code will be released at https://github.com/snap-research/MoCoGAN-HD.

* Accepted to ICLR 2021
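
The abstract above frames video synthesis as finding a trajectory in the latent space of a fixed image generator. The sketch below illustrates that framing in a toy setting. Both `fixed_image_generator` and `motion_step` are hypothetical stand-ins, and the random-walk update is an assumption, not the paper's learned motion generator.

```python
import random


def fixed_image_generator(latent):
    """Hypothetical stand-in for a pre-trained, frozen image generator.

    It maps a latent code to a fake "frame" (here just a copy of the code).
    """
    return list(latent)


def motion_step(latent, rng, scale=0.1):
    """Hypothetical motion generator: moves the latent code by a small step."""
    return [x + rng.gauss(0.0, scale) for x in latent]


def synthesize_video(first_latent, num_frames, rng):
    """Walk a trajectory in latent space, then render each point as a frame."""
    trajectory = [first_latent]
    for _ in range(num_frames - 1):
        trajectory.append(motion_step(trajectory[-1], rng))
    # Only the trajectory changes between frames; the generator is never updated.
    return [fixed_image_generator(z) for z in trajectory]


frames = synthesize_video([0.0, 0.0, 0.0, 0.0], num_frames=5, rng=random.Random(0))
```

In this toy version the first latent code plays the role of content and the sequence of steps plays the role of motion.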

Monocular Real-Time Volumetric Performance Capture

Jul 28, 2020

Ruilong Li, Yuliang Xiu, Shunsuke Saito, Zeng Huang, Kyle Olszewski, Hao Li

Abstract: We present the first approach to volumetric performance capture and novel-view rendering at real-time speed from monocular video, eliminating the need for expensive multi-view systems or cumbersome pre-acquisition of a personalized template model. Our system reconstructs a fully textured 3D human from each frame by leveraging Pixel-Aligned Implicit Function (PIFu). While PIFu achieves high-resolution reconstruction in a memory-efficient manner, its computationally expensive inference prevents us from deploying such a system for real-time applications. To this end, we propose a novel hierarchical surface localization algorithm and a direct rendering method without explicitly extracting surface meshes. By culling unnecessary regions for evaluation in a coarse-to-fine manner, we successfully accelerate the reconstruction by two orders of magnitude from the baseline without compromising the quality. Furthermore, we introduce an Online Hard Example Mining (OHEM) technique that effectively suppresses failure modes due to the rare occurrence of challenging examples. We adaptively update the sampling probability of the training data based on the current reconstruction accuracy, which effectively alleviates reconstruction artifacts. Our experiments and evaluations demonstrate the robustness of our system to various challenging angles, illuminations, poses, and clothing styles. We also show that our approach compares favorably with the state-of-the-art monocular performance capture. Our proposed approach removes the need for multi-view studio settings and enables a consumer-accessible solution for volumetric capture.
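
The abstract above says that sampling probabilities are updated according to current reconstruction accuracy. One simple way to do this is sketched below: examples with higher error are drawn more often. The proportional rule, the `smoothing` term, and the function names are assumptions for illustration; the abstract does not state the paper's actual update rule.

```python
import random


def update_sampling_probs(errors, smoothing=1e-3):
    """Turn per-example reconstruction errors into sampling probabilities.

    Assumed rule: probability proportional to error, with a small smoothing
    term so that well-reconstructed examples are still sampled occasionally.
    """
    weights = [error + smoothing for error in errors]
    total = sum(weights)
    return [weight / total for weight in weights]


def sample_batch(probs, batch_size, rng):
    """Draw training-example indices according to the current probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)


# Example: the third training example is currently reconstructed worst.
errors = [0.1, 0.1, 0.8]
probs = update_sampling_probs(errors)
batch = sample_batch(probs, batch_size=4, rng=random.Random(0))
```

In a training loop, the errors would be recomputed after each pass and the probabilities refreshed before the next batch is drawn.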

Intuitive, Interactive Beard and Hair Synthesis with Generative Models

Apr 15, 2020

Kyle Olszewski, Duygu Ceylan, Jun Xing, Jose Echevarria, Zhili Chen, Weikai Chen, Hao Li

Abstract: We present an interactive approach to synthesizing realistic variations in facial hair in images, ranging from subtle edits to existing hair to the addition of complex and challenging hair in images of clean-shaven subjects. To circumvent the tedious and computationally expensive tasks of modeling, rendering and compositing the 3D geometry of the target hairstyle using the traditional graphics pipeline, we employ a neural network pipeline that synthesizes realistic and detailed images of facial hair directly in the target image in under one second. The synthesis is controlled by simple and sparse guide strokes from the user defining the general structural and color properties of the target hairstyle. We qualitatively and quantitatively evaluate our chosen method compared to several alternative approaches. We show compelling interactive editing results with a prototype user interface that allows novice users to progressively refine the generated image to match their desired hairstyle, and demonstrate that our approach also allows for flexible and high-fidelity scalp hair synthesis.

* To be presented in the 2020 Conference on Computer Vision and Pattern Recognition (CVPR 2020, Oral Presentation). Supplementary video can be seen at: https://www.youtube.com/watch?v=v4qOtBATrvM

Transformable Bottleneck Networks

Apr 23, 2019

Kyle Olszewski, Sergey Tulyakov, Oliver Woodford, Hao Li, Linjie Luo

Abstract: We propose a novel approach to performing fine-grained 3D manipulation of image content via a convolutional neural network, which we call the Transformable Bottleneck Network (TBN). It applies given spatial transformations directly to a volumetric bottleneck within our encoder-bottleneck-decoder architecture. Multi-view supervision encourages the network to learn to spatially disentangle the feature space within the bottleneck. The resulting spatial structure can be manipulated with arbitrary spatial transformations. We demonstrate the efficacy of TBNs for novel view synthesis, achieving state-of-the-art results on a challenging benchmark. We demonstrate that the bottlenecks produced by networks trained for this task contain meaningful spatial structure that allows us to intuitively perform a variety of image manipulations in 3D, well beyond the rigid transformations seen during training. These manipulations include non-uniform scaling, non-rigid warping, and combining content from different images. Finally, we extract explicit 3D structure from the bottleneck, performing impressive 3D reconstruction from a single input image.
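
The abstract above describes applying a spatial transformation directly to a volumetric bottleneck. The toy sketch below rotates a small 3D grid by 90 degrees about its depth axis. It is only an illustration under simplifying assumptions: each voxel holds a single number rather than a feature vector, the encoder and decoder are omitted, and arbitrary transformations would additionally need resampling, which is not shown. The name `rotate_volume_90` is hypothetical.

```python
def rotate_volume_90(volume):
    """Rotate every depth slice of a [D][H][W] grid 90 degrees counterclockwise."""
    rotated = []
    for plane in volume:
        height, width = len(plane), len(plane[0])
        # Output row i is read from input column (width - 1 - i).
        rotated.append(
            [[plane[j][width - 1 - i] for j in range(height)] for i in range(width)]
        )
    return rotated


# A 2 x 2 x 2 "bottleneck" with one scalar per voxel.
bottleneck = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
rotated_once = rotate_volume_90(bottleneck)

# Four quarter-turns bring the volume back to where it started.
restored = bottleneck
for _ in range(4):
    restored = rotate_volume_90(restored)
```

In the architecture the abstract describes, a decoder would then render the transformed volume as an image of the manipulated content.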
