Helge Rhodin

A Data Perspective on Enhanced Identity Preservation for Diffusion Personalization

Nov 07, 2023
Xingzhe He, Zhiwen Cao, Nicholas Kolkin, Lantao Yu, Helge Rhodin, Ratheesh Kalarot

Large text-to-image models have revolutionized the ability to generate imagery using natural language. However, particularly unique or personal visual concepts, such as your pet or an object in your house, will not be captured by the original model. This has led to interest in how to inject new visual concepts, bound to a new text token, using as few as 4-6 examples. Despite significant progress, this task remains a formidable challenge, particularly in preserving the subject's identity. While most researchers attempt to address this issue by modifying model architectures, our approach takes a data-centric perspective, advocating the modification of data rather than the model itself. We introduce a novel regularization dataset generation strategy on both the text and image level, demonstrating that a rich, structured, and automatically generated regularization dataset prevents loss of text coherence and improves identity preservation. The improved quality is enabled by allowing up to 5x more fine-tuning iterations without overfitting or degeneration. The generated renditions of the desired subject preserve even fine details such as text and logos, all while maintaining the ability to generate diverse samples that follow the input text prompt. Since our method focuses on data augmentation rather than adjusting the model architecture, it is complementary and can be combined with prior work. We show on established benchmarks that our data-centric approach forms the new state of the art in terms of image quality, with the best trade-off between identity preservation, diversity, and text alignment.
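
As a concrete illustration of the data-centric idea, the following is a minimal sketch of generating a text-and-image regularization set with an off-the-shelf diffusion model via Hugging Face diffusers; the prompt templates, class name, and file names are illustrative assumptions, not the paper's released pipeline.

```python
# Minimal sketch (not the authors' code): generating a structured regularization
# set with an off-the-shelf text-to-image model via Hugging Face diffusers.
# Prompt templates and class name are illustrative placeholders.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

class_name = "dog"  # coarse class of the personalized subject (assumed)
templates = [       # varied text to help preserve prompt coherence during fine-tuning
    "a photo of a {c}",
    "a {c} in a city street",
    "a close-up of a {c} on a wooden table",
    "an oil painting of a {c}",
]

for i, t in enumerate(templates):
    prompt = t.format(c=class_name)
    image = pipe(prompt, num_inference_steps=30).images[0]
    image.save(f"reg_{i:04d}.png")  # stored alongside its prompt as regularization data
```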

Mirror-Aware Neural Humans

Sep 09, 2023
Daniel Ajisafe, James Tang, Shih-Yang Su, Bastian Wandt, Helge Rhodin

Human motion capture either requires multi-camera systems or is unreliable from single-view input due to depth ambiguities. Meanwhile, mirrors are readily available in urban environments and form an affordable alternative by recording two views with only a single camera. However, the mirror setting poses the additional challenge of handling occlusions between the real person and their mirror image. Going beyond existing mirror approaches for 3D human pose estimation, we utilize mirrors for learning a complete body model, including shape and dense appearance. Our main contribution is extending articulated neural radiance fields to include a notion of a mirror, making them sample-efficient over potential occlusion regions. Together, our contributions realize a consumer-level 3D motion capture system that starts from off-the-shelf 2D poses by automatically calibrating the camera, estimating the mirror orientation, and subsequently lifting 2D keypoint detections to a 3D skeleton pose that is used to condition the mirror-aware NeRF. We empirically demonstrate the benefit of learning a body model and accounting for occlusion in challenging mirror scenes.

* Project website: https://danielajisafe.github.io/mirror-aware-neural-humans/ 
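
The geometric core of treating a mirror as a second view can be sketched with a plane reflection; the function below is an illustrative Householder reflection in NumPy, not the paper's implementation.

```python
# Minimal geometric sketch (assumptions, not the paper's implementation):
# a mirror with unit normal n through point p0 reflects a 3D point x,
# which lets a single camera effectively observe two views.
import numpy as np

def reflect_point(x, n, p0):
    """Reflect point x across the plane through p0 with unit normal n."""
    n = n / np.linalg.norm(n)
    d = np.dot(x - p0, n)          # signed distance to the mirror plane
    return x - 2.0 * d * n         # mirrored position

# Example: a keypoint 1 m in front of a mirror lying in the z = 2 plane
mirror_n, mirror_p0 = np.array([0.0, 0.0, 1.0]), np.array([0.0, 0.0, 2.0])
print(reflect_point(np.array([0.3, 1.5, 1.0]), mirror_n, mirror_p0))  # -> z = 3.0
```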

Pose Modulated Avatars from Video

Aug 25, 2023
Chunjin Song, Bastian Wandt, Helge Rhodin

It is now possible to reconstruct dynamic human motion and shape from a sparse set of cameras using Neural Radiance Fields (NeRF) driven by an underlying skeleton. However, a challenge remains to model the deformation of cloth and skin in relation to skeleton pose. Unlike existing avatar models that are learned implicitly or rely on a proxy surface, our approach is motivated by the observation that different poses necessitate unique frequency assignments. Neglecting this distinction yields noisy artifacts in smooth areas or blurs fine-grained texture and shape details in sharp regions. We develop a two-branch neural network that is adaptive and explicit in the frequency domain. The first branch is a graph neural network that models correlations among body parts locally, taking skeleton pose as input. The second branch combines these correlation features to a set of global frequencies and then modulates the feature encoding. Our experiments demonstrate that our network outperforms state-of-the-art methods in terms of preserving details and generalization capabilities.
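
A rough sketch of the frequency-modulation idea follows, with a plain MLP standing in for the paper's graph-neural-network branch; layer sizes and the gating mechanism are assumptions.

```python
# Rough sketch of pose-modulated frequency encoding (assumed simplification:
# an MLP stands in for the paper's graph-neural-network branch).
import torch
import torch.nn as nn

class PoseModulatedEncoding(nn.Module):
    def __init__(self, pose_dim, n_freqs=8):
        super().__init__()
        self.freqs = 2.0 ** torch.arange(n_freqs)   # fixed Fourier frequency bank
        self.gate = nn.Sequential(                  # pose features -> per-frequency gains
            nn.Linear(pose_dim, 64), nn.ReLU(), nn.Linear(64, n_freqs)
        )

    def forward(self, x, pose):
        # x: (B, 3) query points, pose: (B, pose_dim) skeleton features
        gains = torch.sigmoid(self.gate(pose))      # (B, n_freqs), in [0, 1]
        ang = x[..., None] * self.freqs.to(x)       # (B, 3, n_freqs)
        feats = torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1)
        return (feats * gains[:, None, :].repeat(1, 1, 2)).flatten(1)
```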

NPC: Neural Point Characters from Video

Apr 04, 2023
Shih-Yang Su, Timur Bagautdinov, Helge Rhodin

High-fidelity human 3D models can now be learned directly from videos, typically by combining a template-based surface model with neural representations. However, obtaining a template surface requires expensive multi-view capture systems, laser scans, or strictly controlled conditions. Previous methods avoid using a template but rely on a costly or ill-posed mapping from observation to canonical space. We propose a hybrid point-based representation for reconstructing animatable characters that does not require an explicit surface model, while being generalizable to novel poses. For a given video, our method automatically produces an explicit set of 3D points representing approximate canonical geometry, and learns an articulated deformation model that produces pose-dependent point transformations. The points serve both as a scaffold for high-frequency neural features and an anchor for efficiently mapping between observation and canonical space. We demonstrate on established benchmarks that our representation overcomes limitations of prior work operating in either canonical or in observation space. Moreover, our automatic point extraction approach enables learning models of human and animal characters alike, matching the performance of the methods using rigged surface templates despite being more general. Project website: https://lemonatsu.github.io/npc/

* Project website: https://lemonatsu.github.io/npc/ 
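
To make the point-based idea concrete, the sketch below poses an explicit canonical point set with linear blend skinning; in NPC the pose-dependent deformation is learned, whereas here the bone transforms and skinning weights are assumed given.

```python
# Sketch only: posing an explicit canonical point set with linear blend
# skinning; per-bone rigid transforms T and skinning weights W are assumed
# given, standing in for the learned articulated deformation model.
import numpy as np

def pose_points(points_c, W, T):
    """points_c: (N,3) canonical points, W: (N,J) weights, T: (J,4,4) bone transforms."""
    homo = np.concatenate([points_c, np.ones((len(points_c), 1))], axis=1)  # (N,4)
    per_bone = np.einsum('jab,nb->nja', T, homo)   # each point under each bone, (N,J,4)
    posed = np.einsum('nj,nja->na', W, per_bone)   # blend by skinning weights
    return posed[:, :3]
```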

Few-shot Geometry-Aware Keypoint Localization

Mar 30, 2023
Xingzhe He, Gaurav Bharaj, David Ferman, Helge Rhodin, Pablo Garrido

Supervised keypoint localization methods rely on large manually labeled image datasets, where objects can deform, articulate, or occlude. However, creating such large keypoint labels is time-consuming and costly, and is often error-prone due to inconsistent labeling. Thus, we desire an approach that can learn keypoint localization with fewer yet consistently annotated images. To this end, we present a novel formulation that learns to localize semantically consistent keypoint definitions, even for occluded regions, for varying object categories. We use a few user-labeled 2D images as input examples, which are extended via self-supervision using a larger unlabeled dataset. Unlike unsupervised methods, the few-shot images act as semantic shape constraints for object localization. Furthermore, we introduce 3D geometry-aware constraints to uplift keypoints, achieving more accurate 2D localization. Our general-purpose formulation paves the way for semantically conditioned generative modeling and attains competitive or state-of-the-art accuracy on several datasets, including human faces, eyes, animals, cars, and never-before-seen mouth interior (teeth) localization tasks, not attempted by the previous few-shot methods. Project page: https://xingzhehe.github.io/FewShot3DKP/

* Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023  
* CVPR 2023 
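
A hedged sketch of a geometry-aware consistency term is given below: 3D keypoints are reprojected with a weak-perspective camera and compared against 2D detections; the names and exact loss are illustrative, not the paper's formulation.

```python
# Hedged sketch of a geometry-aware consistency term: uplifted 3D keypoints
# are reprojected with a weak-perspective camera; names and the exact loss
# are illustrative, not the paper's formulation.
import torch

def reprojection_loss(kp3d, kp2d, scale, rot, trans):
    """kp3d: (B,K,3), kp2d: (B,K,2), rot: (B,3,3), scale: (B,1,1), trans: (B,1,2)."""
    cam = torch.bmm(kp3d, rot.transpose(1, 2))   # rotate keypoints into the camera frame
    proj = scale * cam[..., :2] + trans          # weak-perspective projection to 2D
    return ((proj - kp2d) ** 2).mean()
```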

Scaling Neural Face Synthesis to High FPS and Low Latency by Neural Caching

Nov 10, 2022
Frank Yu, Sid Fels, Helge Rhodin

Recent neural rendering approaches greatly improve image quality, reaching near photorealism. However, the underlying neural networks have high runtime, precluding telepresence and virtual reality applications that require high resolution at low latency. The sequential dependency of layers in deep networks makes their optimization difficult. We break this dependency by caching information from the previous frame to speed up the processing of the current one with an implicit warp. The warping with a shallow network reduces latency and the caching operations can further be parallelized to improve the frame rate. In contrast to existing temporal neural networks, ours is tailored for the task of rendering novel views of faces by conditioning on the change of the underlying surface mesh. We test the approach on view-dependent rendering of 3D portrait avatars, as needed for telepresence, on established benchmark sequences. Warping reduces latency by 70% (from 49.4 ms to 14.9 ms on commodity GPUs) and scales frame rates accordingly over multiple GPUs while reducing image quality by only 1%, making it suitable as part of end-to-end view-dependent 3D teleconferencing applications. Our project page can be found at: https://yu-frank.github.io/lowlatency/.
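
The caching-and-warping idea can be sketched as follows: the previous frame's feature map is resampled with a predicted flow field via grid_sample; this is an illustrative PyTorch sketch, not the released code.

```python
# Illustrative sketch of caching + implicit warping (not the released code):
# a shallow network would predict a flow field that warps the previous
# frame's cached feature map to the current frame.
import torch
import torch.nn.functional as F

def warp_cached_features(cache, flow):
    """cache: (B,C,H,W) features from frame t-1, flow: (B,2,H,W) offsets in pixels."""
    B, _, H, W = cache.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, device=cache.device, dtype=cache.dtype),
        torch.arange(W, device=cache.device, dtype=cache.dtype),
        indexing="ij",
    )
    base = torch.stack([xs, ys], dim=-1).expand(B, H, W, 2)   # pixel sampling grid
    tgt = base + flow.permute(0, 2, 3, 1)                     # shifted sample locations
    # normalize to [-1, 1] as required by grid_sample
    tgt = 2.0 * tgt / torch.tensor([W - 1, H - 1], device=cache.device, dtype=cache.dtype) - 1.0
    return F.grid_sample(cache, tgt, align_corners=True)
```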

UNeRF: Time and Memory Conscious U-Shaped Network for Training Neural Radiance Fields

Jun 23, 2022
Abiramy Kuganesan, Shih-yang Su, James J. Little, Helge Rhodin

Neural Radiance Fields (NeRFs) increase reconstruction detail for novel view synthesis and scene reconstruction, with applications ranging from large static scenes to dynamic human motion. However, the increased resolution and model-free nature of such neural fields come at the cost of high training times and excessive memory requirements. Recent advances improve the inference time by using complementary data structures, yet these methods are ill-suited for dynamic scenes and often increase memory consumption. Little has been done to reduce the resources required at training time. We propose a method to exploit the redundancy of NeRF's sample-based computations by partially sharing evaluations across neighboring sample points. Our UNeRF architecture is inspired by the UNet, where spatial resolution is reduced in the middle of the network and information is shared between adjacent samples. Although this change violates the strict and conscious separation of view-dependent appearance and view-independent density estimation in the NeRF method, we show that it improves novel view synthesis. We also introduce an alternative subsampling strategy which shares computation while minimizing any violation of view invariance. UNeRF is a plug-in module for the original NeRF network. Our major contributions include reduction of the memory footprint, improved accuracy, and reduced amortized processing time both during training and inference. With only weak assumptions on locality, we achieve improved resource utilization on a variety of neural radiance fields tasks. We demonstrate applications to the novel view synthesis of static scenes as well as dynamic human shape and motion.
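
A conceptual sketch of sharing computation across neighboring ray samples is shown below: per-sample features are pooled, processed at half resolution, and upsampled with a UNet-style skip; the layer sizes and pooling choice are assumptions.

```python
# Conceptual sketch (assumed simplification of the UNeRF idea): features of
# neighboring samples along a ray are pooled, processed at reduced resolution,
# and upsampled back with a skip connection, sharing computation across samples.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RaySampleUNet(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.enc = nn.Linear(dim, dim)
        self.mid = nn.Linear(dim, dim)   # runs on half as many samples
        self.dec = nn.Linear(2 * dim, dim)

    def forward(self, feats):
        # feats: (rays, samples, dim) per-sample features along each ray
        h = torch.relu(self.enc(feats))
        down = F.avg_pool1d(h.transpose(1, 2), kernel_size=2).transpose(1, 2)
        mid = torch.relu(self.mid(down))
        up = F.interpolate(mid.transpose(1, 2), size=feats.shape[1],
                           mode="linear", align_corners=False).transpose(1, 2)
        return self.dec(torch.cat([h, up], dim=-1))   # UNet-style skip connection
```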

AutoLink: Self-supervised Learning of Human Skeletons and Object Outlines by Linking Keypoints

May 21, 2022
Xingzhe He, Bastian Wandt, Helge Rhodin

Structured representations such as keypoints are widely used in pose transfer, conditional image generation, animation, and 3D reconstruction. However, their supervised learning requires expensive annotation for each target domain. We propose a self-supervised method that learns to disentangle object structure from appearance with a graph of 2D keypoints linked by straight edges. Both the keypoint locations and their pairwise edge weights are learned, given only a collection of images depicting the same object class. The graph is interpretable; for example, AutoLink recovers the human skeleton topology when applied to images showing people. Our key ingredients are i) an encoder that predicts keypoint locations in an input image, ii) a shared graph as a latent variable that links the same pairs of keypoints in every image, iii) an intermediate edge map that combines the latent graph edge weights and keypoint locations in a soft, differentiable manner, and iv) an inpainting objective on randomly masked images. Although simpler, AutoLink outperforms existing self-supervised methods on the established keypoint and pose estimation benchmarks and paves the way for structure-conditioned generative models on more diverse datasets.
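
The differentiable edge map can be sketched by measuring each pixel's distance to every keypoint-pair segment and weighting by the learned edge weight; the kernel, resolution, and max-reduction below are assumptions rather than the exact paper formulation.

```python
# Sketch of a soft, differentiable edge map (illustrative; the kernel width,
# resolution, and max-reduction are assumptions, not the exact formulation).
import torch

def soft_edge_map(kp, edge_w, H=64, W=64, sigma=0.02):
    """kp: (K,2) keypoints in [0,1]; edge_w: (K,K) learned edge weights."""
    ys, xs = torch.meshgrid(torch.linspace(0, 1, H), torch.linspace(0, 1, W), indexing="ij")
    pix = torch.stack([xs, ys], dim=-1).reshape(-1, 1, 1, 2)      # (HW,1,1,2)
    a, b = kp[:, None, :], kp[None, :, :]                         # segment endpoints
    ab = b - a                                                    # (K,K,2)
    t = ((pix - a) * ab).sum(-1) / (ab.pow(2).sum(-1) + 1e-8)     # projection onto segment
    t = t.clamp(0, 1)[..., None]
    d2 = (pix - (a + t * ab)).pow(2).sum(-1)                      # squared distance to segment
    heat = torch.exp(-d2 / sigma ** 2) * edge_w                   # weight each edge
    return heat.amax(dim=(1, 2)).reshape(H, W)                    # max over all edges
```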

LatentKeypointGAN: Controlling Images via Latent Keypoints -- Extended Abstract

May 17, 2022
Xingzhe He, Bastian Wandt, Helge Rhodin

Generative adversarial networks (GANs) can now generate photo-realistic images. However, how to best control the image content remains an open challenge. We introduce LatentKeypointGAN, a two-stage GAN that is internally conditioned on a set of keypoints and associated appearance embeddings, providing control over the position and style of the generated objects and their respective parts. A major difficulty that we address is disentangling the image into spatial and appearance factors with little domain knowledge and supervision signals. We demonstrate in a user study and quantitative experiments that LatentKeypointGAN provides an interpretable latent space that can be used to re-arrange the generated images by re-positioning and exchanging keypoint embeddings, such as generating portraits by combining the eyes and mouth from different images. Notably, our method does not require labels as it is self-supervised and thereby applies to diverse application domains, such as editing portraits, indoor rooms, and full-body human poses.

* CVPR Workshop 2022  
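
One way to picture the keypoint-plus-embedding conditioning is to splat per-keypoint appearance embeddings with Gaussian heatmaps into a spatial map; the sketch below is an illustrative mechanism, not necessarily the paper's exact one.

```python
# Hedged sketch: turning keypoints and per-keypoint appearance embeddings into
# a spatial conditioning map for a generator (illustrative mechanism only).
import torch

def keypoint_embedding_map(kp, emb, H=32, W=32, sigma=0.05):
    """kp: (K,2) in [0,1]; emb: (K,C) appearance embeddings -> (C,H,W) map."""
    ys, xs = torch.meshgrid(torch.linspace(0, 1, H), torch.linspace(0, 1, W), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1)                       # (H,W,2) pixel coordinates
    d2 = (grid[None] - kp[:, None, None, :]).pow(2).sum(-1)    # (K,H,W) squared distances
    heat = torch.exp(-d2 / (2 * sigma ** 2))                   # Gaussian heatmap per keypoint
    return torch.einsum('khw,kc->chw', heat, emb)              # splat embeddings spatially
```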