Christian Theobalt

Max Planck Institute for Informatics, Saarland Informatics Campus

MetaCap: Meta-learning Priors from Multi-View Imagery for Sparse-view Human Performance Capture and Rendering

Mar 27, 2024

Guoxing Sun, Rishabh Dabral, Pascal Fua, Christian Theobalt, Marc Habermann

Abstract:Faithful human performance capture and free-view rendering from sparse RGB observations is a long-standing problem in Vision and Graphics. The main challenges are the lack of observations and the inherent ambiguities of the setting, e.g. occlusions and depth ambiguity. As a result, radiance fields, which have shown great promise in capturing high-frequency appearance and geometry details in dense setups, perform poorly when naively supervised on sparse camera views, as the field simply overfits to the sparse-view inputs. To address this, we propose MetaCap, a method for efficient and high-quality geometry recovery and novel view synthesis given very sparse or even a single view of the human. Our key idea is to meta-learn the radiance field weights solely from potentially sparse multi-view videos, which can serve as a prior when fine-tuning them on sparse imagery depicting the human. This prior provides a good network weight initialization, thereby effectively addressing ambiguities in sparse-view capture. Due to the articulated structure of the human body and motion-induced surface deformations, learning such a prior is non-trivial. Therefore, we propose to meta-learn the field weights in a pose-canonicalized space, which reduces the spatial feature range and makes feature learning more effective. Consequently, one can fine-tune our field parameters to quickly generalize to unseen poses, novel illumination conditions as well as novel and sparse (even monocular) camera views. For evaluating our method under different scenarios, we collect a new dataset, WildDynaCap, which contains subjects captured in both a dense camera dome and in-the-wild sparse camera rigs, and demonstrate superior results compared to recent state-of-the-art methods on both public datasets and WildDynaCap.

* Project page: https://vcai.mpi-inf.mpg.de/projects/MetaCap/
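
The MetaCap abstract describes meta-learning network weights so that they act as a good initialization for later fine-tuning on sparse data. The abstract does not say which meta-learning algorithm is used, so the following is only a toy sketch of the general idea, assuming a Reptile-style outer loop and a one-parameter regression problem in place of a radiance field. All names (`finetune`, `meta_learn_init`, `train_slopes`) are hypothetical.

```python
import random

def finetune(w, slope, steps, lr, rng):
    """Fit y = w * x to samples of y = slope * x with plain SGD; return the adapted weight."""
    for _ in range(steps):
        x = rng.uniform(-1.0, 1.0)
        grad = 2.0 * (w * x - slope * x) * x  # derivative of the squared error w.r.t. w
        w -= lr * grad
    return w

def meta_learn_init(task_slopes, meta_steps=2000, inner_steps=5,
                    inner_lr=0.1, meta_lr=0.1, seed=0):
    """Reptile-style outer loop (an assumption, not the paper's stated algorithm):
    move the shared initialization a small step toward each task's adapted weight."""
    rng = random.Random(seed)
    w0 = 0.0
    for _ in range(meta_steps):
        slope = rng.choice(task_slopes)          # sample one training task
        w_task = finetune(w0, slope, inner_steps, inner_lr, rng)
        w0 += meta_lr * (w_task - w0)            # interpolate toward the adapted weight
    return w0

# Toy "training subjects": related tasks whose optimal weights cluster together.
train_slopes = [2.8, 3.0, 3.2]
w_init = meta_learn_init(train_slopes)

# Few-step fine-tuning on an unseen task, from the learned prior and from scratch.
unseen_slope = 3.1
from_prior = finetune(w_init, unseen_slope, 3, 0.1, random.Random(1))
from_zero = finetune(0.0, unseen_slope, 3, 0.1, random.Random(1))
print(abs(from_prior - unseen_slope), abs(from_zero - unseen_slope))
```

With only three fine-tuning steps, the run that starts from the meta-learned weight ends much closer to the unseen task than the run that starts from zero, which is the role the abstract assigns to the learned prior.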

ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis

Mar 26, 2024

Muhammad Hamza Mughal, Rishabh Dabral, Ikhsanul Habibie, Lucia Donatelli, Marc Habermann, Christian Theobalt

Abstract:Gestures play a key role in human communication. Recent methods for co-speech gesture generation, while managing to generate beat-aligned motions, struggle to generate gestures that are semantically aligned with the utterance. Compared to beat gestures that align naturally to the audio signal, semantically coherent gestures require modeling the complex interactions between the language and human motion, and can be controlled by focusing on certain words. Therefore, we present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis, which can not only generate gestures based on multi-modal speech inputs, but can also facilitate controllability in gesture synthesis. Our method proposes two guidance objectives that allow the users to modulate the impact of different conditioning modalities (e.g. audio vs text) as well as to choose certain words to be emphasized during gesturing. Our method is versatile in that it can be trained for generating either monologue gestures or conversational gestures. To further advance the research on multi-party interactive gestures, the DnD Group Gesture dataset is released, which contains 6 hours of gesture data showing 5 people interacting with one another. We compare our method with several recent works and demonstrate the effectiveness of our method on a variety of tasks. We urge the reader to watch our supplementary video at our website.

* CVPR 2024. Project Page: https://vcai.mpi-inf.mpg.de/projects/ConvoFusion/
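
The ConvoFusion abstract mentions guidance that lets users modulate the impact of different conditioning modalities. The abstract does not define its two guidance objectives, so the sketch below is not the paper's formulation. It only illustrates the generic multi-condition form of classifier-free guidance, assuming a denoiser that has already produced an unconditional prediction and one prediction per modality. The function name and the toy vectors are hypothetical.

```python
def guided_noise(eps_uncond, eps_audio, eps_text, w_audio, w_text):
    """Combine noise predictions with one guidance weight per modality.

    Each weight scales how far the result is pushed from the unconditional
    prediction toward the prediction conditioned on that modality.
    """
    return [
        u + w_audio * (a - u) + w_text * (t - u)
        for u, a, t in zip(eps_uncond, eps_audio, eps_text)
    ]

# Toy predictions: audio pulls along the first axis, text along the second.
eps_uncond = [0.0, 0.0]
eps_audio = [1.0, 0.0]
eps_text = [0.0, 1.0]

audio_heavy = guided_noise(eps_uncond, eps_audio, eps_text, w_audio=2.0, w_text=0.5)
no_guidance = guided_noise(eps_uncond, eps_audio, eps_text, w_audio=0.0, w_text=0.0)
print(audio_heavy, no_guidance)
```

Setting both weights to zero recovers the unconditional prediction, and raising one weight relative to the other shifts the output toward that modality.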

Recent Trends in 3D Reconstruction of General Non-Rigid Scenes

Mar 22, 2024

Raza Yunus, Jan Eric Lenssen, Michael Niemeyer, Yiyi Liao, Christian Rupprecht, Christian Theobalt, Gerard Pons-Moll, Jia-Bin Huang, Vladislav Golyanik, Eddy Ilg

Abstract:Reconstructing models of the real world, including 3D geometry, appearance, and motion of real scenes, is essential for computer graphics and computer vision. It enables the synthesizing of photorealistic novel views, useful for the movie industry and AR/VR applications. It also facilitates the content creation necessary in computer games and AR/VR by avoiding laborious manual design processes. Further, such models are fundamental for intelligent computing systems that need to interpret real-world scenes and actions to act and interact safely with the human world. Notably, the world surrounding us is dynamic, and reconstructing models of dynamic, non-rigidly moving scenes is a severely underconstrained and challenging problem. This state-of-the-art report (STAR) offers the reader a comprehensive summary of state-of-the-art techniques with monocular and multi-view inputs such as data from RGB and RGB-D sensors, among others, conveying an understanding of different approaches, their potential applications, and promising further research directions. The report covers 3D reconstruction of general non-rigid scenes and further addresses the techniques for scene decomposition, editing and controlling, and generalizable and generative modeling. More specifically, we first review the common and fundamental concepts necessary to understand and navigate the field and then discuss the state-of-the-art techniques by reviewing recent approaches that use traditional and machine-learning-based neural representations, including a discussion on the newly enabled applications. The STAR is concluded with a discussion of the remaining limitations and open challenges.

* 42 pages, 18 figures, 5 tables; State-of-the-Art Report at EUROGRAPHICS 2024

StyleGaussian: Instant 3D Style Transfer with Gaussian Splatting

Mar 12, 2024

Kunhao Liu, Fangneng Zhan, Muyu Xu, Christian Theobalt, Ling Shao, Shijian Lu

Abstract:We introduce StyleGaussian, a novel 3D style transfer technique that allows instant transfer of any image's style to a 3D scene at 10 frames per second (fps). Leveraging 3D Gaussian Splatting (3DGS), StyleGaussian achieves style transfer without compromising its real-time rendering ability and multi-view consistency. It achieves instant style transfer with three steps: embedding, transfer, and decoding. Initially, 2D VGG scene features are embedded into reconstructed 3D Gaussians. Next, the embedded features are transformed according to a reference style image. Finally, the transformed features are decoded into the stylized RGB. StyleGaussian has two novel designs. The first is an efficient feature rendering strategy that first renders low-dimensional features and then maps them into high-dimensional features while embedding VGG features. It cuts the memory consumption significantly and enables 3DGS to render the high-dimensional memory-intensive features. The second is a K-nearest-neighbor-based 3D CNN. Working as the decoder for the stylized features, it eliminates the 2D CNN operations that compromise strict multi-view consistency. Extensive experiments show that StyleGaussian achieves instant 3D stylization with superior stylization quality while preserving real-time rendering and strict multi-view consistency. Project page: https://kunhao-liu.github.io/StyleGaussian/
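
The transfer step in the StyleGaussian abstract transforms embedded features according to a reference style image. The abstract does not name the transform, so the following sketch assumes adaptive instance normalization (AdaIN), a common choice for feature-space style transfer, applied per channel. Here each channel of `content_feats` would hold one feature value per 3D Gaussian, and each channel of `style_feats` one value per style-image location. All names and values are illustrative.

```python
import math

def channel_stats(values, eps=1e-8):
    """Return the mean and (eps-stabilized) standard deviation of one channel."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var + eps)

def adain(content, style):
    """Re-normalize each content channel to the matching style channel's statistics.

    content, style: lists of channels, each channel a list of floats.
    """
    out = []
    for c_chan, s_chan in zip(content, style):
        c_mean, c_std = channel_stats(c_chan)
        s_mean, s_std = channel_stats(s_chan)
        out.append([(v - c_mean) / c_std * s_std + s_mean for v in c_chan])
    return out

# One channel: four embedded features and four style features.
content_feats = [[0.0, 1.0, 2.0, 3.0]]
style_feats = [[10.0, 10.0, 14.0, 14.0]]
stylized = adain(content_feats, style_feats)
print(channel_stats(stylized[0]))
```

Because the transform only needs per-channel statistics, it can run directly on per-Gaussian features without any 2D image operations, which fits the multi-view consistency goal stated in the abstract.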

Blue noise for diffusion models

Feb 07, 2024

Xingchang Huang, Corentin Salaün, Cristina Vasconcelos, Christian Theobalt, Cengiz Öztireli, Gurprit Singh

Abstract:Most of the existing diffusion models use Gaussian noise for training and sampling across all time steps, which may not optimally account for the frequency contents reconstructed by the denoising network. Despite the diverse applications of correlated noise in computer graphics, its potential for improving the training process has been underexplored. In this paper, we introduce a novel and general class of diffusion models taking correlated noise within and across images into account. More specifically, we propose a time-varying noise model to incorporate correlated noise into the training process, as well as a method for fast generation of correlated noise masks. Our model is built upon deterministic diffusion models and utilizes blue noise to help improve the generation quality compared to using Gaussian white (random) noise only. Further, our framework allows introducing correlation across images within a single mini-batch to improve gradient flow. We perform both qualitative and quantitative evaluations on a variety of datasets using our method, achieving improvements on different tasks over existing deterministic diffusion models in terms of the FID metric.

* 10 pages, 12 figures

3D Human Pose Perception from Egocentric Stereo Videos

Dec 30, 2023

Hiroyasu Akada, Jian Wang, Vladislav Golyanik, Christian Theobalt

Abstract:While head-mounted devices are becoming more compact, they provide egocentric views with significant self-occlusions of the device user. Hence, existing methods often fail to accurately estimate complex 3D poses from egocentric views. In this work, we propose a new transformer-based framework to improve egocentric stereo 3D human pose estimation, which leverages the scene information and temporal context of egocentric stereo videos. Specifically, we utilize 1) depth features from our 3D scene reconstruction module with uniformly sampled windows of egocentric stereo frames, and 2) human joint queries enhanced by temporal features of the video inputs. Our method is able to accurately estimate human poses even in challenging scenarios, such as crouching and sitting. Furthermore, we introduce two new benchmark datasets, i.e., UnrealEgo2 and UnrealEgo-RW (RealWorld). The proposed datasets offer a much larger number of egocentric stereo views with a wider variety of human motions than the existing datasets, allowing comprehensive evaluation of existing and upcoming methods. Our extensive experiments show that the proposed approach significantly outperforms previous methods. We will release UnrealEgo2, UnrealEgo-RW, and trained models on our project page.

MACS: Mass Conditioned 3D Hand and Object Motion Synthesis

Dec 22, 2023

Soshi Shimada, Franziska Mueller, Jan Bednarik, Bardia Doosti, Bernd Bickel, Danhang Tang, Vladislav Golyanik, Jonathan Taylor, Christian Theobalt, Thabo Beeler

Abstract:The physical properties of an object, such as mass, significantly affect how we manipulate it with our hands. Surprisingly, this aspect has so far been neglected in prior work on 3D motion synthesis. To improve the naturalness of the synthesized 3D hand-object motions, this work proposes MACS, the first MAss Conditioned 3D hand and object motion Synthesis approach. Our approach is based on cascaded diffusion models and generates interactions that plausibly adjust based on the object mass and interaction type. MACS also accepts a manually drawn 3D object trajectory as input and synthesizes the natural 3D hand motions conditioned by the object mass. This flexibility enables MACS to be used for various downstream applications, such as generating synthetic training data for ML tasks, fast animation of hands for graphics workflows, and generating character interactions for computer games. We show experimentally that a small-scale dataset is sufficient for MACS to reasonably generalize across interpolated and extrapolated object masses unseen during training. Furthermore, MACS shows moderate generalization to unseen objects, thanks to the mass-conditioned contact labels generated by our surface contact synthesis model ConNet. Our comprehensive user study confirms that the synthesized 3D hand-object interactions are highly plausible and realistic.

3D Pose Estimation of Two Interacting Hands from a Monocular Event Camera

Dec 21, 2023

Christen Millerdurai, Diogo Luvizon, Viktor Rudnev, André Jonas, Jiayi Wang, Christian Theobalt, Vladislav Golyanik

Abstract:3D hand tracking from a monocular video is a very challenging problem due to hand interactions, occlusions, left-right hand ambiguity, and fast motion. Most existing methods rely on RGB inputs, which have severe limitations under low-light conditions and suffer from motion blur. In contrast, event cameras capture local brightness changes instead of full image frames and do not suffer from the described effects. Unfortunately, existing image-based techniques cannot be directly applied to events due to significant differences in the data modalities. In response to these challenges, this paper introduces the first framework for 3D tracking of two fast-moving and interacting hands from a single monocular event camera. Our approach tackles the left-right hand ambiguity with a novel semi-supervised feature-wise attention mechanism and integrates an intersection loss to fix hand collisions. To facilitate advances in this research domain, we release a new synthetic large-scale dataset of two interacting hands, Ev2Hands-S, and a new real benchmark with real event streams and ground-truth 3D annotations, Ev2Hands-R. Our approach outperforms existing methods in terms of the 3D reconstruction accuracy and generalises to real data under severe light conditions.

* International Conference on 3D Vision (3DV) 2024
* 17 pages, 12 figures, 7 tables; project page: https://4dqv.mpi-inf.mpg.de/Ev2Hands/

Relightable Neural Actor with Intrinsic Decomposition and Pose Control

Dec 18, 2023

Diogo Luvizon, Vladislav Golyanik, Adam Kortylewski, Marc Habermann, Christian Theobalt

Abstract:Creating a digital human avatar that is relightable, drivable, and photorealistic is a challenging and important problem in Vision and Graphics. Humans are highly articulated, creating pose-dependent appearance effects like self-shadows and wrinkles, and skin as well as clothing require complex and space-varying BRDF models. While recent human relighting approaches can recover plausible material-light decompositions from multi-view video, they do not generalize to novel poses and still suffer from visual artifacts. To address this, we propose Relightable Neural Actor, the first video-based method for learning a photorealistic neural human model that can be relighted, allows appearance editing, and can be controlled by arbitrary skeletal poses. Importantly, for learning our human avatar, we solely require a multi-view recording of the human under a known, but static lighting condition. To achieve this, we represent the geometry of the actor with a drivable density field that models pose-dependent clothing deformations and provides a mapping between 3D and UV space, where normal, visibility, and materials are encoded. To evaluate our approach in real-world scenarios, we collect a new dataset with four actors recorded under different light conditions, indoors and outdoors, providing the first benchmark of its kind for human relighting, and demonstrating state-of-the-art relighting results for novel human poses.

* Project page: https://people.mpi-inf.mpg.de/~dluvizon/relightable-neural-actor/

Holoported Characters: Real-time Free-viewpoint Rendering of Humans from Sparse RGB Cameras

Dec 12, 2023

Ashwath Shetty, Marc Habermann, Guoxing Sun, Diogo Luvizon, Vladislav Golyanik, Christian Theobalt

Abstract:We present the first approach to render highly realistic free-viewpoint videos of a human actor in general apparel, from sparse multi-view recording to display, in real-time at an unprecedented 4K resolution. At inference, our method only requires four camera views of the moving actor and the respective 3D skeletal pose. It handles actors in wide clothing, and reproduces even fine-scale dynamic detail, e.g. clothing wrinkles, facial expressions, and hand gestures. At training time, our learning-based approach expects dense multi-view video and a rigged static surface scan of the actor. Our method comprises three main stages. Stage 1 is a skeleton-driven neural approach for high-quality capture of the detailed dynamic mesh geometry. Stage 2 is a novel solution to create a view-dependent texture using four test-time camera views as input. Finally, stage 3 comprises a new image-based refinement network rendering the final 4K image given the output from the previous stages. Our approach establishes a new benchmark for real-time rendering resolution and quality using sparse input camera views, unlocking possibilities for immersive telepresence.

* Project page: https://vcai.mpi-inf.mpg.de/projects/holochar/
