Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Marc Habermann

RePerformer: Immersive Human-centric Volumetric Videos from Playback to Photoreal Reperformance

Mar 15, 2025

Yuheng Jiang, Zhehao Shen, Chengcheng Guo, Yu Hong, Zhuo Su, Yingliang Zhang, Marc Habermann, Lan Xu

Figure 1 for RePerformer: Immersive Human-centric Volumetric Videos from Playback to Photoreal Reperformance

Figure 2 for RePerformer: Immersive Human-centric Volumetric Videos from Playback to Photoreal Reperformance

Figure 3 for RePerformer: Immersive Human-centric Volumetric Videos from Playback to Photoreal Reperformance

Figure 4 for RePerformer: Immersive Human-centric Volumetric Videos from Playback to Photoreal Reperformance

Abstract:Human-centric volumetric videos offer immersive free-viewpoint experiences, yet existing methods focus either on replaying general dynamic scenes or animating human avatars, limiting their ability to re-perform general dynamic scenes. In this paper, we present RePerformer, a novel Gaussian-based representation that unifies playback and re-performance for high-fidelity human-centric volumetric videos. Specifically, we hierarchically disentangle the dynamic scenes into motion Gaussians and appearance Gaussians which are associated in the canonical space. We further employ a Morton-based parameterization to efficiently encode the appearance Gaussians into 2D position and attribute maps. For enhanced generalization, we adopt 2D CNNs to map position maps to attribute maps, which can be assembled into appearance Gaussians for high-fidelity rendering of the dynamic scenes. For re-performance, we develop a semantic-aware alignment module and apply deformation transfer on motion Gaussians, enabling photo-real rendering under novel motions. Extensive experiments validate the robustness and effectiveness of RePerformer, setting a new benchmark for playback-then-reperformance paradigm in human-centric volumetric videos.

* Accepted by CVPR 2025. Project Page: https://moqiyinlun.github.io/Reperformer/

Via

Access Paper or Ask Questions

Real-time Free-view Human Rendering from Sparse-view RGB Videos using Double Unprojected Textures

Dec 17, 2024

Guoxing Sun, Rishabh Dabral, Heming Zhu, Pascal Fua, Christian Theobalt, Marc Habermann

Figure 1 for Real-time Free-view Human Rendering from Sparse-view RGB Videos using Double Unprojected Textures

Figure 2 for Real-time Free-view Human Rendering from Sparse-view RGB Videos using Double Unprojected Textures

Figure 3 for Real-time Free-view Human Rendering from Sparse-view RGB Videos using Double Unprojected Textures

Figure 4 for Real-time Free-view Human Rendering from Sparse-view RGB Videos using Double Unprojected Textures

Abstract:Real-time free-view human rendering from sparse-view RGB inputs is a challenging task due to the sensor scarcity and the tight time budget. To ensure efficiency, recent methods leverage 2D CNNs operating in texture space to learn rendering primitives. However, they either jointly learn geometry and appearance, or completely ignore sparse image information for geometry estimation, significantly harming visual quality and robustness to unseen body poses. To address these issues, we present Double Unprojected Textures, which at the core disentangles coarse geometric deformation estimation from appearance synthesis, enabling robust and photorealistic 4K rendering in real-time. Specifically, we first introduce a novel image-conditioned template deformation network, which estimates the coarse deformation of the human template from a first unprojected texture. This updated geometry is then used to apply a second and more accurate texture unprojection. The resulting texture map has fewer artifacts and better alignment with input views, which benefits our learning of finer-level geometry and appearance represented by Gaussian splats. We validate the effectiveness and efficiency of the proposed method in quantitative and qualitative experiments, which significantly surpasses other state-of-the-art methods.

* Project page: https://vcai.mpi-inf.mpg.de/projects/DUT/

Via

Access Paper or Ask Questions

BimArt: A Unified Approach for the Synthesis of 3D Bimanual Interaction with Articulated Objects

Dec 06, 2024

Wanyue Zhang, Rishabh Dabral, Vladislav Golyanik, Vasileios Choutas, Eduardo Alvarado, Thabo Beeler, Marc Habermann, Christian Theobalt

Figure 1 for BimArt: A Unified Approach for the Synthesis of 3D Bimanual Interaction with Articulated Objects

Figure 2 for BimArt: A Unified Approach for the Synthesis of 3D Bimanual Interaction with Articulated Objects

Figure 3 for BimArt: A Unified Approach for the Synthesis of 3D Bimanual Interaction with Articulated Objects

Figure 4 for BimArt: A Unified Approach for the Synthesis of 3D Bimanual Interaction with Articulated Objects

Abstract:We present BimArt, a novel generative approach for synthesizing 3D bimanual hand interactions with articulated objects. Unlike prior works, we do not rely on a reference grasp, a coarse hand trajectory, or separate modes for grasping and articulating. To achieve this, we first generate distance-based contact maps conditioned on the object trajectory with an articulation-aware feature representation, revealing rich bimanual patterns for manipulation. The learned contact prior is then used to guide our hand motion generator, producing diverse and realistic bimanual motions for object movement and articulation. Our work offers key insights into feature representation and contact prior for articulated objects, demonstrating their effectiveness in taming the complex, high-dimensional space of bimanual hand-object interactions. Through comprehensive quantitative experiments, we demonstrate a clear step towards simplified and high-quality hand-object animations that excel over the state-of-the-art in motion quality and diversity.

Via

Access Paper or Ask Questions

EasyHOI: Unleashing the Power of Large Models for Reconstructing Hand-Object Interactions in the Wild

Nov 21, 2024

Yumeng Liu, Xiaoxiao Long, Zemin Yang, Yuan Liu, Marc Habermann, Christian Theobalt, Yuexin Ma, Wenping Wang

Figure 1 for EasyHOI: Unleashing the Power of Large Models for Reconstructing Hand-Object Interactions in the Wild

Figure 2 for EasyHOI: Unleashing the Power of Large Models for Reconstructing Hand-Object Interactions in the Wild

Figure 3 for EasyHOI: Unleashing the Power of Large Models for Reconstructing Hand-Object Interactions in the Wild

Figure 4 for EasyHOI: Unleashing the Power of Large Models for Reconstructing Hand-Object Interactions in the Wild

Abstract:Our work aims to reconstruct hand-object interactions from a single-view image, which is a fundamental but ill-posed task. Unlike methods that reconstruct from videos, multi-view images, or predefined 3D templates, single-view reconstruction faces significant challenges due to inherent ambiguities and occlusions. These challenges are further amplified by the diverse nature of hand poses and the vast variety of object shapes and sizes. Our key insight is that current foundational models for segmentation, inpainting, and 3D reconstruction robustly generalize to in-the-wild images, which could provide strong visual and geometric priors for reconstructing hand-object interactions. Specifically, given a single image, we first design a novel pipeline to estimate the underlying hand pose and object shape using off-the-shelf large models. Furthermore, with the initial reconstruction, we employ a prior-guided optimization scheme, which optimizes hand pose to comply with 3D physical constraints and the 2D input image content. We perform experiments across several datasets and show that our method consistently outperforms baselines and faithfully reconstructs a diverse set of hand-object interactions. Here is the link of our project page: https://lym29.github.io/EasyHOI-page/

* Project page: https://lym29.github.io/EasyHOI-page/

Via

Access Paper or Ask Questions

PocoLoco: A Point Cloud Diffusion Model of Human Shape in Loose Clothing

Nov 06, 2024

Siddharth Seth, Rishabh Dabral, Diogo Luvizon, Marc Habermann, Ming-Hsuan Yang, Christian Theobalt, Adam Kortylewski

Figure 1 for PocoLoco: A Point Cloud Diffusion Model of Human Shape in Loose Clothing

Figure 2 for PocoLoco: A Point Cloud Diffusion Model of Human Shape in Loose Clothing

Figure 3 for PocoLoco: A Point Cloud Diffusion Model of Human Shape in Loose Clothing

Figure 4 for PocoLoco: A Point Cloud Diffusion Model of Human Shape in Loose Clothing

Abstract:Modeling a human avatar that can plausibly deform to articulations is an active area of research. We present PocoLoco -- the first template-free, point-based, pose-conditioned generative model for 3D humans in loose clothing. We motivate our work by noting that most methods require a parametric model of the human body to ground pose-dependent deformations. Consequently, they are restricted to modeling clothing that is topologically similar to the naked body and do not extend well to loose clothing. The few methods that attempt to model loose clothing typically require either canonicalization or a UV-parameterization and need to address the challenging problem of explicitly estimating correspondences for the deforming clothes. In this work, we formulate avatar clothing deformation as a conditional point-cloud generation task within the denoising diffusion framework. Crucially, our framework operates directly on unordered point clouds, eliminating the need for a parametric model or a clothing template. This also enables a variety of practical applications, such as point-cloud completion and pose-based editing -- important features for virtual human animation. As current datasets for human avatars in loose clothing are far too small for training diffusion models, we release a dataset of two subjects performing various poses in loose clothing with a total of 75K point clouds. By contributing towards tackling the challenging task of effectively modeling loose clothing and expanding the available data for training these models, we aim to set the stage for further innovation in digital humans. The source code is available at https://github.com/sidsunny/pocoloco .

* WACV 2025

Via

Access Paper or Ask Questions

GaussianHeads: End-to-End Learning of Drivable Gaussian Head Avatars from Coarse-to-fine Representations

Sep 18, 2024

Kartik Teotia, Hyeongwoo Kim, Pablo Garrido, Marc Habermann, Mohamed Elgharib, Christian Theobalt

Abstract:Real-time rendering of human head avatars is a cornerstone of many computer graphics applications, such as augmented reality, video games, and films, to name a few. Recent approaches address this challenge with computationally efficient geometry primitives in a carefully calibrated multi-view setup. Albeit producing photorealistic head renderings, it often fails to represent complex motion changes such as the mouth interior and strongly varying head poses. We propose a new method to generate highly dynamic and deformable human head avatars from multi-view imagery in real-time. At the core of our method is a hierarchical representation of head models that allows to capture the complex dynamics of facial expressions and head movements. First, with rich facial features extracted from raw input frames, we learn to deform the coarse facial geometry of the template mesh. We then initialize 3D Gaussians on the deformed surface and refine their positions in a fine step. We train this coarse-to-fine facial avatar model along with the head pose as a learnable parameter in an end-to-end framework. This enables not only controllable facial animation via video inputs, but also high-fidelity novel view synthesis of challenging facial expressions, such as tongue deformations and fine-grained teeth structure under large motion changes. Moreover, it encourages the learned head avatar to generalize towards new facial expressions and head poses at inference time. We demonstrate the performance of our method with comparisons against the related methods on different datasets, spanning challenging facial expression sequences across multiple identities. We also show the potential application of our approach by demonstrating a cross-identity facial performance transfer application.

* ACM Transaction on Graphics (SIGGRAPH Asia 2024); Project page: https://vcai.mpi-inf.mpg.de/projects/GaussianHeads/

Via

Access Paper or Ask Questions

A Latent Implicit 3D Shape Model for Multiple Levels of Detail

Sep 10, 2024

Benoit Guillard, Marc Habermann, Christian Theobalt, Pascal Fua

Figure 1 for A Latent Implicit 3D Shape Model for Multiple Levels of Detail

Figure 2 for A Latent Implicit 3D Shape Model for Multiple Levels of Detail

Figure 3 for A Latent Implicit 3D Shape Model for Multiple Levels of Detail

Figure 4 for A Latent Implicit 3D Shape Model for Multiple Levels of Detail

Abstract:Implicit neural representations map a shape-specific latent code and a 3D coordinate to its corresponding signed distance (SDF) value. However, this approach only offers a single level of detail. Emulating low levels of detail can be achieved with shallow networks, but the generated shapes are typically not smooth. Alternatively, some network designs offer multiple levels of detail, but are limited to overfitting a single object. To address this, we propose a new shape modeling approach, which enables multiple levels of detail and guarantees a smooth surface at each level. At the core, we introduce a novel latent conditioning for a multiscale and bandwith-limited neural architecture. This results in a deep parameterization of multiple shapes, where early layers quickly output approximated SDF values. This allows to balance speed and accuracy within a single network and enhance the efficiency of implicit scene rendering. We demonstrate that by limiting the bandwidth of the network, we can maintain smooth surfaces across all levels of detail. At finer levels, reconstruction quality is on par with the state of the art models, which are limited to a single level of detail.

* Published in GCPR 2024 proceedings

Via

Access Paper or Ask Questions

TEDRA: Text-based Editing of Dynamic and Photoreal Actors

Aug 28, 2024

Basavaraj Sunagad, Heming Zhu, Mohit Mendiratta, Adam Kortylewski, Christian Theobalt, Marc Habermann

Figure 1 for TEDRA: Text-based Editing of Dynamic and Photoreal Actors

Figure 2 for TEDRA: Text-based Editing of Dynamic and Photoreal Actors

Figure 3 for TEDRA: Text-based Editing of Dynamic and Photoreal Actors

Figure 4 for TEDRA: Text-based Editing of Dynamic and Photoreal Actors

Abstract:Over the past years, significant progress has been made in creating photorealistic and drivable 3D avatars solely from videos of real humans. However, a core remaining challenge is the fine-grained and user-friendly editing of clothing styles by means of textual descriptions. To this end, we present TEDRA, the first method allowing text-based edits of an avatar, which maintains the avatar's high fidelity, space-time coherency, as well as dynamics, and enables skeletal pose and view control. We begin by training a model to create a controllable and high-fidelity digital replica of the real actor. Next, we personalize a pretrained generative diffusion model by fine-tuning it on various frames of the real character captured from different camera angles, ensuring the digital representation faithfully captures the dynamics and movements of the real person. This two-stage process lays the foundation for our approach to dynamic human avatar editing. Utilizing this personalized diffusion model, we modify the dynamic avatar based on a provided text prompt using our Personalized Normal Aligned Score Distillation Sampling (PNA-SDS) within a model-based guidance framework. Additionally, we propose a time step annealing strategy to ensure high-quality edits. Our results demonstrate a clear improvement over prior work in functionality and visual quality.

* For project page, see this https://vcai.mpi-inf.mpg.de/projects/Tedra

Via

Access Paper or Ask Questions

MetaCap: Meta-learning Priors from Multi-View Imagery for Sparse-view Human Performance Capture and Rendering

Mar 27, 2024

Guoxing Sun, Rishabh Dabral, Pascal Fua, Christian Theobalt, Marc Habermann

Figure 1 for MetaCap: Meta-learning Priors from Multi-View Imagery for Sparse-view Human Performance Capture and Rendering

Figure 2 for MetaCap: Meta-learning Priors from Multi-View Imagery for Sparse-view Human Performance Capture and Rendering

Figure 3 for MetaCap: Meta-learning Priors from Multi-View Imagery for Sparse-view Human Performance Capture and Rendering

Figure 4 for MetaCap: Meta-learning Priors from Multi-View Imagery for Sparse-view Human Performance Capture and Rendering

Abstract:Faithful human performance capture and free-view rendering from sparse RGB observations is a long-standing problem in Vision and Graphics. The main challenges are the lack of observations and the inherent ambiguities of the setting, e.g. occlusions and depth ambiguity. As a result, radiance fields, which have shown great promise in capturing high-frequency appearance and geometry details in dense setups, perform poorly when na\"ively supervising them on sparse camera views, as the field simply overfits to the sparse-view inputs. To address this, we propose MetaCap, a method for efficient and high-quality geometry recovery and novel view synthesis given very sparse or even a single view of the human. Our key idea is to meta-learn the radiance field weights solely from potentially sparse multi-view videos, which can serve as a prior when fine-tuning them on sparse imagery depicting the human. This prior provides a good network weight initialization, thereby effectively addressing ambiguities in sparse-view capture. Due to the articulated structure of the human body and motion-induced surface deformations, learning such a prior is non-trivial. Therefore, we propose to meta-learn the field weights in a pose-canonicalized space, which reduces the spatial feature range and makes feature learning more effective. Consequently, one can fine-tune our field parameters to quickly generalize to unseen poses, novel illumination conditions as well as novel and sparse (even monocular) camera views. For evaluating our method under different scenarios, we collect a new dataset, WildDynaCap, which contains subjects captured in, both, a dense camera dome and in-the-wild sparse camera rigs, and demonstrate superior results compared to recent state-of-the-art methods on both public and WildDynaCap dataset.

* Project page: https://vcai.mpi-inf.mpg.de/projects/MetaCap/

Via

Access Paper or Ask Questions

ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis

Mar 26, 2024

Muhammad Hamza Mughal, Rishabh Dabral, Ikhsanul Habibie, Lucia Donatelli, Marc Habermann, Christian Theobalt

Abstract:Gestures play a key role in human communication. Recent methods for co-speech gesture generation, while managing to generate beat-aligned motions, struggle generating gestures that are semantically aligned with the utterance. Compared to beat gestures that align naturally to the audio signal, semantically coherent gestures require modeling the complex interactions between the language and human motion, and can be controlled by focusing on certain words. Therefore, we present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis, which can not only generate gestures based on multi-modal speech inputs, but can also facilitate controllability in gesture synthesis. Our method proposes two guidance objectives that allow the users to modulate the impact of different conditioning modalities (e.g. audio vs text) as well as to choose certain words to be emphasized during gesturing. Our method is versatile in that it can be trained either for generating monologue gestures or even the conversational gestures. To further advance the research on multi-party interactive gestures, the DnD Group Gesture dataset is released, which contains 6 hours of gesture data showing 5 people interacting with one another. We compare our method with several recent works and demonstrate effectiveness of our method on a variety of tasks. We urge the reader to watch our supplementary video at our website.

* CVPR 2024. Project Page: https://vcai.mpi-inf.mpg.de/projects/ConvoFusion/

Via

Access Paper or Ask Questions