Youngjung Uh

Hand-4DGS: Feed-Forward 3D Gaussian Splatting for 4D Hand Reconstruction from Egocentric Videos

Jun 17, 2026

Jeongmin Bae, Seoha Kim, Marc Pollefeys, Mahdi Rad, Youngjung Uh, Taein Kwon

Abstract:Dynamic 3D hand reconstruction from egocentric videos is essential for next-generation computing platforms such as AR/VR and AI glasses. Despite its importance, most prior works focus either on multi-view 3D hand reconstruction or on 4D human body reconstruction. Egocentric 4D hand reconstruction remains challenging due to fast head motion, rapid hand dynamics, severe occlusions, and inherent ambiguity from single-view observations. To address these challenges, we introduce Hand-4DGS, the first feed-forward framework for reconstructing dynamic 4D hands directly from egocentric videos, enabling both fast (~60 FPS) inference and strong generalization. Our approach incorporates a mesh-guided representation for structural priors and temporal convolutions to model dynamic motion. We evaluate our framework on two challenging egocentric datasets, H2O and ARCTIC, and demonstrate significant improvements over baselines. Our method benefits from the generalization capability of feed-forward networks and effective 2D image supervision through Gaussian splatting, without requiring expensive 3D hand pose ground-truth annotations.

* Project page: https://jeongminb.github.io/hand-4dgs/

FlowBlending: Stage-Aware Multi-Model Sampling for Fast and High-Fidelity Video Generation

Dec 31, 2025

Jibin Song, Mingi Kwon, Jaeseok Jeong, Youngjung Uh

Abstract:In this work, we show that the impact of model capacity varies across timesteps: it is crucial for the early and late stages but largely negligible during the intermediate stage. Accordingly, we propose FlowBlending, a stage-aware multi-model sampling strategy that employs a large model and a small model at capacity-sensitive stages and intermediate stages, respectively. We further introduce simple criteria to choose stage boundaries and provide a velocity-divergence analysis as an effective proxy for identifying capacity-sensitive regions. Across LTX-Video (2B/13B) and WAN 2.1 (1.3B/14B), FlowBlending achieves up to 1.65x faster inference with 57.35% fewer FLOPs, while maintaining the visual fidelity, temporal coherence, and semantic alignment of the large models. FlowBlending is also compatible with existing sampling-acceleration techniques, enabling up to 2x additional speedup. Project page is available at: https://jibin86.github.io/flowblending_project_page.

* Project page: https://jibin86.github.io/flowblending_project_page
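The stage-aware schedule described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a plain Euler sampler over a time grid from 0 to 1, scalar toy "models", and hypothetical names (`stage_aware_sample`, `early_steps`, `late_steps`). The abstract's criteria for choosing stage boundaries are not reproduced here; the boundaries are simply passed in.

```python
from typing import Callable, List, Tuple

# A velocity model maps (state, time) to a velocity. Real models operate on
# video latents; scalars are used here only to keep the sketch runnable.
Velocity = Callable[[float, float], float]


def stage_aware_sample(
    x0: float,
    large: Velocity,
    small: Velocity,
    num_steps: int,
    early_steps: int,
    late_steps: int,
) -> Tuple[float, List[str]]:
    """Euler sampling that uses `large` on the first `early_steps` and the
    last `late_steps` steps, and `small` on the steps in between.

    All parameter names are illustrative assumptions.
    """
    x = x0
    dt = 1.0 / num_steps
    used: List[str] = []
    for i in range(num_steps):
        t = i * dt
        # Capacity-sensitive stages (early and late) get the large model.
        use_large = i < early_steps or i >= num_steps - late_steps
        model = large if use_large else small
        used.append("large" if use_large else "small")
        x = x + dt * model(x, t)
    return x, used
```

With 10 steps, `early_steps=3`, and `late_steps=2`, the large model would run on steps 0 to 2 and 8 to 9, and the small model on the five steps in between.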

ASemConsist: Adaptive Semantic Feature Control for Training-Free Identity-Consistent Generation

Dec 29, 2025

Shin seong Kim, Minjung Shin, Hyunin Cho, Youngjung Uh

Abstract:Recent text-to-image diffusion models have significantly improved visual quality and text alignment. However, generating a sequence of images while preserving consistent character identity across diverse scene descriptions remains a challenging task. Existing methods often struggle with a trade-off between maintaining identity consistency and ensuring per-image prompt alignment. In this paper, we introduce a novel framework, ASemConsist, that addresses this challenge through selective text embedding modification, enabling explicit semantic control over character identity without sacrificing prompt alignment. Furthermore, based on our analysis of padding embeddings in FLUX, we propose a semantic control strategy that repurposes padding embeddings as semantic containers. Additionally, we introduce an adaptive feature-sharing strategy that automatically evaluates textual ambiguity and applies constraints only to the ambiguous identity prompt. Finally, we propose a unified evaluation protocol, the Consistency Quality Score (CQS), which integrates identity preservation and per-image text alignment into a single comprehensive metric, explicitly capturing performance imbalances between the two metrics. Our framework achieves state-of-the-art performance, effectively overcoming prior trade-offs. Project page: https://minjung-s.github.io/asemconsist

Geometric Disentanglement of Text Embeddings for Subject-Consistent Text-to-Image Generation using A Single Prompt

Dec 18, 2025

Shangxun Li, Youngjung Uh

Abstract:Text-to-image diffusion models excel at generating high-quality images from natural language descriptions but often fail to preserve subject consistency across multiple outputs, limiting their use in visual storytelling. Existing approaches rely on model fine-tuning or image conditioning, which are computationally expensive and require per-subject optimization. 1Prompt1Story, a training-free approach, concatenates all scene descriptions into a single prompt and rescales token embeddings, but it suffers from semantic leakage, where embeddings across frames become entangled, causing text misalignment. In this paper, we propose a simple yet effective training-free approach that addresses semantic entanglement from a geometric perspective by refining text embeddings to suppress unwanted semantics. Extensive experiments show that our approach significantly improves both subject consistency and text alignment over existing baselines.

Compensating Spatiotemporally Inconsistent Observations for Online Dynamic 3D Gaussian Splatting

May 02, 2025

Youngsik Yun, Jeongmin Bae, Hyunseung Son, Seoha Kim, Hahyun Lee, Gun Bang, Youngjung Uh

Abstract:Online reconstruction of dynamic scenes is significant as it enables learning scenes from live-streaming video inputs, while existing offline dynamic reconstruction methods rely on recorded video inputs. However, previous online reconstruction approaches have primarily focused on efficiency and rendering quality, overlooking the temporal consistency of their results, which often contain noticeable artifacts in static regions. This paper identifies that errors such as noise in real-world recordings lead to temporal inconsistency in online reconstruction. We propose a method that enhances temporal consistency in online reconstruction from observations with temporal inconsistency, which is inevitable in cameras. We show that our method restores the ideal observation by subtracting the learned error. We demonstrate that applying our method to various baselines significantly enhances both temporal consistency and rendering quality across datasets. Code, video results, and checkpoints are available at https://bbangsik13.github.io/OR2.

* SIGGRAPH 2025, Project page: https://bbangsik13.github.io/OR2

4D Scaffold Gaussian Splatting for Memory Efficient Dynamic Scene Reconstruction

Nov 26, 2024

Woong Oh Cho, In Cho, Seoha Kim, Jeongmin Bae, Youngjung Uh, Seon Joo Kim

Abstract:Existing 4D Gaussian methods for dynamic scene reconstruction offer high visual fidelity and fast rendering. However, these methods suffer from excessive memory and storage demands, which limits their practical deployment. This paper proposes a 4D anchor-based framework that retains the visual quality and rendering speed of 4D Gaussians while significantly reducing storage costs. Our method extends 3D scaffolding to 4D space and leverages sparse 4D grid-aligned anchors with compressed feature vectors. Each anchor models a set of neural 4D Gaussians, each of which represents a local spatiotemporal region. In addition, we introduce a temporal coverage-aware anchor growing strategy to effectively assign additional anchors to under-reconstructed dynamic regions. Our method adjusts the accumulated gradients based on Gaussians' temporal coverage, improving reconstruction quality in dynamic regions. To reduce the number of anchors, we further present enhanced formulations of neural 4D Gaussians. These include the neural velocity and the temporal opacity derived from a generalized Gaussian distribution. Experimental results demonstrate that our method achieves state-of-the-art visual quality and 97.8% storage reduction over 4DGS.
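
For intuition about a temporal opacity derived from a generalized Gaussian, the sketch below evaluates the standard unnormalized generalized Gaussian kernel, exp(-(|t - center| / scale)^shape). The abstract does not give the paper's exact parameterization, so the function name and the parameters `center`, `scale`, and `shape` are assumptions made purely for illustration.

```python
import math


def temporal_opacity(t: float, center: float, scale: float, shape: float) -> float:
    """Unnormalized generalized Gaussian kernel over time.

    Hypothetical parameterization: `center` is the time of peak opacity,
    `scale` controls the temporal extent, and `shape` controls how flat the
    top is (shape=2 gives an ordinary Gaussian-style falloff).
    """
    if scale <= 0 or shape <= 0:
        raise ValueError("scale and shape must be positive")
    return math.exp(-((abs(t - center) / scale) ** shape))
```

With `shape=2` the falloff is the familiar Gaussian bell; larger `shape` values flatten the top and sharpen the edges, so the kernel stays close to 1 over a longer interval around `center` before dropping off.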

HARIVO: Harnessing Text-to-Image Models for Video Generation

Oct 10, 2024

Mingi Kwon, Seoung Wug Oh, Yang Zhou, Difan Liu, Joon-Young Lee, Haoran Cai, Baqiao Liu, Feng Liu, Youngjung Uh

Abstract:We present a method to create diffusion-based video models from pretrained Text-to-Image (T2I) models. Recently, AnimateDiff proposed freezing the T2I model while only training temporal layers. We advance this method by proposing a unique architecture, incorporating a mapping network and frame-wise tokens, tailored for video generation while maintaining the diversity and creativity of the original T2I model. Key innovations include novel loss functions for temporal smoothness and a mitigating gradient sampling technique, ensuring realistic and temporally consistent video generation despite limited public video data. We have successfully integrated video-specific inductive biases into the architecture and loss functions. Our method, built on the frozen StableDiffusion model, simplifies training processes and allows for seamless integration with off-the-shelf models like ControlNet and DreamBooth. Project page: https://kwonminki.github.io/HARIVO

* ECCV 2024

FLoD: Integrating Flexible Level of Detail into 3D Gaussian Splatting for Customizable Rendering

Aug 23, 2024

Yunji Seo, Young Sun Choi, Hyun Seung Son, Youngjung Uh

Abstract:3D Gaussian Splatting (3DGS) achieves fast and high-quality renderings by using numerous small Gaussians, which leads to significant memory consumption. This reliance on a large number of Gaussians restricts the application of 3DGS-based models on low-cost devices due to memory limitations. However, simply reducing the number of Gaussians to accommodate devices with less memory capacity leads to inferior quality compared to the quality that can be achieved on high-end hardware. To address this lack of scalability, we propose integrating a Flexible Level of Detail (FLoD) to 3DGS, to allow a scene to be rendered at varying levels of detail according to hardware capabilities. While existing 3DGSs with LoD focus on detailed reconstruction, our method provides reconstructions using a small number of Gaussians for reduced memory requirements, and a larger number of Gaussians for greater detail. Experiments demonstrate our various rendering options with tradeoffs between rendering quality and memory usage, thereby allowing real-time rendering across different memory constraints. Furthermore, we show that our method generalizes to different 3DGS frameworks, indicating its potential for integration into future state-of-the-art developments. Project page: https://3dgs-flod.github.io/flod.github.io/

* Project page: https://3dgs-flod.github.io/flod.github.io/
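
One way to picture rendering "according to hardware capabilities" is a selector that picks the most detailed level that fits a memory budget. The sketch below is an illustration under stated assumptions rather than the FLoD implementation: `pick_level`, the per-level Gaussian counts, and the bytes-per-Gaussian figure are all hypothetical, and levels are assumed to be ordered from coarsest to finest.

```python
from typing import List, Optional


def pick_level(
    level_gaussian_counts: List[int],
    bytes_per_gaussian: int,
    budget_bytes: int,
) -> Optional[int]:
    """Return the index of the most detailed level whose Gaussians fit in
    `budget_bytes`, or None if even the coarsest level does not fit.

    Assumes `level_gaussian_counts` is sorted from coarsest (fewest
    Gaussians) to finest (most Gaussians).
    """
    chosen: Optional[int] = None
    for level, count in enumerate(level_gaussian_counts):
        if count * bytes_per_gaussian <= budget_bytes:
            chosen = level  # keep upgrading while the level still fits
    return chosen
```

For example, with hypothetical counts of 10,000, 100,000, and 1,000,000 Gaussians at 200 bytes each, a 50 MB budget would select the middle level, since the finest level would need about 200 MB.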

Rethinking Open-Vocabulary Segmentation of Radiance Fields in 3D Space

Aug 14, 2024

Hyunjee Lee, Youngsik Yun, Jeongmin Bae, Seoha Kim, Youngjung Uh

Abstract:Understanding the 3D semantics of a scene is a fundamental problem for various scenarios such as embodied agents. While NeRFs and 3DGS excel at novel-view synthesis, previous methods for understanding their semantics have been limited to incomplete 3D understanding: their segmentation results are 2D masks and their supervision is anchored at 2D pixels. This paper revisits the problem set to pursue a better 3D understanding of a scene modeled by NeRFs and 3DGS as follows. 1) We directly supervise the 3D points to train the language embedding field. It achieves state-of-the-art accuracy without relying on multi-scale language embeddings. 2) We transfer the pre-trained language field to 3DGS, achieving the first real-time rendering speed without sacrificing training time or accuracy. 3) We introduce a 3D querying and evaluation protocol for assessing the reconstructed geometry and semantics together. Code, checkpoints, and annotations will be available online. Project page: https://hyunji12.github.io/Open3DRF

* Project page: https://hyunji12.github.io/Open3DRF

Eye-for-an-eye: Appearance Transfer with Semantic Correspondence in Diffusion Models

Jun 11, 2024

Sooyeon Go, Kyungmook Choi, Minjung Shin, Youngjung Uh

Abstract:As pretrained text-to-image diffusion models have become a useful tool for image synthesis, people want to specify the results in various ways. In this paper, we introduce a method to produce results with the same structure as a target image but painted with colors from a reference image, i.e., appearance transfer, especially following the semantic correspondence between the result and the reference. For example, the wing in the result takes its color from the reference wing, not the reference head. Existing methods rely on the query-key similarity within the self-attention layer, usually producing defective results. To this end, we propose to find semantic correspondences and explicitly rearrange the features according to the semantic correspondences. Extensive experiments show the superiority of our method in various aspects: preserving the structure of the target and reflecting the color from the reference according to the semantic correspondences, even when the two images are not aligned.

* Project page: https://sooyeon-go.github.io/eye_for_an_eye/
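
The idea of finding semantic correspondences and rearranging features accordingly can be illustrated with a toy nearest-neighbor match. The abstract does not say how correspondences are computed, so the cosine-similarity matching below is an assumption, and the names `cosine` and `rearrange_by_correspondence` are hypothetical. Each target location is matched to its most similar reference location, and the reference appearance feature at that location is copied over.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rearrange_by_correspondence(
    target_semantic: List[Sequence[float]],
    reference_semantic: List[Sequence[float]],
    reference_appearance: List[Sequence[float]],
) -> List[Sequence[float]]:
    """For each target location, copy the appearance feature from the
    reference location with the most similar semantic feature."""
    rearranged = []
    for query in target_semantic:
        best = max(
            range(len(reference_semantic)),
            key=lambda j: cosine(query, reference_semantic[j]),
        )
        rearranged.append(reference_appearance[best])
    return rearranged
```

Because matching depends only on semantic similarity and not on position, this kind of rearrangement does not require the two images to be spatially aligned.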
