Lingjie Liu

Zero-1-to-G: Taming Pretrained 2D Diffusion Model for Direct 3D Generation

Jan 09, 2025

Xuyi Meng, Chen Wang, Jiahui Lei, Kostas Daniilidis, Jiatao Gu, Lingjie Liu

Abstract: Recent advances in 2D image generation have achieved remarkable quality, largely driven by the capacity of diffusion models and the availability of large-scale datasets. However, direct 3D generation is still constrained by the scarcity and lower fidelity of 3D datasets. In this paper, we introduce Zero-1-to-G, a novel approach that addresses this problem by enabling direct single-view generation on Gaussian splats using pretrained 2D diffusion models. Our key insight is that Gaussian splats, a 3D representation, can be decomposed into multi-view images encoding different attributes. This reframes the challenging task of direct 3D generation within a 2D diffusion framework, allowing us to leverage the rich priors of pretrained 2D diffusion models. To incorporate 3D awareness, we introduce cross-view and cross-attribute attention layers, which capture complex correlations and enforce 3D consistency across generated splats. This makes Zero-1-to-G the first direct image-to-3D generative model to effectively utilize pretrained 2D diffusion priors, enabling efficient training and improved generalization to unseen objects. Extensive experiments on both synthetic and in-the-wild datasets demonstrate superior performance in 3D object generation, offering a new approach to high-quality 3D generation.

ProTracker: Probabilistic Integration for Robust and Accurate Point Tracking

Jan 06, 2025

Tingyang Zhang, Chen Wang, Zhiyang Dou, Qingzhe Gao, Jiahui Lei, Baoquan Chen, Lingjie Liu

Abstract: In this paper, we propose ProTracker, a novel framework for robust and accurate long-term dense tracking of arbitrary points in videos. The key idea of our method is to use probabilistic integration to refine multiple predictions from both optical flow and semantic features for robust short-term and long-term tracking. Specifically, we integrate optical flow estimations in a probabilistic manner, producing smooth and accurate trajectories by maximizing the likelihood of each prediction. To effectively re-localize challenging points that disappear and reappear due to occlusion, we further incorporate long-term feature correspondence into our flow predictions for continuous trajectory generation. Extensive experiments show that ProTracker achieves state-of-the-art performance among unsupervised and self-supervised approaches, and even outperforms supervised methods on several benchmarks. Our code and model will be publicly available upon publication.

* Project page: https://michaelszj.github.io/protracker
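
The abstract does not spell out how predictions are integrated, so the following is only a minimal sketch of one standard form of probabilistic integration, not the ProTracker implementation. It assumes each candidate position comes with an isotropic Gaussian uncertainty; under that assumption the maximum-likelihood fused position is the inverse-variance weighted mean. The function name and the sample values are hypothetical.

```python
def fuse_gaussian_predictions(predictions):
    """Fuse 2D point predictions given as ((x, y), variance) pairs.

    Assumption (not from the paper): every prediction is an independent
    isotropic Gaussian, so the maximum-likelihood estimate is the
    inverse-variance weighted mean of the predicted positions.
    """
    total_weight = sum(1.0 / var for _, var in predictions)
    x = sum(px / var for (px, _), var in predictions) / total_weight
    y = sum(py / var for (_, py), var in predictions) / total_weight
    fused_variance = 1.0 / total_weight  # fused estimate is more certain than any input
    return (x, y), fused_variance


# Two hypothetical predictions for the same point: one confident, one noisy.
predictions = [((10.0, 20.0), 1.0), ((14.0, 24.0), 3.0)]
fused_position, fused_variance = fuse_gaussian_predictions(predictions)
print(fused_position, fused_variance)
```

The fused position is pulled toward the lower-variance prediction, which is the behavior one would want when combining a reliable short-range flow estimate with a noisier long-range one.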

Extrapolated Urban View Synthesis Benchmark

Dec 10, 2024

Xiangyu Han, Zhen Jia, Boyi Li, Yan Wang, Boris Ivanovic, Yurong You, Lingjie Liu, Yue Wang, Marco Pavone, Chen Feng(+1 more)

Abstract: Photorealistic simulators are essential for the training and evaluation of vision-centric autonomous vehicles (AVs). At their core is Novel View Synthesis (NVS), a crucial capability that generates diverse unseen viewpoints to accommodate the broad and continuous pose distribution of AVs. Recent advances in radiance fields, such as 3D Gaussian Splatting, achieve photorealistic rendering at real-time speeds and have been widely used in modeling large-scale driving scenes. However, their performance is commonly evaluated using an interpolated setup with highly correlated training and test views. In contrast, extrapolation, where test views largely deviate from training views, remains underexplored, limiting progress in generalizable simulation technology. To address this gap, we leverage publicly available AV datasets with multiple traversals, multiple vehicles, and multiple cameras to build the first Extrapolated Urban View Synthesis (EUVS) benchmark. We also conduct quantitative and qualitative evaluations of state-of-the-art Gaussian Splatting methods across different difficulty levels. Our results show that Gaussian Splatting is prone to overfitting to training views. Moreover, incorporating diffusion priors and improving geometry cannot fundamentally improve NVS under large view changes, highlighting the need for more robust approaches and large-scale training. We have released our data to help advance self-driving and urban robotics simulation technology.

* Project page: https://ai4ce.github.io/EUVS-Benchmark/

HDGS: Textured 2D Gaussian Splatting for Enhanced Scene Rendering

Dec 02, 2024

Yunzhou Song, Heguang Lin, Jiahui Lei, Lingjie Liu, Kostas Daniilidis

Abstract: Recent advancements in neural rendering, particularly 2D Gaussian Splatting (2DGS), have shown promising results for jointly reconstructing fine appearance and geometry by leveraging 2D Gaussian surfels. However, current methods face significant challenges when rendering at arbitrary viewpoints, such as anti-aliasing for down-sampled rendering and texture detail preservation for high-resolution rendering. We propose a novel method that aligns the 2D surfels with texture maps and augments them with per-ray depth sorting and Fisher-based pruning for rendering consistency and efficiency. With correct ordering, per-surfel texture maps significantly improve the ability to capture fine details. Additionally, to render high-fidelity details from varying viewpoints, we design a frustum-based sampling method to mitigate aliasing artifacts. Experimental results on benchmarks and our custom texture-rich dataset demonstrate that our method surpasses existing techniques, particularly in detail preservation and anti-aliasing.

* Project Page: https://timsong412.github.io/HDGS-ProjPage/

MotionWavelet: Human Motion Prediction via Wavelet Manifold Learning

Nov 25, 2024

Yuming Feng, Zhiyang Dou, Ling-Hao Chen, Yuan Liu, Tianyu Li, Jingbo Wang, Zeyu Cao, Wenping Wang, Taku Komura, Lingjie Liu

Abstract: Modeling temporal characteristics and the non-stationary dynamics of body movement plays a significant role in predicting future human motion. However, it is challenging to capture these features due to the subtle transitions involved in complex human motions. This paper introduces MotionWavelet, a human motion prediction framework that utilizes Wavelet Transformation and studies human motion patterns in the spatial-frequency domain. In MotionWavelet, a Wavelet Diffusion Model (WDM) learns a Wavelet Manifold by applying Wavelet Transformation to the motion data, thereby encoding the intricate spatial and temporal motion patterns. Once the Wavelet Manifold is built, WDM trains a diffusion model to generate human motions from Wavelet latent vectors. In addition to the WDM, MotionWavelet also presents a Wavelet Space Shaping Guidance mechanism that refines the denoising process to improve conformity with the manifold structure. WDM also develops Temporal Attention-Based Guidance to enhance prediction accuracy. Extensive experiments validate the effectiveness of MotionWavelet, demonstrating improved prediction accuracy and enhanced generalization across various benchmarks. Our code and models will be released upon acceptance.

* Project Page: https://frank-zy-dou.github.io/projects/MotionWavelet/
* Video: https://youtu.be/pyWq0OYJdI0?si=4YHfFNXmLnbPC39g
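
To make the idea of applying a wavelet transformation to motion data concrete, here is a minimal sketch using a single-level orthonormal Haar transform. It is an illustration only: the abstract does not say which wavelet family MotionWavelet uses, and the input below is a hypothetical trajectory of one joint coordinate over eight frames, not data from the paper.

```python
import math


def haar_forward(signal):
    """One level of the orthonormal Haar transform on an even-length sequence.

    Returns (approx, detail): the low-frequency trend and the
    high-frequency frame-to-frame changes.
    """
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Invert haar_forward exactly, recovering the original sequence."""
    s = 1.0 / math.sqrt(2.0)
    signal = []
    for a, d in zip(approx, detail):
        signal.append((a + d) * s)
        signal.append((a - d) * s)
    return signal


# Hypothetical x-coordinate of one joint over eight frames.
trajectory = [0.0, 0.1, 0.4, 0.9, 1.6, 2.5, 3.6, 4.9]
approx, detail = haar_forward(trajectory)
reconstructed = haar_inverse(approx, detail)
```

Because the transform is invertible, a model that operates on the wavelet coefficients can always be mapped back to a motion sequence, which is what makes a wavelet-domain latent space usable for prediction.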

DICE: End-to-end Deformation Capture of Hand-Face Interactions from a Single Image

Jun 26, 2024

Qingxuan Wu, Zhiyang Dou, Sirui Xu, Soshi Shimada, Chen Wang, Zhengming Yu, Yuan Liu, Cheng Lin, Zeyu Cao, Taku Komura(+4 more)

Abstract: Reconstructing 3D hand-face interactions with deformations from a single image is a challenging yet crucial task with broad applications in AR, VR, and gaming. The challenges stem from self-occlusions during single-view hand-face interactions, diverse spatial relationships between hands and face, complex deformations, and the ambiguity of the single-view setting. The first and only method for hand-face interaction recovery, Decaf, introduces a global fitting optimization guided by contact and deformation estimation networks trained on studio-collected data with 3D annotations. However, Decaf suffers from a time-consuming optimization process and limited generalization capability due to its reliance on 3D annotations of hand-face interaction data. To address these issues, we present DICE, the first end-to-end method for Deformation-aware hand-face Interaction reCovEry from a single image. DICE estimates the poses of hands and faces, contacts, and deformations simultaneously using a Transformer-based architecture. It disentangles the regression of local deformation fields and global mesh vertex locations into two network branches, enhancing deformation and contact estimation for precise and robust hand-face mesh recovery. To improve generalizability, we propose a weakly-supervised training approach that augments the training set using in-the-wild images without 3D ground-truth annotations, employing the depths of 2D keypoints estimated by off-the-shelf models and adversarial priors of poses for supervision. Our experiments demonstrate that DICE achieves state-of-the-art performance on a standard benchmark and in-the-wild data in terms of accuracy and physical plausibility. Additionally, our method operates at an interactive rate (20 fps) on an Nvidia 4090 GPU, whereas Decaf requires more than 15 seconds for a single image. Our code will be publicly available upon publication.

* 23 pages, 9 figures, 3 tables

GECO: Generative Image-to-3D within a SECOnd

May 30, 2024

Chen Wang, Jiatao Gu, Xiaoxiao Long, Yuan Liu, Lingjie Liu

Abstract: 3D generation has seen remarkable progress in recent years. Existing techniques, such as score distillation methods, produce notable results but require extensive per-scene optimization, impacting time efficiency. Alternatively, reconstruction-based approaches prioritize efficiency but compromise quality due to their limited handling of uncertainty. We introduce GECO, a novel method for high-quality 3D generative modeling that operates within a second. Our approach addresses the prevalent issues of uncertainty and inefficiency in current methods through a two-stage approach. In the initial stage, we train a single-step multi-view generative model with score distillation. Then, a second-stage distillation is applied to address the challenge of view inconsistency from the multi-view prediction. This two-stage process ensures a balanced approach to 3D generation, optimizing both quality and efficiency. Our comprehensive experiments demonstrate that GECO achieves high-quality image-to-3D generation with an unprecedented level of efficiency.

* Project Page: https://cwchenwang.github.io/geco

RealmDreamer: Text-Driven 3D Scene Generation with Inpainting and Depth Diffusion

Apr 10, 2024

Jaidev Shriram, Alex Trevithick, Lingjie Liu, Ravi Ramamoorthi

Abstract: We introduce RealmDreamer, a technique for generating general forward-facing 3D scenes from text descriptions. Our technique optimizes a 3D Gaussian Splatting representation to match complex text prompts. We initialize these splats by utilizing state-of-the-art text-to-image generators, lifting their samples into 3D, and computing the occlusion volume. We then optimize this representation across multiple views as a 3D inpainting task with image-conditional diffusion models. To learn correct geometric structure, we incorporate a depth diffusion model by conditioning on the samples from the inpainting model, giving rich geometric structure. Finally, we finetune the model using sharpened samples from image generators. Notably, our technique does not require video or multi-view data and can synthesize a variety of high-quality 3D scenes in different styles, consisting of multiple objects. Its generality additionally allows 3D synthesis from a single image.

* Project Page: https://realmdreamer.github.io/

TRAM: Global Trajectory and Motion of 3D Humans from in-the-wild Videos

Mar 26, 2024

Yufu Wang, Ziyun Wang, Lingjie Liu, Kostas Daniilidis

Abstract: We propose TRAM, a two-stage method to reconstruct a human's global trajectory and motion from in-the-wild videos. TRAM robustifies SLAM to recover the camera motion in the presence of dynamic humans and uses the scene background to derive the motion scale. Using the recovered camera as a metric-scale reference frame, we introduce a video transformer model (VIMO) to regress the kinematic body motion of a human. By composing the two motions, we achieve accurate recovery of 3D humans in the world space, reducing global motion errors by 60% from prior work.

* The project website: https://yufu-wang.github.io/tram4d/

Track Everything Everywhere Fast and Robustly

Mar 26, 2024

Yunzhou Song, Jiahui Lei, Ziyun Wang, Lingjie Liu, Kostas Daniilidis

Abstract: We propose a novel test-time optimization approach for efficiently and robustly tracking any pixel at any time in a video. The latest state-of-the-art optimization-based tracking technique, OmniMotion, requires a prohibitively long optimization time, rendering it impractical for downstream applications. Moreover, OmniMotion is sensitive to the choice of random seeds, leading to unstable convergence. To improve efficiency and robustness, we introduce a novel invertible deformation network, CaDeX++, which factorizes the function representation into a local spatial-temporal feature grid and enhances the expressivity of the coupling blocks with non-linear functions. While CaDeX++ incorporates a stronger geometric bias within its architectural design, it also takes advantage of the inductive bias provided by the vision foundation models. Our system utilizes monocular depth estimation to represent scene geometry and enhances the objective by incorporating DINOv2 long-term semantics to regulate the optimization process. Our experiments demonstrate a substantial improvement in training speed (more than 10 times faster), robustness, and accuracy in tracking over the SoTA optimization-based method OmniMotion.

* project page: https://timsong412.github.io/FastOmniTrack/
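
The abstract mentions coupling blocks in an invertible deformation network. As background, the sketch below shows a generic affine coupling block of the kind used in normalizing flows; it is not CaDeX++ itself, and the conditioning functions `scale_fn` and `shift_fn` are hypothetical stand-ins for learned networks. The point it illustrates is that the block can be inverted exactly no matter how non-linear those functions are.

```python
import math


def coupling_forward(x, y, scale_fn, shift_fn):
    """Affine coupling: x passes through unchanged, y is scaled and shifted
    by amounts that depend only on x."""
    return x, y * math.exp(scale_fn(x)) + shift_fn(x)


def coupling_inverse(x, y_out, scale_fn, shift_fn):
    """Exact inverse: x is available unchanged, so the same scale and shift
    can be recomputed and undone."""
    return x, (y_out - shift_fn(x)) * math.exp(-scale_fn(x))


# Hypothetical non-linear conditioners standing in for learned networks.
scale_fn = math.tanh
shift_fn = math.sin

x0, y0 = 0.7, -1.3
x1, y1 = coupling_forward(x0, y0, scale_fn, shift_fn)
x2, y2 = coupling_inverse(x1, y1, scale_fn, shift_fn)
```

Stacking such blocks while alternating which coordinate passes through yields a deformation that is expressive yet cheap to invert, which is the property that lets a point be mapped between a frame and a shared canonical space in either direction.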
