Human motion generation, a cornerstone technique in animation and video production, has widespread applications in various tasks like text-to-motion and music-to-dance. Previous works focus on developing specialist models tailored for each task without scalability. In this work, we present Large Motion Model (LMM), a motion-centric, multi-modal framework that unifies mainstream motion generation tasks into a generalist model. A unified motion model is appealing since it can leverage a wide range of motion data to achieve broad generalization beyond a single task. However, it is also challenging due to the heterogeneous nature of substantially different motion data and tasks. LMM tackles these challenges from three principled aspects: 1) Data: We consolidate datasets with different modalities, formats and tasks into a comprehensive yet unified motion generation dataset, MotionVerse, comprising 10 tasks, 16 datasets, a total of 320k sequences, and 100 million frames. 2) Architecture: We design an articulated attention mechanism ArtAttention that incorporates body part-aware modeling into Diffusion Transformer backbone. 3) Pre-Training: We propose a novel pre-training strategy for LMM, which employs variable frame rates and masking forms, to better exploit knowledge from diverse training data. Extensive experiments demonstrate that our generalist LMM achieves competitive performance across various standard motion generation tasks over state-of-the-art specialist models. Notably, LMM exhibits strong generalization capabilities and emerging properties across many unseen tasks. Additionally, our ablation studies reveal valuable insights about training and scaling up large motion models for future research.
Surface parameterization is a fundamental geometry processing problem with rich downstream applications. Traditional approaches are designed to operate on well-behaved mesh models with high-quality triangulations that are laboriously produced by specialized 3D modelers, and thus unable to meet the processing demand for the current explosion of ordinary 3D data. In this paper, we seek to perform UV unwrapping on unstructured 3D point clouds. Technically, we propose ParaPoint, an unsupervised neural learning pipeline for achieving global free-boundary surface parameterization by building point-wise mappings between given 3D points and 2D UV coordinates with adaptively deformed boundaries. We ingeniously construct several geometrically meaningful sub-networks with specific functionalities, and assemble them into a bi-directional cycle mapping framework. We also design effective loss functions and auxiliary differential geometric constraints for the optimization of the neural mapping process. To the best of our knowledge, this work makes the first attempt to investigate neural point cloud parameterization that pursues both global mappings and free boundaries. Experiments demonstrate the effectiveness and inspiring potential of our proposed learning paradigm. The code will be publicly available.
3D Gaussian Splatting (3D-GS) has emerged as a significant advancement in the field of Computer Graphics, offering explicit scene representation and novel view synthesis without the reliance on neural networks, such as Neural Radiance Fields (NeRF). This technique has found diverse applications in areas such as robotics, urban mapping, autonomous navigation, and virtual reality/augmented reality, just name a few. Given the growing popularity and expanding research in 3D Gaussian Splatting, this paper presents a comprehensive survey of relevant papers from the past year. We organize the survey into taxonomies based on characteristics and applications, providing an introduction to the theoretical underpinnings of 3D Gaussian Splatting. Our goal through this survey is to acquaint new researchers with 3D Gaussian Splatting, serve as a valuable reference for seminal works in the field, and inspire future research directions, as discussed in our concluding section.
We introduce a new generative model that combines latent diffusion with persistent homology to create 3D shapes with high diversity, with a special emphasis on their topological characteristics. Our method involves representing 3D shapes as implicit fields, then employing persistent homology to extract topological features, including Betti numbers and persistence diagrams. The shape generation process consists of two steps. Initially, we employ a transformer-based autoencoding module to embed the implicit representation of each 3D shape into a set of latent vectors. Subsequently, we navigate through the learned latent space via a diffusion model. By strategically incorporating topological features into the diffusion process, our generative module is able to produce a richer variety of 3D shapes with different topological structures. Furthermore, our framework is flexible, supporting generation tasks constrained by a variety of inputs, including sparse and partial point clouds, as well as sketches. By modifying the persistence diagrams, we can alter the topology of the shapes generated from these input modalities.
The rapid development of open-source large language models (LLMs) has been truly remarkable. However, the scaling law described in previous literature presents varying conclusions, which casts a dark cloud over scaling LLMs. We delve into the study of scaling laws and present our distinctive findings that facilitate scaling of large scale models in two commonly used open-source configurations, 7B and 67B. Guided by the scaling laws, we introduce DeepSeek LLM, a project dedicated to advancing open-source language models with a long-term perspective. To support the pre-training phase, we have developed a dataset that currently consists of 2 trillion tokens and is continuously expanding. We further conduct supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) on DeepSeek LLM Base models, resulting in the creation of DeepSeek Chat models. Our evaluation results demonstrate that DeepSeek LLM 67B surpasses LLaMA-2 70B on various benchmarks, particularly in the domains of code, mathematics, and reasoning. Furthermore, open-ended evaluations reveal that DeepSeek LLM 67B Chat exhibits superior performance compared to GPT-3.5.
This paper presents a novel two-stage approach for reconstructing human faces from sparse-view images, a task made challenging by the unique geometry and complex skin reflectance of each individual. Our method focuses on decomposing key facial attributes, including geometry, diffuse reflectance, and specular reflectance, from ambient light. Initially, we create a general facial template from a diverse collection of individual faces, capturing essential geometric and reflectance characteristics. Guided by this template, we refine each specific face model in the second stage, which further considers the interaction between geometry and reflectance, as well as the subsurface scattering effects on facial skin. Our method enables the reconstruction of high-quality facial representations from as few as three images, offering improved geometric accuracy and reflectance detail. Through comprehensive evaluations and comparisons, our method demonstrates superiority over existing techniques. Our method effectively disentangles geometry and reflectance components, leading to enhanced quality in synthesizing new views and opening up possibilities for applications such as relighting and reflectance editing. We will make the code publicly available.
Neural radiance fields (NeRF) typically require a complete set of images taken from multiple camera perspectives to accurately reconstruct geometric details. However, this approach raise significant privacy concerns in the context of facial reconstruction. The critical need for privacy protection often leads invidividuals to be reluctant in sharing their facial images, due to fears of potential misuse or security risks. Addressing these concerns, we propose a method that leverages privacy-preserving images for reconstructing 3D head geometry within the NeRF framework. Our method stands apart from traditional facial reconstruction techniques as it does not depend on RGB information from images containing sensitive facial data. Instead, it effectively generates plausible facial geometry using a series of identity-obscured inputs, thereby protecting facial privacy.
In this paper, we propose a new method, called DoubleCoverUDF, for extracting the zero level-set from unsigned distance fields (UDFs). DoubleCoverUDF takes a learned UDF and a user-specified parameter $r$ (a small positive real number) as input and extracts an iso-surface with an iso-value $r$ using the conventional marching cubes algorithm. We show that the computed iso-surface is the boundary of the $r$-offset volume of the target zero level-set $S$, which is an orientable manifold, regardless of the topology of $S$. Next, the algorithm computes a covering map to project the boundary mesh onto $S$, preserving the mesh's topology and avoiding folding. If $S$ is an orientable manifold surface, our algorithm separates the double-layered mesh into a single layer using a robust minimum-cut post-processing step. Otherwise, it keeps the double-layered mesh as the output. We validate our algorithm by reconstructing 3D surfaces of open models and demonstrate its efficacy and effectiveness on synthetic models and benchmark datasets. Our experimental results confirm that our method is robust and produces meshes with better quality in terms of both visual evaluation and quantitative measures than existing UDF-based methods. The source code is available at https://github.com/jjjkkyz/DCUDF.
Modelling temporal networks for dynamic link prediction of new nodes has many real-world applications, such as providing relevant item recommendations to new customers in recommender systems and suggesting appropriate posts to new users on social platforms. Unlike old nodes, new nodes have few historical links, which poses a challenge for the dynamic link prediction task. Most existing dynamic models treat all nodes equally and are not specialized for new nodes, resulting in suboptimal performances. In this paper, we consider dynamic link prediction of new nodes as a few-shot problem and propose a novel model based on the meta-learning principle to effectively mitigate this problem. Specifically, we develop a temporal encoder with a node-level span memory to obtain a new node embedding, and then we use a predictor to determine whether the new node generates a link. To overcome the few-shot challenge, we incorporate the encoder-predictor into the meta-learning paradigm, which can learn two types of implicit information during the formation of the temporal network through span adaptation and node adaptation. The acquired implicit information can serve as model initialisation and facilitate rapid adaptation to new nodes through a fine-tuning process on just a few links. Experiments on three publicly available datasets demonstrate the superior performance of our model compared to existing state-of-the-art methods.
The growing capabilities of neural rendering have increased the demand for new techniques that enable the intuitive editing of 3D objects, particularly when they are represented as neural implicit surfaces. In this paper, we present a novel neural algorithm to parameterize neural implicit surfaces to simple parametric domains, such as spheres, cubes or polycubes, where 3D radiance field can be represented as a 2D field, thereby facilitating visualization and various editing tasks. Technically, our method computes a bi-directional deformation between 3D objects and their chosen parametric domains, eliminating the need for any prior information. We adopt a forward mapping of points on the zero level set of the 3D object to a parametric domain, followed by a backward mapping through inverse deformation. To ensure the map is bijective, we employ a cycle loss while optimizing the smoothness of both deformations. Additionally, we leverage a Laplacian regularizer to effectively control angle distortion and offer the flexibility to choose from a range of parametric domains for managing area distortion. Designed for compatibility, our framework integrates seamlessly with existing neural rendering pipelines, taking multi-view images as input to reconstruct 3D geometry and compute the corresponding texture map. We also introduce a simple yet effective technique for intrinsic radiance decomposition, facilitating both view-independent material editing and view-dependent shading editing. Our method allows for the immediate rendering of edited textures through volume rendering, without the need for network re-training. Moreover, our approach supports the co-parameterization of multiple objects and enables texture transfer between them. We demonstrate the effectiveness of our method on images of human heads and man-made objects. We will make the source code publicly available.