Abstract: Recent pose-to-video models can translate 2D pose sequences into photorealistic, identity-preserving dance videos, so the key challenge is to generate temporally coherent, rhythm-aligned 2D poses from music, especially under complex, high-variance in-the-wild distributions. We address this by reframing music-to-dance generation as a music-token-conditioned multi-channel image synthesis problem: 2D pose sequences are encoded as one-hot images, compressed by a pretrained image VAE, and modeled with a DiT-style backbone, allowing us to inherit architectural and training advances from modern text-to-image models and better capture high-variance 2D pose distributions. On top of this formulation, we introduce (i) a time-shared temporal indexing scheme that explicitly synchronizes music tokens and pose latents over time and (ii) a reference-pose conditioning strategy that preserves subject-specific body proportions and on-screen scale while enabling long-horizon segment-and-stitch generation. Experiments on a large in-the-wild 2D dance corpus and the calibrated AIST++2D benchmark show consistent improvements over representative music-to-dance methods in pose- and video-space metrics and human preference, and ablations validate the contributions of the representation, temporal indexing, and reference conditioning. See supplementary videos at https://hot-dance.github.io
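The abstract does not spell out how the one-hot pose images are formed; the snippet below is a minimal sketch of one plausible reading, in which each frame becomes a J-channel image with a single one-hot activation per joint channel at the joint's quantized (x, y) location, ready to be compressed by a pretrained image VAE. The function name, grid size, and normalization convention are illustrative assumptions, not the paper's actual encoding.

```python
import numpy as np

def poses_to_onehot_images(poses, height=64, width=64):
    """Rasterize a 2D pose sequence into per-frame one-hot images (sketch).

    poses: float array of shape (T, J, 2) with joint coordinates
           normalized to [0, 1] as (x, y).
    Returns an array of shape (T, J, height, width) where each joint
    channel contains a single 1 at the joint's quantized location.
    """
    T, J, _ = poses.shape
    images = np.zeros((T, J, height, width), dtype=np.float32)
    # Quantize continuous coordinates to pixel indices.
    xs = np.clip((poses[..., 0] * (width - 1)).round().astype(int), 0, width - 1)
    ys = np.clip((poses[..., 1] * (height - 1)).round().astype(int), 0, height - 1)
    t_idx = np.arange(T)[:, None]   # broadcast over joints
    j_idx = np.arange(J)[None, :]   # broadcast over frames
    images[t_idx, j_idx, ys, xs] = 1.0
    return images
```

Under this assumed layout, the stack of T multi-channel images could be fed frame-by-frame through an off-the-shelf image VAE, with the time-shared temporal indices aligning the resulting pose latents to the music tokens.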
Abstract: Recent advancements in 2D generation technology have sparked widespread discussion on using 2D priors for 3D shape and texture generation. However, these methods often overlook subsequent user operations: when the user acquires the 3D model and simplifies its structure, the texture suffers from aliasing and blurring. Traditional graphics methods partially alleviate this issue, but recent texture synthesis techniques fail to ensure consistency with the original model's appearance and cannot achieve high-fidelity restoration. Moreover, background noise frequently arises in high-resolution texture synthesis, limiting the practical application of these generation technologies. In this work, we propose a high-resolution, high-fidelity texture restoration technique that uses the rough texture as the initial input to enhance the consistency between the synthesized texture and the initial texture, thereby overcoming the aliasing and blurring caused by the user's structure-simplification operations. Additionally, we introduce a background noise smoothing technique based on a self-supervised scheme to address the noise problem in current high-resolution texture synthesis pipelines. Our approach enables high-resolution texture synthesis, paving the way for high-definition, high-detail texture generation. Experiments demonstrate that our scheme outperforms existing methods in high-fidelity texture recovery under high-resolution conditions.
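The self-supervised background noise smoothing scheme is not detailed in the abstract; as one hedged illustration, a smoothness penalty restricted to texels outside the UV islands, as sketched below, would suppress background noise without touching the restored foreground texture. The function name, mask convention, and loss form are assumptions rather than the paper's actual formulation.

```python
import torch

def background_tv_loss(texture, fg_mask):
    """Total-variation smoothing restricted to background texels (sketch).

    texture: (C, H, W) texture map being optimized.
    fg_mask: (H, W) binary mask, 1 where UV islands cover the texture.
    Penalizes local variation only outside the UV islands, leaving the
    restored foreground texture untouched.
    """
    bg = (1.0 - fg_mask)[None]                                   # (1, H, W)
    dx = (texture[:, :, 1:] - texture[:, :, :-1]).abs() * bg[:, :, 1:]
    dy = (texture[:, 1:, :] - texture[:, :-1, :]).abs() * bg[:, 1:, :]
    return dx.mean() + dy.mean()
```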
Abstract: Generative models for 3D object synthesis have seen significant advancements with the incorporation of prior knowledge distilled from 2D diffusion models. Nevertheless, existing 3D synthesis frameworks still suffer from multi-view geometric inconsistencies and slow generation speeds. This can be attributed to two factors: first, the lack of abundant geometric prior knowledge in optimization, and second, the entanglement between geometry and texture in conventional 3D generation methods. In response, we introduce MetaDreamer, a two-stage optimization approach that leverages rich 2D and 3D prior knowledge. In the first stage, we focus on optimizing the geometric representation to ensure multi-view consistency and accuracy of the 3D object. In the second stage, we fine-tune the geometry and optimize the texture, thereby achieving a more refined 3D object. By leveraging 2D and 3D prior knowledge in the two stages, respectively, we effectively mitigate the interdependence between geometry and texture. MetaDreamer establishes clear optimization objectives for each stage, resulting in significant time savings in the 3D generation process. Ultimately, MetaDreamer can generate high-quality 3D objects from textual prompts within 20 minutes, and to the best of our knowledge, it is the most efficient text-to-3D generation method. Furthermore, we introduce image control into the process, enhancing the controllability of 3D generation. Extensive empirical evidence confirms that our method is not only highly efficient but also achieves quality at the forefront of current state-of-the-art 3D generation techniques.
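The abstract describes the two-stage schedule only at a high level; the sketch below shows the control flow under the assumption that stage one optimizes geometry parameters alone and stage two fine-tunes geometry jointly with the texture. The function and loss-callback names are placeholders; the actual MetaDreamer objectives (e.g. the 2D/3D-prior-guided losses) are not given in the abstract.

```python
import torch

def two_stage_optimize(geometry_params, texture_params,
                       geometry_loss_fn, texture_loss_fn,
                       stage1_steps=2000, stage2_steps=2000, lr=1e-2):
    """Skeleton of a geometry-then-texture two-stage optimization loop (sketch).

    geometry_params / texture_params: lists of torch.nn.Parameter for the
    3D representation and its texture.
    geometry_loss_fn / texture_loss_fn: zero-argument callables returning the
    scalar objective of each stage (stand-ins for the prior-guided losses).
    """
    geometry_params = list(geometry_params)
    texture_params = list(texture_params)

    # Stage 1: optimize geometry only, targeting multi-view consistency.
    opt_geo = torch.optim.Adam(geometry_params, lr=lr)
    for _ in range(stage1_steps):
        opt_geo.zero_grad()
        geometry_loss_fn().backward()
        opt_geo.step()

    # Stage 2: fine-tune geometry and optimize texture jointly.
    opt_all = torch.optim.Adam(geometry_params + texture_params, lr=lr * 0.1)
    for _ in range(stage2_steps):
        opt_all.zero_grad()
        texture_loss_fn().backward()
        opt_all.step()
```

Separating the stages this way reflects the abstract's stated goal of giving each stage a clear objective, which is where the reported time savings are said to come from.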