Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Tao Mei

T-SVG: Text-Driven Stereoscopic Video Generation

Dec 12, 2024

Qiao Jin, Xiaodong Chen, Wu Liu, Tao Mei, Yongdong Zhang

Figure 1 for T-SVG: Text-Driven Stereoscopic Video Generation

Figure 2 for T-SVG: Text-Driven Stereoscopic Video Generation

Figure 3 for T-SVG: Text-Driven Stereoscopic Video Generation

Figure 4 for T-SVG: Text-Driven Stereoscopic Video Generation

Abstract:The advent of stereoscopic videos has opened new horizons in multimedia, particularly in extended reality (XR) and virtual reality (VR) applications, where immersive content captivates audiences across various platforms. Despite its growing popularity, producing stereoscopic videos remains challenging due to the technical complexities involved in generating stereo parallax. This refers to the positional differences of objects viewed from two distinct perspectives and is crucial for creating depth perception. This complex process poses significant challenges for creators aiming to deliver convincing and engaging presentations. To address these challenges, this paper introduces the Text-driven Stereoscopic Video Generation (T-SVG) system. This innovative, model-agnostic, zero-shot approach streamlines video generation by using text prompts to create reference videos. These videos are transformed into 3D point cloud sequences, which are rendered from two perspectives with subtle parallax differences, achieving a natural stereoscopic effect. T-SVG represents a significant advancement in stereoscopic content creation by integrating state-of-the-art, training-free techniques in text-to-video generation, depth estimation, and video inpainting. Its flexible architecture ensures high efficiency and user-friendliness, allowing seamless updates with newer models without retraining. By simplifying the production pipeline, T-SVG makes stereoscopic video generation accessible to a broader audience, demonstrating its potential to revolutionize the field.

* 5 pages, 4 figures

Via

Access Paper or Ask Questions

Improving Text-guided Object Inpainting with Semantic Pre-inpainting

Sep 12, 2024

Yifu Chen, Jingwen Chen, Yingwei Pan, Yehao Li, Ting Yao, Zhineng Chen, Tao Mei

Abstract:Recent years have witnessed the success of large text-to-image diffusion models and their remarkable potential to generate high-quality images. The further pursuit of enhancing the editability of images has sparked significant interest in the downstream task of inpainting a novel object described by a text prompt within a designated region in the image. Nevertheless, the problem is not trivial from two aspects: 1) Solely relying on one single U-Net to align text prompt and visual object across all the denoising timesteps is insufficient to generate desired objects; 2) The controllability of object generation is not guaranteed in the intricate sampling space of diffusion model. In this paper, we propose to decompose the typical single-stage object inpainting into two cascaded processes: 1) semantic pre-inpainting that infers the semantic features of desired objects in a multi-modal feature space; 2) high-fieldity object generation in diffusion latent space that pivots on such inpainted semantic features. To achieve this, we cascade a Transformer-based semantic inpainter and an object inpainting diffusion model, leading to a novel CAscaded Transformer-Diffusion (CAT-Diffusion) framework for text-guided object inpainting. Technically, the semantic inpainter is trained to predict the semantic features of the target object conditioning on unmasked context and text prompt. The outputs of the semantic inpainter then act as the informative visual prompts to guide high-fieldity object generation through a reference adapter layer, leading to controllable object inpainting. Extensive evaluations on OpenImages-V6 and MSCOCO validate the superiority of CAT-Diffusion against the state-of-the-art methods. Code is available at \url{https://github.com/Nnn-s/CATdiffusion}.

* ECCV 2024. Source code is available at https://github.com/Nnn-s/CATdiffusion

Via

Access Paper or Ask Questions

Improving Virtual Try-On with Garment-focused Diffusion Models

Sep 12, 2024

Siqi Wan, Yehao Li, Jingwen Chen, Yingwei Pan, Ting Yao, Yang Cao, Tao Mei

Figure 1 for Improving Virtual Try-On with Garment-focused Diffusion Models

Figure 2 for Improving Virtual Try-On with Garment-focused Diffusion Models

Figure 3 for Improving Virtual Try-On with Garment-focused Diffusion Models

Figure 4 for Improving Virtual Try-On with Garment-focused Diffusion Models

Abstract:Diffusion models have led to the revolutionizing of generative modeling in numerous image synthesis tasks. Nevertheless, it is not trivial to directly apply diffusion models for synthesizing an image of a target person wearing a given in-shop garment, i.e., image-based virtual try-on (VTON) task. The difficulty originates from the aspect that the diffusion process should not only produce holistically high-fidelity photorealistic image of the target person, but also locally preserve every appearance and texture detail of the given garment. To address this, we shape a new Diffusion model, namely GarDiff, which triggers the garment-focused diffusion process with amplified guidance of both basic visual appearance and detailed textures (i.e., high-frequency details) derived from the given garment. GarDiff first remoulds a pre-trained latent diffusion model with additional appearance priors derived from the CLIP and VAE encodings of the reference garment. Meanwhile, a novel garment-focused adapter is integrated into the UNet of diffusion model, pursuing local fine-grained alignment with the visual appearance of reference garment and human pose. We specifically design an appearance loss over the synthesized garment to enhance the crucial, high-frequency details. Extensive experiments on VITON-HD and DressCode datasets demonstrate the superiority of our GarDiff when compared to state-of-the-art VTON approaches. Code is publicly available at: \href{https://github.com/siqi0905/GarDiff/tree/master}{https://github.com/siqi0905/GarDiff/tree/master}.

* ECCV 2024. Source code is available at https://github.com/siqi0905/GarDiff/tree/master

Via

Access Paper or Ask Questions

Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models

Sep 11, 2024

Haibo Yang, Yang Chen, Yingwei Pan, Ting Yao, Zhineng Chen, Chong-Wah Ngo, Tao Mei

Abstract:Despite having tremendous progress in image-to-3D generation, existing methods still struggle to produce multi-view consistent images with high-resolution textures in detail, especially in the paradigm of 2D diffusion that lacks 3D awareness. In this work, we present High-resolution Image-to-3D model (Hi3D), a new video diffusion based paradigm that redefines a single image to multi-view images as 3D-aware sequential image generation (i.e., orbital video generation). This methodology delves into the underlying temporal consistency knowledge in video diffusion model that generalizes well to geometry consistency across multiple views in 3D generation. Technically, Hi3D first empowers the pre-trained video diffusion model with 3D-aware prior (camera pose condition), yielding multi-view images with low-resolution texture details. A 3D-aware video-to-video refiner is learnt to further scale up the multi-view images with high-resolution texture details. Such high-resolution multi-view images are further augmented with novel views through 3D Gaussian Splatting, which are finally leveraged to obtain high-fidelity meshes via 3D reconstruction. Extensive experiments on both novel view synthesis and single view reconstruction demonstrate that our Hi3D manages to produce superior multi-view consistency images with highly-detailed textures. Source code and data are available at \url{https://github.com/yanghb22-fdu/Hi3D-Official}.

* ACM Multimedia 2024. Source code is available at \url{https://github.com/yanghb22-fdu/Hi3D-Official}

Via

Access Paper or Ask Questions

FreeEnhance: Tuning-Free Image Enhancement via Content-Consistent Noising-and-Denoising Process

Sep 11, 2024

Yang Luo, Yiheng Zhang, Zhaofan Qiu, Ting Yao, Zhineng Chen, Yu-Gang Jiang, Tao Mei

Figure 1 for FreeEnhance: Tuning-Free Image Enhancement via Content-Consistent Noising-and-Denoising Process

Figure 2 for FreeEnhance: Tuning-Free Image Enhancement via Content-Consistent Noising-and-Denoising Process

Figure 3 for FreeEnhance: Tuning-Free Image Enhancement via Content-Consistent Noising-and-Denoising Process

Figure 4 for FreeEnhance: Tuning-Free Image Enhancement via Content-Consistent Noising-and-Denoising Process

Abstract:The emergence of text-to-image generation models has led to the recognition that image enhancement, performed as post-processing, would significantly improve the visual quality of the generated images. Exploring diffusion models to enhance the generated images nevertheless is not trivial and necessitates to delicately enrich plentiful details while preserving the visual appearance of key content in the original image. In this paper, we propose a novel framework, namely FreeEnhance, for content-consistent image enhancement using the off-the-shelf image diffusion models. Technically, FreeEnhance is a two-stage process that firstly adds random noise to the input image and then capitalizes on a pre-trained image diffusion model (i.e., Latent Diffusion Models) to denoise and enhance the image details. In the noising stage, FreeEnhance is devised to add lighter noise to the region with higher frequency to preserve the high-frequent patterns (e.g., edge, corner) in the original image. In the denoising stage, we present three target properties as constraints to regularize the predicted noise, enhancing images with high acutance and high visual quality. Extensive experiments conducted on the HPDv2 dataset demonstrate that our FreeEnhance outperforms the state-of-the-art image enhancement models in terms of quantitative metrics and human preference. More remarkably, FreeEnhance also shows higher human preference compared to the commercial image enhancement solution of Magnific AI.

* ACM Multimedia 2024

Via

Access Paper or Ask Questions

DreamMesh: Jointly Manipulating and Texturing Triangle Meshes for Text-to-3D Generation

Sep 11, 2024

Haibo Yang, Yang Chen, Yingwei Pan, Ting Yao, Zhineng Chen, Zuxuan Wu, Yu-Gang Jiang, Tao Mei

Figure 1 for DreamMesh: Jointly Manipulating and Texturing Triangle Meshes for Text-to-3D Generation

Figure 2 for DreamMesh: Jointly Manipulating and Texturing Triangle Meshes for Text-to-3D Generation

Figure 3 for DreamMesh: Jointly Manipulating and Texturing Triangle Meshes for Text-to-3D Generation

Figure 4 for DreamMesh: Jointly Manipulating and Texturing Triangle Meshes for Text-to-3D Generation

Abstract:Learning radiance fields (NeRF) with powerful 2D diffusion models has garnered popularity for text-to-3D generation. Nevertheless, the implicit 3D representations of NeRF lack explicit modeling of meshes and textures over surfaces, and such surface-undefined way may suffer from the issues, e.g., noisy surfaces with ambiguous texture details or cross-view inconsistency. To alleviate this, we present DreamMesh, a novel text-to-3D architecture that pivots on well-defined surfaces (triangle meshes) to generate high-fidelity explicit 3D model. Technically, DreamMesh capitalizes on a distinctive coarse-to-fine scheme. In the coarse stage, the mesh is first deformed by text-guided Jacobians and then DreamMesh textures the mesh with an interlaced use of 2D diffusion models in a tuning free manner from multiple viewpoints. In the fine stage, DreamMesh jointly manipulates the mesh and refines the texture map, leading to high-quality triangle meshes with high-fidelity textured materials. Extensive experiments demonstrate that DreamMesh significantly outperforms state-of-the-art text-to-3D methods in faithfully generating 3D content with richer textual details and enhanced geometry. Our project page is available at https://dreammesh.github.io.

* ECCV 2024. Project page is available at \url{https://dreammesh.github.io}

Via

Access Paper or Ask Questions

Motion Capture from Inertial and Vision Sensors

Jul 23, 2024

Xiaodong Chen, Wu Liu, Qian Bao, Xinchen Liu, Quanwei Yang, Ruoli Dai, Tao Mei

Figure 1 for Motion Capture from Inertial and Vision Sensors

Figure 2 for Motion Capture from Inertial and Vision Sensors

Figure 3 for Motion Capture from Inertial and Vision Sensors

Figure 4 for Motion Capture from Inertial and Vision Sensors

Abstract:Human motion capture is the foundation for many computer vision and graphics tasks. While industrial motion capture systems with complex camera arrays or expensive wearable sensors have been widely adopted in movie and game production, consumer-affordable and easy-to-use solutions for personal applications are still far from mature. To utilize a mixture of a monocular camera and very few inertial measurement units (IMUs) for accurate multi-modal human motion capture in daily life, we contribute MINIONS in this paper, a large-scale Motion capture dataset collected from INertial and visION Sensors. MINIONS has several featured properties: 1) large scale of over five million frames and 400 minutes duration; 2) multi-modality data of IMUs signals and RGB videos labeled with joint positions, joint rotations, SMPL parameters, etc.; 3) a diverse set of 146 fine-grained single and interactive actions with textual descriptions. With the proposed MINIONS, we conduct experiments on multi-modal motion capture and explore the possibilities of consumer-affordable motion capture using a monocular camera and very few IMUs. The experiment results emphasize the unique advantages of inertial and vision sensors, showcasing the promise of consumer-affordable multi-modal motion capture and providing a valuable resource for further research and development.

* 17 pages,9 figures

Via

Access Paper or Ask Questions

Prompt Refinement with Image Pivot for Text-to-Image Generation

Jun 28, 2024

Jingtao Zhan, Qingyao Ai, Yiqun Liu, Yingwei Pan, Ting Yao, Jiaxin Mao, Shaoping Ma, Tao Mei

Figure 1 for Prompt Refinement with Image Pivot for Text-to-Image Generation

Figure 2 for Prompt Refinement with Image Pivot for Text-to-Image Generation

Figure 3 for Prompt Refinement with Image Pivot for Text-to-Image Generation

Figure 4 for Prompt Refinement with Image Pivot for Text-to-Image Generation

Abstract:For text-to-image generation, automatically refining user-provided natural language prompts into the keyword-enriched prompts favored by systems is essential for the user experience. Such a prompt refinement process is analogous to translating the prompt from "user languages" into "system languages". However, the scarcity of such parallel corpora makes it difficult to train a prompt refinement model. Inspired by zero-shot machine translation techniques, we introduce Prompt Refinement with Image Pivot (PRIP). PRIP innovatively uses the latent representation of a user-preferred image as an intermediary "pivot" between the user and system languages. It decomposes the refinement process into two data-rich tasks: inferring representations of user-preferred images from user languages and subsequently translating image representations into system languages. Thus, it can leverage abundant data for training. Extensive experiments show that PRIP substantially outperforms a wide range of baselines and effectively transfers to unseen systems in a zero-shot manner.

* Accepted by ACL 2024

Via

Access Paper or Ask Questions

Boosting Diffusion Models with Moving Average Sampling in Frequency Domain

Mar 26, 2024

Yurui Qian, Qi Cai, Yingwei Pan, Yehao Li, Ting Yao, Qibin Sun, Tao Mei

Figure 1 for Boosting Diffusion Models with Moving Average Sampling in Frequency Domain

Figure 2 for Boosting Diffusion Models with Moving Average Sampling in Frequency Domain

Figure 3 for Boosting Diffusion Models with Moving Average Sampling in Frequency Domain

Figure 4 for Boosting Diffusion Models with Moving Average Sampling in Frequency Domain

Abstract:Diffusion models have recently brought a powerful revolution in image generation. Despite showing impressive generative capabilities, most of these models rely on the current sample to denoise the next one, possibly resulting in denoising instability. In this paper, we reinterpret the iterative denoising process as model optimization and leverage a moving average mechanism to ensemble all the prior samples. Instead of simply applying moving average to the denoised samples at different timesteps, we first map the denoised samples to data space and then perform moving average to avoid distribution shift across timesteps. In view that diffusion models evolve the recovery from low-frequency components to high-frequency details, we further decompose the samples into different frequency components and execute moving average separately on each component. We name the complete approach "Moving Average Sampling in Frequency domain (MASF)". MASF could be seamlessly integrated into mainstream pre-trained diffusion models and sampling schedules. Extensive experiments on both unconditional and conditional diffusion models demonstrate that our MASF leads to superior performances compared to the baselines, with almost negligible additional complexity cost.

* CVPR 2024

Via

Access Paper or Ask Questions

Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution

Mar 25, 2024

Zhikai Chen, Fuchen Long, Zhaofan Qiu, Ting Yao, Wengang Zhou, Jiebo Luo, Tao Mei

Figure 1 for Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution

Figure 2 for Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution

Figure 3 for Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution

Figure 4 for Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution

Abstract:Diffusion models are just at a tipping point for image super-resolution task. Nevertheless, it is not trivial to capitalize on diffusion models for video super-resolution which necessitates not only the preservation of visual appearance from low-resolution to high-resolution videos, but also the temporal consistency across video frames. In this paper, we propose a novel approach, pursuing Spatial Adaptation and Temporal Coherence (SATeCo), for video super-resolution. SATeCo pivots on learning spatial-temporal guidance from low-resolution videos to calibrate both latent-space high-resolution video denoising and pixel-space video reconstruction. Technically, SATeCo freezes all the parameters of the pre-trained UNet and VAE, and only optimizes two deliberately-designed spatial feature adaptation (SFA) and temporal feature alignment (TFA) modules, in the decoder of UNet and VAE. SFA modulates frame features via adaptively estimating affine parameters for each pixel, guaranteeing pixel-wise guidance for high-resolution frame synthesis. TFA delves into feature interaction within a 3D local window (tubelet) through self-attention, and executes cross-attention between tubelet and its low-resolution counterpart to guide temporal feature alignment. Extensive experiments conducted on the REDS4 and Vid4 datasets demonstrate the effectiveness of our approach.

* CVPR 2024

Via

Access Paper or Ask Questions