Haonan Qiu

FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling

Oct 27, 2023
Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xintao Wang, Ying Shan, Ziwei Liu

With the availability of large-scale video datasets and advances in diffusion models, text-driven video generation has achieved substantial progress. However, existing video generation models are typically trained on a limited number of frames, making them unable to generate high-fidelity long videos at inference time. Furthermore, these models only support single-text conditions, whereas real-life scenarios often require multi-text conditions as the video content changes over time. To tackle these challenges, this study explores the potential of extending the text-driven capability to generate longer videos conditioned on multiple texts. 1) We first analyze the impact of initial noise in video diffusion models. Building on this observation, we propose FreeNoise, a tuning-free and time-efficient paradigm to enhance the generative capabilities of pretrained video diffusion models while preserving content consistency. Specifically, instead of initializing independent noise for every frame, we reschedule a sequence of noises to establish long-range correlation and perform temporal attention over them via a window-based function. 2) Additionally, we design a novel motion injection method to support the generation of videos conditioned on multiple text prompts. Extensive experiments validate the superiority of our paradigm in extending the generative capabilities of video diffusion models. It is noteworthy that, compared with the previous best-performing method, which incurred about 255% extra time cost, our method adds only a negligible time cost of approximately 17%. Generated video samples are available at our website: http://haonanqiu.com/projects/FreeNoise.html.
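
The noise-rescheduling idea from the abstract can be sketched as follows (a minimal illustration, not the authors' released code; the window size, shuffle strategy, and latent shape are assumptions): rather than drawing independent noise for every frame, a base window of noise frames is reused in shuffled order so that distant frames remain correlated.

```python
import torch

def reschedule_noise(num_frames, base_window=16, channels=4, height=32, width=32,
                     generator=None):
    """Return initial latents of shape (num_frames, C, H, W) built from one base noise window."""
    base = torch.randn(base_window, channels, height, width, generator=generator)
    chunks = [base]
    while sum(c.shape[0] for c in chunks) < num_frames:
        # Reuse the same noise frames in a new random order for each extra window.
        perm = torch.randperm(base_window, generator=generator)
        chunks.append(base[perm])
    return torch.cat(chunks, dim=0)[:num_frames]

noise = reschedule_noise(num_frames=64)
print(noise.shape)  # torch.Size([64, 4, 32, 32])
```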

* Project Page: http://haonanqiu.com/projects/FreeNoise.html; Code Repo: https://github.com/arthur-qiu/LongerCrafter 

ReliTalk: Relightable Talking Portrait Generation from a Single Video

Sep 05, 2023
Haonan Qiu, Zhaoxi Chen, Yuming Jiang, Hang Zhou, Xiangyu Fan, Lei Yang, Wayne Wu, Ziwei Liu

Recent years have witnessed great progress in creating vivid audio-driven portraits from monocular videos. However, how to seamlessly adapt the created video avatars to other scenarios with different backgrounds and lighting conditions remains unsolved. On the other hand, existing relighting studies mostly rely on dynamically lit or multi-view data, which is too expensive for creating video portraits. To bridge this gap, we propose ReliTalk, a novel framework for relightable audio-driven talking portrait generation from monocular videos. Our key insight is to decompose the portrait's reflectance from implicitly learned audio-driven facial normals and images. Specifically, we use 3D facial priors derived from audio features to predict detailed normal maps through implicit functions. These initial normal predictions then play a crucial role in reflectance decomposition by dynamically estimating the lighting condition of the given video. Moreover, the stereoscopic face representation is refined using an identity-consistent loss under multiple simulated lighting conditions, addressing the ill-posed problem caused by the limited views available from a single monocular video. Extensive experiments validate the superiority of our proposed framework on both real and synthetic datasets. Our code is released at https://github.com/arthur-qiu/ReliTalk.
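
The abstract does not spell out the exact reflectance model, but the decomposition step can be illustrated with a simple Lambertian assumption, where an image is approximated as albedo times shading and shading is computed from the predicted normals and an estimated light direction (a hedged sketch; the tensor shapes and the single directional light are assumptions):

```python
import torch

def lambertian_decompose(image, normals, light_dir, eps=1e-4):
    """image: (3, H, W) in [0, 1]; normals: (3, H, W) unit vectors; light_dir: (3,)."""
    light = light_dir / light_dir.norm()
    shading = (normals * light.view(3, 1, 1)).sum(dim=0).clamp(min=0.0)  # (H, W)
    albedo = image / (shading.unsqueeze(0) + eps)                        # reflectance estimate
    return albedo.clamp(0.0, 1.0), shading

H = W = 64
normals = torch.nn.functional.normalize(torch.randn(3, H, W), dim=0)    # stand-in predicted normals
image = torch.rand(3, H, W)
albedo, shading = lambertian_decompose(image, normals, torch.tensor([0.0, 0.0, 1.0]))
print(albedo.shape, shading.shape)
```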

Temporal Contrastive Learning for Spiking Neural Networks

May 23, 2023
Haonan Qiu, Zeyin Song, Yanqi Chen, Munan Ning, Wei Fang, Tao Sun, Zhengyu Ma, Li Yuan, Yonghong Tian

Biologically inspired spiking neural networks (SNNs) have garnered considerable attention due to their low energy consumption and spatio-temporal information processing capabilities. Most existing SNN training methods first integrate output information across time steps and then adopt the cross-entropy (CE) loss to supervise the prediction of the averaged representations. In this work, however, we find that this approach is not ideal for SNN training, as it ignores the temporal dynamics of SNNs and its performance degrades quickly as the number of inference time steps decreases. One tempting way to model temporal correlations is to apply the same label supervision at each time step and treat all steps identically. Although this yields relatively consistent performance across time steps, it still struggles to produce high-performing SNNs. Inspired by these observations, we propose the Temporal-domain supervised Contrastive Learning (TCL) framework, a novel method to obtain SNNs with low latency and high performance by incorporating contrastive supervision with temporal-domain information. Contrastive learning (CL) prompts the network to discern both consistency and variability in the representation space, enabling it to learn more discriminative and generalizable features. We extend this concept to the temporal domain of SNNs, allowing us to flexibly and fully leverage the correlation between representations at different time steps. Furthermore, we propose a Siamese Temporal-domain supervised Contrastive Learning (STCL) framework to further enhance SNNs via augmentation, temporal, and class constraints simultaneously. Extensive experimental results demonstrate that SNNs trained with our TCL and STCL achieve both high performance and low latency, reaching state-of-the-art results on a variety of datasets (e.g., CIFAR-10, CIFAR-100, and DVS-CIFAR10).
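
The core idea of temporal-domain supervised contrastive learning can be sketched as a supervised contrastive loss in which every per-time-step representation serves as an anchor and representations sharing its class label, at any time step, are its positives (a rough sketch following the standard SupCon formulation; not the paper's implementation, and the temperature and shapes are assumptions):

```python
import torch
import torch.nn.functional as F

def temporal_supcon_loss(features, labels, temperature=0.1):
    """features: (B, T, D) per-time-step representations; labels: (B,) class ids."""
    B, T, D = features.shape
    z = F.normalize(features.reshape(B * T, D), dim=1)          # one anchor per (sample, step)
    lab = labels.repeat_interleave(T)                           # (B*T,)
    sim = z @ z.t() / temperature                               # pairwise similarities
    self_mask = torch.eye(B * T, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, -1e9)                      # exclude self-pairs
    pos_mask = (lab.unsqueeze(0) == lab.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Average log-probability over each anchor's positives (same class, any time step).
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob * pos_mask).sum(dim=1) / pos_counts
    return loss[pos_mask.any(dim=1)].mean()

loss = temporal_supcon_loss(torch.randn(8, 4, 128), torch.randint(0, 10, (8,)))
print(loss.item())
```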

StyleFaceV: Face Video Generation via Decomposing and Recomposing Pretrained StyleGAN3

Aug 16, 2022
Haonan Qiu, Yuming Jiang, Hang Zhou, Wayne Wu, Ziwei Liu

Realistic generative face video synthesis has long been a pursuit in both the computer vision and graphics communities. However, existing face video generation methods tend to produce low-quality frames with drifting facial identities and unnatural movements. To tackle these challenges, we propose a principled framework named StyleFaceV, which produces high-fidelity, identity-preserving face videos with vivid movements. Our core insight is to decompose appearance and pose information and recompose them in the latent space of StyleGAN3 to produce stable and dynamic results. Specifically, StyleGAN3 provides strong priors for high-fidelity facial image generation, but its latent space is intrinsically entangled. By carefully examining its latent properties, we propose decomposition and recomposition designs that allow for the disentangled combination of facial appearance and movements. Moreover, a temporal-dependent model built upon the decomposed latent features samples reasonable motion sequences, enabling realistic and temporally coherent face videos. In particular, our pipeline is trained jointly on static images and high-quality video data, which improves data efficiency. Extensive experiments demonstrate that our framework achieves state-of-the-art face video generation results both qualitatively and quantitatively. Notably, StyleFaceV is capable of generating realistic $1024\times1024$ face videos even without high-resolution training videos.
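
A toy sketch of the decompose-and-recompose idea is given below: one appearance latent shared across a clip is recomposed with per-frame pose latents to form the per-frame codes that a pretrained generator (StyleGAN3 in the paper) would consume. The MLP recomposer and all dimensions are illustrative assumptions, not the StyleFaceV architecture:

```python
import torch
import torch.nn as nn

class Recomposer(nn.Module):
    """Toy recomposition module: combines one appearance code with per-frame pose codes."""
    def __init__(self, app_dim=512, pose_dim=64, w_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(app_dim + pose_dim, 512), nn.LeakyReLU(0.2), nn.Linear(512, w_dim))

    def forward(self, appearance, poses):
        """appearance: (1, app_dim); poses: (T, pose_dim) -> per-frame latents (T, w_dim)."""
        app = appearance.expand(poses.shape[0], -1)
        return self.mlp(torch.cat([app, poses], dim=1))

recompose = Recomposer()
appearance = torch.randn(1, 512)   # shared identity/appearance code for the clip
poses = torch.randn(16, 64)        # one pose code per frame
frame_latents = recompose(appearance, poses)
print(frame_latents.shape)         # torch.Size([16, 512]); would be fed to a pretrained generator
```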

* Project Page: http://haonanqiu.com/projects/StyleFaceV.html; Code Repo: https://github.com/arthur-qiu/StyleFaceV 

Text2Human: Text-Driven Controllable Human Image Generation

May 31, 2022
Yuming Jiang, Shuai Yang, Haonan Qiu, Wayne Wu, Chen Change Loy, Ziwei Liu

Generating high-quality and diverse human images is an important yet challenging task in vision and graphics. However, existing generative models often fall short under the high diversity of clothing shapes and textures. Furthermore, the generation process should ideally be intuitively controllable for layman users. In this work, we present a text-driven controllable framework, Text2Human, for high-quality and diverse human image generation. We synthesize full-body human images starting from a given human pose with two dedicated steps. 1) Given text describing the shapes of clothes, the human pose is first translated into a human parsing map. 2) The final human image is then generated by providing the system with more attributes describing the textures of clothes. Specifically, to model the diversity of clothing textures, we build a hierarchical texture-aware codebook that stores multi-scale neural representations for each type of texture. The codebook at the coarse level includes the structural representations of textures, while the codebook at the fine level focuses on texture details. To make use of the learned hierarchical codebook to synthesize desired images, a diffusion-based transformer sampler with mixture-of-experts is first employed to sample indices from the coarsest level of the codebook, which are then used to predict the indices of the codebook at finer levels. The predicted indices at different levels are translated into human images by a decoder learned jointly with the hierarchical codebooks. The use of mixture-of-experts allows the generated image to be conditioned on fine-grained text input. The prediction of finer-level indices refines the quality of clothing textures. Extensive quantitative and qualitative evaluations demonstrate that our proposed framework can generate more diverse and realistic human images compared to state-of-the-art methods.
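
The hierarchical codebook can be pictured with the simplified sketch below: coarse indices capture texture structure, and a small predictor maps each coarse code to indices in a finer codebook that adds texture detail. Codebook sizes, dimensions, and the predictor are illustrative assumptions rather than the Text2Human implementation:

```python
import torch
import torch.nn as nn

class HierarchicalCodebook(nn.Module):
    """Toy two-level codebook: coarse entries for structure, fine entries for detail."""
    def __init__(self, n_coarse=256, n_fine=1024, dim=64):
        super().__init__()
        self.coarse = nn.Embedding(n_coarse, dim)
        self.fine = nn.Embedding(n_fine, dim)
        self.fine_predictor = nn.Linear(dim, n_fine)   # predicts fine indices from coarse codes

    def forward(self, coarse_idx):
        """coarse_idx: (H, W) integer map -> (coarse codes, predicted fine codes)."""
        coarse_codes = self.coarse(coarse_idx)                       # (H, W, dim)
        fine_idx = self.fine_predictor(coarse_codes).argmax(dim=-1)  # (H, W)
        return coarse_codes, self.fine(fine_idx)

book = HierarchicalCodebook()
coarse_idx = torch.randint(0, 256, (16, 16))    # e.g., indices sampled by a transformer sampler
coarse_codes, fine_codes = book(coarse_idx)
print(coarse_codes.shape, fine_codes.shape)     # both (16, 16, 64)
```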

* SIGGRAPH 2022; Project Page: https://yumingj.github.io/projects/Text2Human.html, Codes available at https://github.com/yumingj/Text2Human 

Few-shot Forgery Detection via Guided Adversarial Interpolation

Apr 12, 2022
Haonan Qiu, Siyu Chen, Bei Gan, Kun Wang, Huafeng Shi, Jing Shao, Ziwei Liu

Realistic visual media synthesis is becoming a critical societal issue with the surge of face manipulation models, as new forgery approaches emerge at an unprecedented pace. Unfortunately, existing forgery detection methods suffer significant performance drops when applied to novel forgery approaches. In this work, we address the few-shot forgery detection problem by designing a comprehensive benchmark based on coverage analysis among various forgery approaches, and by proposing Guided Adversarial Interpolation (GAI). Our key insight is that there exist transferable distribution characteristics between the majority and minority forgery classes. Specifically, we enhance the discriminative ability against novel forgery approaches by adversarially interpolating the artifacts of minority samples into majority samples under the guidance of a teacher network. Unlike standard re-balancing methods, which usually result in over-fitting to the minority classes, our method simultaneously accounts for the diversity of majority information and the significance of minority information. Extensive experiments demonstrate that our GAI achieves state-of-the-art performance on the established few-shot forgery detection benchmark. Notably, our method is also validated to be robust to the choice of majority and minority forgery approaches.
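
A hedged sketch of the adversarial-interpolation idea: blend a minority-forgery sample into a majority sample with a learnable coefficient and push that coefficient, by gradient ascent, toward mixes the teacher network finds hardest to classify. The pixel-space blend, the stand-in teacher, and the loss choice are assumptions, not the paper's exact procedure:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def guided_interpolate(x_major, x_minor, y_minor, teacher, steps=10, lr=0.05):
    """x_major, x_minor: (B, C, H, W) images; y_minor: (B,) labels of the minority forgery class."""
    lam = torch.full((x_major.shape[0], 1, 1, 1), 0.3, requires_grad=True)
    for _ in range(steps):
        x_mix = (1 - lam) * x_major + lam * x_minor
        # Ascend on the teacher's loss so the mix remains hard for the teacher to classify.
        loss = F.cross_entropy(teacher(x_mix), y_minor)
        grad, = torch.autograd.grad(loss, lam)
        with torch.no_grad():
            lam += lr * grad.sign()
            lam.clamp_(0.0, 1.0)
    return ((1 - lam) * x_major + lam * x_minor).detach()

teacher = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 2))  # stand-in teacher network
x_major, x_minor = torch.rand(4, 3, 32, 32), torch.rand(4, 3, 32, 32)
x_aug = guided_interpolate(x_major, x_minor, torch.ones(4, dtype=torch.long), teacher)
print(x_aug.shape)
```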

SemanticAdv: Generating Adversarial Examples via Attribute-conditional Image Editing

Jun 19, 2019
Haonan Qiu, Chaowei Xiao, Lei Yang, Xinchen Yan, Honglak Lee, Bo Li

Deep neural networks (DNNs) have achieved great success in various applications due to their strong expressive power. However, recent studies have shown that DNNs are vulnerable to adversarial examples: manipulated instances crafted to mislead DNNs into making incorrect predictions. Currently, most such adversarial examples try to guarantee "subtle perturbation" by limiting their $L_p$ norm. In this paper, we aim to explore the impact of semantic manipulation on DNN predictions by manipulating the semantic attributes of images to generate "unrestricted adversarial examples". Such semantic-based perturbation is more practical than pixel-level manipulation. In particular, we propose SemanticAdv, an algorithm that leverages disentangled semantic factors to generate adversarial perturbations by altering either a single semantic attribute or a combination of attributes. We conduct extensive experiments to show that semantic-based adversarial examples can not only fool models for different learning tasks, such as face verification and landmark detection, but also achieve a high attack success rate against real-world black-box services such as the Azure face verification service. Such structured adversarial examples with controlled semantic manipulation can shed light on the vulnerabilities of DNNs as well as potential defensive approaches.
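
The attribute-space attack can be sketched loosely as follows: search over a continuous attribute vector fed to an attribute-conditional generator so that the edited image misleads a victim model, instead of bounding a pixel-space $L_p$ norm. The dummy generator, victim, and optimizer settings below are all assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DummyEditor(nn.Module):
    """Stand-in for an attribute-conditional generator (hypothetical, for illustration only)."""
    def __init__(self, attr_dim=8):
        super().__init__()
        self.proj = nn.Linear(attr_dim, 3)

    def forward(self, x, attr):
        # Shift image colors according to the attribute vector, just to have a differentiable edit.
        return torch.sigmoid(x + self.proj(attr).view(1, 3, 1, 1))

def semantic_attack(x, attr, generator, victim, target, steps=50, lr=0.05):
    """x: (1, C, H, W) image; attr: (1, A) source attributes; target: (1,) desired label."""
    delta = torch.zeros_like(attr, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        x_adv = generator(x, attr + delta)             # semantically edited image
        loss = F.cross_entropy(victim(x_adv), target)  # push prediction toward the target class
        opt.zero_grad()
        loss.backward()
        opt.step()
    return generator(x, attr + delta).detach()

victim = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x_adv = semantic_attack(torch.rand(1, 3, 32, 32), torch.zeros(1, 8),
                        DummyEditor(), victim, torch.tensor([3]))
print(x_adv.shape)
```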

Two-phase Hair Image Synthesis by Self-Enhancing Generative Model

Feb 28, 2019
Haonan Qiu, Chuan Wang, Hang Zhu, Xiangyu Zhu, Jinjin Gu, Xiaoguang Han

Generating plausible hair images from limited guidance, such as sparse sketches or low-resolution images, has become possible with the rise of Generative Adversarial Networks (GANs). Traditional image-to-image translation networks can generate recognizable results, but finer textures are usually lost and blur artifacts are common. In this paper, we propose a two-phase generative model for high-quality hair image synthesis. The two-phase pipeline first generates a coarse image with an existing image translation model, then applies a re-generating network with self-enhancing capability to that coarse image. The self-enhancing capability is achieved by a proposed structure extraction layer, which extracts the texture and orientation map from a hair image. Extensive experiments on two tasks, Sketch2Hair and Hair Super-Resolution, demonstrate that our approach is able to synthesize plausible hair images with finer details and outperforms the state of the art.
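
One common way to compute the orientation map that such a structure extraction layer produces is via the image's structure tensor, i.e., smoothed products of gradients (a sketch under that assumption; the paper's actual layer may extract texture and orientation differently):

```python
import torch
import torch.nn.functional as F

def orientation_map(gray, blur=5):
    """gray: (1, 1, H, W) grayscale image -> per-pixel orientation in radians, (1, 1, H, W)."""
    sobel_x = torch.tensor([[[[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]]])
    sobel_y = sobel_x.transpose(2, 3)
    gx = F.conv2d(gray, sobel_x, padding=1)
    gy = F.conv2d(gray, sobel_y, padding=1)
    # Structure-tensor entries, averaged over a local window.
    box = torch.ones(1, 1, blur, blur) / (blur * blur)
    jxx = F.conv2d(gx * gx, box, padding=blur // 2)
    jyy = F.conv2d(gy * gy, box, padding=blur // 2)
    jxy = F.conv2d(gx * gy, box, padding=blur // 2)
    # Dominant gradient orientation; hair strands run roughly perpendicular to it.
    return 0.5 * torch.atan2(2 * jxy, jxx - jyy)

orient = orientation_map(torch.rand(1, 1, 64, 64))
print(orient.shape)
```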

Precise Temporal Action Localization by Evolving Temporal Proposals

Apr 13, 2018
Haonan Qiu, Yingbin Zheng, Hao Ye, Yao Lu, Feng Wang, Liang He

Locating actions in long untrimmed videos has been a challenging problem in video content analysis. The performance of existing action localization approaches remains unsatisfactory in precisely determining the beginning and end of an action. Imitating the human perception procedure of observation followed by refinement, we propose a novel three-phase action localization framework. Our framework employs an Actionness Network to generate initial proposals through frame-wise similarity grouping, followed by a Refinement Network that adjusts the boundaries of these proposals. Finally, the refined proposals are sent to a Localization Network for further fine-grained location regression. The whole process can be viewed as multi-stage refinement using a novel non-local pyramid feature at various temporal granularities. We evaluate our framework on the THUMOS14 benchmark and obtain a significant improvement over state-of-the-art approaches. In particular, the performance gain is remarkable for precise localization with high IoU thresholds. Our proposed framework achieves mAP@IoU=0.5 of 34.2%.
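
The first stage can be sketched as grouping consecutive frames whose features stay similar into initial proposal segments (a minimal illustration of frame-wise similarity grouping with an assumed cosine-similarity threshold; the refinement and localization networks are not modeled here):

```python
import numpy as np

def group_by_similarity(features, sim_threshold=0.8, min_len=3):
    """features: (T, D) per-frame features -> list of (start, end) proposal segments."""
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    proposals, start = [], 0
    for t in range(1, len(feats)):
        centroid = feats[start:t].mean(axis=0)
        centroid /= np.linalg.norm(centroid)
        if feats[t] @ centroid < sim_threshold:   # frame no longer similar: close the segment
            if t - start >= min_len:
                proposals.append((start, t - 1))
            start = t
    if len(feats) - start >= min_len:
        proposals.append((start, len(feats) - 1))
    return proposals

print(group_by_similarity(np.random.rand(100, 64))[:5])
```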

* ACM International Conference on Multimedia Retrieval (ICMR), 2018, pp. 388-396  