Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Qifeng Chen

Follow-Your-Click: Open-domain Regional Image Animation via Short Prompts

Mar 13, 2024

Yue Ma, Yingqing He, Hongfa Wang, Andong Wang, Chenyang Qi, Chengfei Cai, Xiu Li, Zhifeng Li, Heung-Yeung Shum, Wei Liu(+1 more)

Figure 1 for Follow-Your-Click: Open-domain Regional Image Animation via Short Prompts

Figure 2 for Follow-Your-Click: Open-domain Regional Image Animation via Short Prompts

Figure 3 for Follow-Your-Click: Open-domain Regional Image Animation via Short Prompts

Figure 4 for Follow-Your-Click: Open-domain Regional Image Animation via Short Prompts

Abstract:Despite recent advances in image-to-video generation, better controllability and local animation are less explored. Most existing image-to-video methods are not locally aware and tend to move the entire scene. However, human artists may need to control the movement of different objects or regions. Additionally, current I2V methods require users not only to describe the target motion but also to provide redundant detailed descriptions of frame contents. These two issues hinder the practical utilization of current I2V tools. In this paper, we propose a practical framework, named Follow-Your-Click, to achieve image animation with a simple user click (for specifying what to move) and a short motion prompt (for specifying how to move). Technically, we propose the first-frame masking strategy, which significantly improves the video generation quality, and a motion-augmented module equipped with a short motion prompt dataset to improve the short prompt following abilities of our model. To further control the motion speed, we propose flow-based motion magnitude control to control the speed of target movement more precisely. Our framework has simpler yet precise user control and better generation performance than previous methods. Extensive experiments compared with 7 baselines, including both commercial tools and research methods on 8 metrics, suggest the superiority of our approach. Project Page: https://follow-your-click.github.io/

* Project Page: https://follow-your-click.github.io/ Github Page: https://github.com/mayuelala/FollowYourClick

Via

Access Paper or Ask Questions

Cross-Cluster Shifting for Efficient and Effective 3D Object Detection in Autonomous Driving

Mar 10, 2024

Zhili Chen, Kien T. Pham, Maosheng Ye, Zhiqiang Shen, Qifeng Chen

Figure 1 for Cross-Cluster Shifting for Efficient and Effective 3D Object Detection in Autonomous Driving

Figure 2 for Cross-Cluster Shifting for Efficient and Effective 3D Object Detection in Autonomous Driving

Figure 3 for Cross-Cluster Shifting for Efficient and Effective 3D Object Detection in Autonomous Driving

Figure 4 for Cross-Cluster Shifting for Efficient and Effective 3D Object Detection in Autonomous Driving

Abstract:We present a new 3D point-based detector model, named Shift-SSD, for precise 3D object detection in autonomous driving. Traditional point-based 3D object detectors often employ architectures that rely on a progressive downsampling of points. While this method effectively reduces computational demands and increases receptive fields, it will compromise the preservation of crucial non-local information for accurate 3D object detection, especially in the complex driving scenarios. To address this, we introduce an intriguing Cross-Cluster Shifting operation to unleash the representation capacity of the point-based detector by efficiently modeling longer-range inter-dependency while including only a negligible overhead. Concretely, the Cross-Cluster Shifting operation enhances the conventional design by shifting partial channels from neighboring clusters, which enables richer interaction with non-local regions and thus enlarges the receptive field of clusters. We conduct extensive experiments on the KITTI, Waymo, and nuScenes datasets, and the results demonstrate the state-of-the-art performance of Shift-SSD in both detection accuracy and runtime efficiency.

* ICRA2024

Via

Access Paper or Ask Questions

Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners

Feb 27, 2024

Yazhou Xing, Yingqing He, Zeyue Tian, Xintao Wang, Qifeng Chen

Figure 1 for Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners

Figure 2 for Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners

Figure 3 for Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners

Figure 4 for Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners

Abstract:Video and audio content creation serves as the core technique for the movie industry and professional users. Recently, existing diffusion-based methods tackle video and audio generation separately, which hinders the technique transfer from academia to industry. In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation. We observe the powerful generation ability of off-the-shelf video or audio generation models. Thus, instead of training the giant models from scratch, we propose to bridge the existing strong models with a shared latent representation space. Specifically, we propose a multimodality latent aligner with the pre-trained ImageBind model. Our latent aligner shares a similar core as the classifier guidance that guides the diffusion denoising process during inference time. Through carefully designed optimization strategy and loss functions, we show the superior performance of our method on joint video-audio generation, visual-steered audio generation, and audio-steered visual generation tasks. The project website can be found at https://yzxing87.github.io/Seeing-and-Hearing/

* Accepted to CVPR 2024. Project website: https://yzxing87.github.io/Seeing-and-Hearing/

Via

Access Paper or Ask Questions

Real-time 3D-aware Portrait Editing from a Single Image

Feb 21, 2024

Qingyan Bai, Yinghao Xu, Zifan Shi, Hao Ouyang, Qiuyu Wang, Ceyuan Yang, Xuan Wang, Gordon Wetzstein, Yujun Shen, Qifeng Chen

Figure 1 for Real-time 3D-aware Portrait Editing from a Single Image

Figure 2 for Real-time 3D-aware Portrait Editing from a Single Image

Figure 3 for Real-time 3D-aware Portrait Editing from a Single Image

Figure 4 for Real-time 3D-aware Portrait Editing from a Single Image

Abstract:This work presents 3DPE, a practical tool that can efficiently edit a face image following given prompts, like reference images or text descriptions, in the 3D-aware manner. To this end, a lightweight module is distilled from a 3D portrait generator and a text-to-image model, which provide prior knowledge of face geometry and open-vocabulary editing capability, respectively. Such a design brings two compelling advantages over existing approaches. First, our system achieves real-time editing with a feedforward network (i.e., ~0.04s per image), over 100x faster than the second competitor. Second, thanks to the powerful priors, our module could focus on the learning of editing-related variations, such that it manages to handle various types of editing simultaneously in the training phase and further supports fast adaptation to user-specified novel types of editing during inference (e.g., with ~5min fine-tuning per case). The code, the model, and the interface will be made publicly available to facilitate future research.

Via

Access Paper or Ask Questions

Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation

Feb 16, 2024

Lanqing Guo, Yingqing He, Haoxin Chen, Menghan Xia, Xiaodong Cun, Yufei Wang, Siyu Huang, Yong Zhang, Xintao Wang, Qifeng Chen(+2 more)

Figure 1 for Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation

Figure 2 for Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation

Figure 3 for Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation

Figure 4 for Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation

Abstract:Diffusion models have proven to be highly effective in image and video generation; however, they still face composition challenges when generating images of varying sizes due to single-scale training data. Adapting large pre-trained diffusion models for higher resolution demands substantial computational and optimization resources, yet achieving a generation capability comparable to low-resolution models remains elusive. This paper proposes a novel self-cascade diffusion model that leverages the rich knowledge gained from a well-trained low-resolution model for rapid adaptation to higher-resolution image and video generation, employing either tuning-free or cheap upsampler tuning paradigms. Integrating a sequence of multi-scale upsampler modules, the self-cascade diffusion model can efficiently adapt to a higher resolution, preserving the original composition and generation capabilities. We further propose a pivot-guided noise re-schedule strategy to speed up the inference process and improve local structural details. Compared to full fine-tuning, our approach achieves a 5X training speed-up and requires only an additional 0.002M tuning parameters. Extensive experiments demonstrate that our approach can quickly adapt to higher resolution image and video synthesis by fine-tuning for just 10k steps, with virtually no additional inference time.

* Project Page: https://guolanqing.github.io/Self-Cascade/

Via

Access Paper or Ask Questions

Using Left and Right Brains Together: Towards Vision and Language Planning

Feb 16, 2024

Jun Cen, Chenfei Wu, Xiao Liu, Shengming Yin, Yixuan Pei, Jinglong Yang, Qifeng Chen, Nan Duan, Jianguo Zhang

Figure 1 for Using Left and Right Brains Together: Towards Vision and Language Planning

Figure 2 for Using Left and Right Brains Together: Towards Vision and Language Planning

Figure 3 for Using Left and Right Brains Together: Towards Vision and Language Planning

Figure 4 for Using Left and Right Brains Together: Towards Vision and Language Planning

Abstract:Large Language Models (LLMs) and Large Multi-modality Models (LMMs) have demonstrated remarkable decision masking capabilities on a variety of tasks. However, they inherently operate planning within the language space, lacking the vision and spatial imagination ability. In contrast, humans utilize both left and right hemispheres of the brain for language and visual planning during the thinking process. Therefore, we introduce a novel vision-language planning framework in this work to perform concurrent visual and language planning for tasks with inputs of any form. Our framework incorporates visual planning to capture intricate environmental details, while language planning enhances the logical coherence of the overall system. We evaluate the effectiveness of our framework across vision-language tasks, vision-only tasks, and language-only tasks. The results demonstrate the superior performance of our approach, indicating that the integration of visual and language planning yields better contextually aware task execution.

* 19 pages, 13 figures

Via

Access Paper or Ask Questions

ENTED: Enhanced Neural Texture Extraction and Distribution for Reference-based Blind Face Restoration

Jan 13, 2024

Yuen-Fui Lau, Tianjia Zhang, Zhefan Rao, Qifeng Chen

Abstract:We present ENTED, a new framework for blind face restoration that aims to restore high-quality and realistic portrait images. Our method involves repairing a single degraded input image using a high-quality reference image. We utilize a texture extraction and distribution framework to transfer high-quality texture features between the degraded input and reference image. However, the StyleGAN-like architecture in our framework requires high-quality latent codes to generate realistic images. The latent code extracted from the degraded input image often contains corrupted features, making it difficult to align the semantic information from the input with the high-quality textures from the reference. To overcome this challenge, we employ two special techniques. The first technique, inspired by vector quantization, replaces corrupted semantic features with high-quality code words. The second technique generates style codes that carry photorealistic texture information from a more informative latent space developed using the high-quality features in the reference image's manifold. Extensive experiments conducted on synthetic and real-world datasets demonstrate that our method produces results with more realistic contextual details and outperforms state-of-the-art methods. A thorough ablation study confirms the effectiveness of each proposed module.

Via

Access Paper or Ask Questions

DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation

Jan 09, 2024

Junming Chen, Yunfei Liu, Jianan Wang, Ailing Zeng, Yu Li, Qifeng Chen

Abstract:We propose DiffSHEG, a Diffusion-based approach for Speech-driven Holistic 3D Expression and Gesture generation with arbitrary length. While previous works focused on co-speech gesture or expression generation individually, the joint generation of synchronized expressions and gestures remains barely explored. To address this, our diffusion-based co-speech motion generation transformer enables uni-directional information flow from expression to gesture, facilitating improved matching of joint expression-gesture distributions. Furthermore, we introduce an outpainting-based sampling strategy for arbitrary long sequence generation in diffusion models, offering flexibility and computational efficiency. Our method provides a practical solution that produces high-quality synchronized expression and gesture generation driven by speech. Evaluated on two public datasets, our approach achieves state-of-the-art performance both quantitatively and qualitatively. Additionally, a user study confirms the superiority of DiffSHEG over prior approaches. By enabling the real-time generation of expressive and synchronized motions, DiffSHEG showcases its potential for various applications in the development of digital humans and embodied agents.

Via

Access Paper or Ask Questions

TIP: Text-Driven Image Processing with Semantic and Restoration Instructions

Dec 18, 2023

Chenyang Qi, Zhengzhong Tu, Keren Ye, Mauricio Delbracio, Peyman Milanfar, Qifeng Chen, Hossein Talebi

Abstract:Text-driven diffusion models have become increasingly popular for various image editing tasks, including inpainting, stylization, and object replacement. However, it still remains an open research problem to adopt this language-vision paradigm for more fine-level image processing tasks, such as denoising, super-resolution, deblurring, and compression artifact removal. In this paper, we develop TIP, a Text-driven Image Processing framework that leverages natural language as a user-friendly interface to control the image restoration process. We consider the capacity of text information in two dimensions. First, we use content-related prompts to enhance the semantic alignment, effectively alleviating identity ambiguity in the restoration outcomes. Second, our approach is the first framework that supports fine-level instruction through language-based quantitative specification of the restoration strength, without the need for explicit task-specific design. In addition, we introduce a novel fusion mechanism that augments the existing ControlNet architecture by learning to rescale the generative prior, thereby achieving better restoration fidelity. Our extensive experiments demonstrate the superior restoration performance of TIP compared to the state of the arts, alongside offering the flexibility of text-based control over the restoration effects.

* Webpage: https://chenyangqiqi.github.io/tip ; code will be released soon

Via

Access Paper or Ask Questions

MagicScroll: Nontypical Aspect-Ratio Image Generation for Visual Storytelling via Multi-Layered Semantic-Aware Denoising

Dec 18, 2023

Bingyuan Wang, Hengyu Meng, Zeyu Cai, Lanjiong Li, Yue Ma, Qifeng Chen, Zeyu Wang

Abstract:Visual storytelling often uses nontypical aspect-ratio images like scroll paintings, comic strips, and panoramas to create an expressive and compelling narrative. While generative AI has achieved great success and shown the potential to reshape the creative industry, it remains a challenge to generate coherent and engaging content with arbitrary size and controllable style, concept, and layout, all of which are essential for visual storytelling. To overcome the shortcomings of previous methods including repetitive content, style inconsistency, and lack of controllability, we propose MagicScroll, a multi-layered, progressive diffusion-based image generation framework with a novel semantic-aware denoising process. The model enables fine-grained control over the generated image on object, scene, and background levels with text, image, and layout conditions. We also establish the first benchmark for nontypical aspect-ratio image generation for visual storytelling including mediums like paintings, comics, and cinematic panoramas, with customized metrics for systematic evaluation. Through comparative and ablation studies, MagicScroll showcases promising results in aligning with the narrative text, improving visual coherence, and engaging the audience. We plan to release the code and benchmark in the hope of a better collaboration between AI researchers and creative practitioners involving visual storytelling.

* Project page: https://magicscroll.github.io/

Via

Access Paper or Ask Questions