Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Qifeng Chen

EEdit : Rethinking the Spatial and Temporal Redundancy for Efficient Image Editing

Mar 13, 2025

Zexuan Yan, Yue Ma, Chang Zou, Wenteng Chen, Qifeng Chen, Linfeng Zhang

Figure 1 for EEdit : Rethinking the Spatial and Temporal Redundancy for Efficient Image Editing

Figure 2 for EEdit : Rethinking the Spatial and Temporal Redundancy for Efficient Image Editing

Figure 3 for EEdit : Rethinking the Spatial and Temporal Redundancy for Efficient Image Editing

Figure 4 for EEdit : Rethinking the Spatial and Temporal Redundancy for Efficient Image Editing

Abstract:Inversion-based image editing is rapidly gaining momentum while suffering from significant computation overhead, hindering its application in real-time interactive scenarios. In this paper, we rethink that the redundancy in inversion-based image editing exists in both the spatial and temporal dimensions, such as the unnecessary computation in unedited regions and the redundancy in the inversion progress. To tackle these challenges, we propose a practical framework, named EEdit, to achieve efficient image editing. Specifically, we introduce three techniques to solve them one by one. For spatial redundancy, spatial locality caching is introduced to compute the edited region and its neighboring regions while skipping the unedited regions, and token indexing preprocessing is designed to further accelerate the caching. For temporal redundancy, inversion step skipping is proposed to reuse the latent for efficient editing. Our experiments demonstrate an average of 2.46 $\times$ acceleration without performance drop in a wide range of editing tasks including prompt-guided image editing, dragging and image composition. Our codes are available at https://github.com/yuriYanZeXuan/EEdit

* 17 pages

Via

Access Paper or Ask Questions

Generative Artificial Intelligence in Robotic Manipulation: A Survey

Mar 05, 2025

Kun Zhang, Peng Yun, Jun Cen, Junhao Cai, Didi Zhu, Hangjie Yuan, Chao Zhao, Tao Feng, Michael Yu Wang, Qifeng Chen(+3 more)

Figure 1 for Generative Artificial Intelligence in Robotic Manipulation: A Survey

Figure 2 for Generative Artificial Intelligence in Robotic Manipulation: A Survey

Figure 3 for Generative Artificial Intelligence in Robotic Manipulation: A Survey

Figure 4 for Generative Artificial Intelligence in Robotic Manipulation: A Survey

Abstract:This survey provides a comprehensive review on recent advancements of generative learning models in robotic manipulation, addressing key challenges in the field. Robotic manipulation faces critical bottlenecks, including significant challenges in insufficient data and inefficient data acquisition, long-horizon and complex task planning, and the multi-modality reasoning ability for robust policy learning performance across diverse environments. To tackle these challenges, this survey introduces several generative model paradigms, including Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), diffusion models, probabilistic flow models, and autoregressive models, highlighting their strengths and limitations. The applications of these models are categorized into three hierarchical layers: the Foundation Layer, focusing on data generation and reward generation; the Intermediate Layer, covering language, code, visual, and state generation; and the Policy Layer, emphasizing grasp generation and trajectory generation. Each layer is explored in detail, along with notable works that have advanced the state of the art. Finally, the survey outlines future research directions and challenges, emphasizing the need for improved efficiency in data utilization, better handling of long-horizon tasks, and enhanced generalization across diverse robotic scenarios. All the related resources, including research papers, open-source data, and projects, are collected for the community in https://github.com/GAI4Manipulation/AwesomeGAIManipulation

Via

Access Paper or Ask Questions

MangaNinja: Line Art Colorization with Precise Reference Following

Jan 14, 2025

Zhiheng Liu, Ka Leong Cheng, Xi Chen, Jie Xiao, Hao Ouyang, Kai Zhu, Yu Liu, Yujun Shen, Qifeng Chen, Ping Luo

Figure 1 for MangaNinja: Line Art Colorization with Precise Reference Following

Figure 2 for MangaNinja: Line Art Colorization with Precise Reference Following

Figure 3 for MangaNinja: Line Art Colorization with Precise Reference Following

Figure 4 for MangaNinja: Line Art Colorization with Precise Reference Following

Abstract:Derived from diffusion models, MangaNinjia specializes in the task of reference-guided line art colorization. We incorporate two thoughtful designs to ensure precise character detail transcription, including a patch shuffling module to facilitate correspondence learning between the reference color image and the target line art, and a point-driven control scheme to enable fine-grained color matching. Experiments on a self-collected benchmark demonstrate the superiority of our model over current solutions in terms of precise colorization. We further showcase the potential of the proposed interactive point control in handling challenging cases, cross-character colorization, multi-reference harmonization, beyond the reach of existing algorithms.

* Project page and code: https://johanan528.github.io/MangaNinjia/

Via

Access Paper or Ask Questions

Edicho: Consistent Image Editing in the Wild

Dec 30, 2024

Qingyan Bai, Hao Ouyang, Yinghao Xu, Qiuyu Wang, Ceyuan Yang, Ka Leong Cheng, Yujun Shen, Qifeng Chen

Abstract:As a verified need, consistent editing across in-the-wild images remains a technical challenge arising from various unmanageable factors, like object poses, lighting conditions, and photography environments. Edicho steps in with a training-free solution based on diffusion models, featuring a fundamental design principle of using explicit image correspondence to direct editing. Specifically, the key components include an attention manipulation module and a carefully refined classifier-free guidance (CFG) denoising strategy, both of which take into account the pre-estimated correspondence. Such an inference-time algorithm enjoys a plug-and-play nature and is compatible to most diffusion-based editing methods, such as ControlNet and BrushNet. Extensive results demonstrate the efficacy of Edicho in consistent cross-image editing under diverse settings. We will release the code to facilitate future studies.

* Project page: https://github.com/EzioBy/edicho

Via

Access Paper or Ask Questions

ModelGrow: Continual Text-to-Video Pre-training with Model Expansion and Language Understanding Enhancement

Dec 25, 2024

Zhefan Rao, Liya Ji, Yazhou Xing, Runtao Liu, Zhaoyang Liu, Jiaxin Xie, Ziqiao Peng, Yingqing He, Qifeng Chen

Abstract:Text-to-video (T2V) generation has gained significant attention recently. However, the costs of training a T2V model from scratch remain persistently high, and there is considerable room for improving the generation performance, especially under limited computation resources. This work explores the continual general pre-training of text-to-video models, enabling the model to "grow" its abilities based on a pre-trained foundation, analogous to how humans acquire new knowledge based on past experiences. There is a lack of extensive study of the continual pre-training techniques in T2V generation. In this work, we take the initial step toward exploring this task systematically and propose ModelGrow. Specifically, we break this task into two key aspects: increasing model capacity and improving semantic understanding. For model capacity, we introduce several novel techniques to expand the model size, enabling it to store new knowledge and improve generation performance. For semantic understanding, we propose a method that leverages large language models as advanced text encoders, integrating them into T2V models to enhance language comprehension and guide generation results according to detailed prompts. This approach enables the model to achieve better semantic alignment, particularly in response to complex user prompts. Extensive experiments demonstrate the effectiveness of our method across various metrics. The source code and the model of ModelGrow will be publicly available.

* 18 pages

Via

Access Paper or Ask Questions

DepthLab: From Partial to Complete

Dec 24, 2024

Zhiheng Liu, Ka Leong Cheng, Qiuyu Wang, Shuzhe Wang, Hao Ouyang, Bin Tan, Kai Zhu, Yujun Shen, Qifeng Chen, Ping Luo

Figure 1 for DepthLab: From Partial to Complete

Figure 2 for DepthLab: From Partial to Complete

Figure 3 for DepthLab: From Partial to Complete

Figure 4 for DepthLab: From Partial to Complete

Abstract:Missing values remain a common challenge for depth data across its wide range of applications, stemming from various causes like incomplete data acquisition and perspective alteration. This work bridges this gap with DepthLab, a foundation depth inpainting model powered by image diffusion priors. Our model features two notable strengths: (1) it demonstrates resilience to depth-deficient regions, providing reliable completion for both continuous areas and isolated points, and (2) it faithfully preserves scale consistency with the conditioned known depth when filling in missing values. Drawing on these advantages, our approach proves its worth in various downstream tasks, including 3D scene inpainting, text-to-3D scene generation, sparse-view reconstruction with DUST3R, and LiDAR depth completion, exceeding current solutions in both numerical performance and visual quality. Our project page with source code is available at https://johanan528.github.io/depthlab_web/.

* Project page and code: https://johanan528.github.io/depthlab_web/

Via

Access Paper or Ask Questions

Large Motion Video Autoencoding with Cross-modal Video VAE

Dec 23, 2024

Yazhou Xing, Yang Fei, Yingqing He, Jingye Chen, Jiaxin Xie, Xiaowei Chi, Qifeng Chen

Figure 1 for Large Motion Video Autoencoding with Cross-modal Video VAE

Figure 2 for Large Motion Video Autoencoding with Cross-modal Video VAE

Figure 3 for Large Motion Video Autoencoding with Cross-modal Video VAE

Figure 4 for Large Motion Video Autoencoding with Cross-modal Video VAE

Abstract:Learning a robust video Variational Autoencoder (VAE) is essential for reducing video redundancy and facilitating efficient video generation. Directly applying image VAEs to individual frames in isolation can result in temporal inconsistencies and suboptimal compression rates due to a lack of temporal compression. Existing Video VAEs have begun to address temporal compression; however, they often suffer from inadequate reconstruction performance. In this paper, we present a novel and powerful video autoencoder capable of high-fidelity video encoding. First, we observe that entangling spatial and temporal compression by merely extending the image VAE to a 3D VAE can introduce motion blur and detail distortion artifacts. Thus, we propose temporal-aware spatial compression to better encode and decode the spatial information. Additionally, we integrate a lightweight motion compression model for further temporal compression. Second, we propose to leverage the textual information inherent in text-to-video datasets and incorporate text guidance into our model. This significantly enhances reconstruction quality, particularly in terms of detail preservation and temporal stability. Third, we further improve the versatility of our model through joint training on both images and videos, which not only enhances reconstruction quality but also enables the model to perform both image and video autoencoding. Extensive evaluations against strong recent baselines demonstrate the superior performance of our method. The project website can be found at~\href{https://yzxing87.github.io/vae/}{https://yzxing87.github.io/vae/}.

* Project Website: https://yzxing87.github.io/vae/

Via

Access Paper or Ask Questions

LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis

Dec 19, 2024

Hanlin Wang, Hao Ouyang, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Qifeng Chen, Yujun Shen, Limin Wang

Figure 1 for LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis

Figure 2 for LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis

Figure 3 for LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis

Figure 4 for LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis

Abstract:The intuitive nature of drag-based interaction has led to its growing adoption for controlling object trajectories in image-to-video synthesis. Still, existing methods that perform dragging in the 2D space usually face ambiguity when handling out-of-plane movements. In this work, we augment the interaction with a new dimension, i.e., the depth dimension, such that users are allowed to assign a relative depth for each point on the trajectory. That way, our new interaction paradigm not only inherits the convenience from 2D dragging, but facilitates trajectory control in the 3D space, broadening the scope of creativity. We propose a pioneering method for 3D trajectory control in image-to-video synthesis by abstracting object masks into a few cluster points. These points, accompanied by the depth information and the instance information, are finally fed into a video diffusion model as the control signal. Extensive experiments validate the effectiveness of our approach, dubbed LeviTor, in precisely manipulating the object movements when producing photo-realistic videos from static images. Project page: https://ppetrichor.github.io/levitor.github.io/

* Project page available at https://ppetrichor.github.io/levitor.github.io/

Via

Access Paper or Ask Questions

VideoDPO: Omni-Preference Alignment for Video Diffusion Generation

Dec 18, 2024

Runtao Liu, Haoyu Wu, Zheng Ziqiang, Chen Wei, Yingqing He, Renjie Pi, Qifeng Chen

Figure 1 for VideoDPO: Omni-Preference Alignment for Video Diffusion Generation

Figure 2 for VideoDPO: Omni-Preference Alignment for Video Diffusion Generation

Figure 3 for VideoDPO: Omni-Preference Alignment for Video Diffusion Generation

Figure 4 for VideoDPO: Omni-Preference Alignment for Video Diffusion Generation

Abstract:Recent progress in generative diffusion models has greatly advanced text-to-video generation. While text-to-video models trained on large-scale, diverse datasets can produce varied outputs, these generations often deviate from user preferences, highlighting the need for preference alignment on pre-trained models. Although Direct Preference Optimization (DPO) has demonstrated significant improvements in language and image generation, we pioneer its adaptation to video diffusion models and propose a VideoDPO pipeline by making several key adjustments. Unlike previous image alignment methods that focus solely on either (i) visual quality or (ii) semantic alignment between text and videos, we comprehensively consider both dimensions and construct a preference score accordingly, which we term the OmniScore. We design a pipeline to automatically collect preference pair data based on the proposed OmniScore and discover that re-weighting these pairs based on the score significantly impacts overall preference alignment. Our experiments demonstrate substantial improvements in both visual quality and semantic alignment, ensuring that no preference aspect is neglected. Code and data will be shared at https://videodpo.github.io/.

Via

Access Paper or Ask Questions

SafetyDPO: Scalable Safety Alignment for Text-to-Image Generation

Dec 13, 2024

Runtao Liu, Chen I Chieh, Jindong Gu, Jipeng Zhang, Renjie Pi, Qifeng Chen, Philip Torr, Ashkan Khakzar, Fabio Pizzati

Figure 1 for SafetyDPO: Scalable Safety Alignment for Text-to-Image Generation

Figure 2 for SafetyDPO: Scalable Safety Alignment for Text-to-Image Generation

Figure 3 for SafetyDPO: Scalable Safety Alignment for Text-to-Image Generation

Figure 4 for SafetyDPO: Scalable Safety Alignment for Text-to-Image Generation

Abstract:Text-to-image (T2I) models have become widespread, but their limited safety guardrails expose end users to harmful content and potentially allow for model misuse. Current safety measures are typically limited to text-based filtering or concept removal strategies, able to remove just a few concepts from the model's generative capabilities. In this work, we introduce SafetyDPO, a method for safety alignment of T2I models through Direct Preference Optimization (DPO). We enable the application of DPO for safety purposes in T2I models by synthetically generating a dataset of harmful and safe image-text pairs, which we call CoProV2. Using a custom DPO strategy and this dataset, we train safety experts, in the form of low-rank adaptation (LoRA) matrices, able to guide the generation process away from specific safety-related concepts. Then, we merge the experts into a single LoRA using a novel merging strategy for optimal scaling performance. This expert-based approach enables scalability, allowing us to remove 7 times more harmful concepts from T2I models compared to baselines. SafetyDPO consistently outperforms the state-of-the-art on many benchmarks and establishes new practices for safety alignment in T2I networks. Code and data will be shared at https://safetydpo.github.io/.

Via

Access Paper or Ask Questions