Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Xiaochao Qu

Learning Stochastic Bridges for Video Object Removal via Video-to-Video Translation

Jan 17, 2026

Zijie Lou, Xiangwei Feng, Jiaxin Wang, Xiaochao Qu, Luoqi Liu, Ting Liu

Abstract:Existing video object removal methods predominantly rely on diffusion models following a noise-to-data paradigm, where generation starts from uninformative Gaussian noise. This approach discards the rich structural and contextual priors present in the original input video. Consequently, such methods often lack sufficient guidance, leading to incomplete object erasure or the synthesis of implausible content that conflicts with the scene's physical logic. In this paper, we reformulate video object removal as a video-to-video translation task via a stochastic bridge model. Unlike noise-initialized methods, our framework establishes a direct stochastic path from the source video (with objects) to the target video (objects removed). This bridge formulation effectively leverages the input video as a strong structural prior, guiding the model to perform precise removal while ensuring that the filled regions are logically consistent with the surrounding environment. To address the trade-off where strong bridge priors hinder the removal of large objects, we propose a novel adaptive mask modulation strategy. This mechanism dynamically modulates input embeddings based on mask characteristics, balancing background fidelity with generative flexibility. Extensive experiments demonstrate that our approach significantly outperforms existing methods in both visual quality and temporal consistency.

Via

Access Paper or Ask Questions

On Exact Editing of Flow-Based Diffusion Models

Dec 30, 2025

Zixiang Li, Yue Song, Jianing Peng, Ting Liu, Jun Huang, Xiaochao Qu, Luoqi Liu, Wei Wang, Yao Zhao, Yunchao Wei

Abstract:Recent methods in flow-based diffusion editing have enabled direct transformations between source and target image distribution without explicit inversion. However, the latent trajectories in these methods often exhibit accumulated velocity errors, leading to semantic inconsistency and loss of structural fidelity. We propose Conditioned Velocity Correction (CVC), a principled framework that reformulates flow-based editing as a distribution transformation problem driven by a known source prior. CVC rethinks the role of velocity in inter-distribution transformation by introducing a dual-perspective velocity conversion mechanism. This mechanism explicitly decomposes the latent evolution into two components: a structure-preserving branch that remains consistent with the source trajectory, and a semantically-guided branch that drives a controlled deviation toward the target distribution. The conditional velocity field exhibits an absolute velocity error relative to the true underlying distribution trajectory, which inherently introduces potential instability and trajectory drift in the latent space. To address this quantifiable deviation and maintain fidelity to the true flow, we apply a posterior-consistent update to the resulting conditional velocity field. This update is derived from Empirical Bayes Inference and Tweedie correction, which ensures a mathematically grounded error compensation over time. Our method yields stable and interpretable latent dynamics, achieving faithful reconstruction alongside smooth local semantic conversion. Comprehensive experiments demonstrate that CVC consistently achieves superior fidelity, better semantic alignment, and more reliable editing behavior across diverse tasks.

Via

Access Paper or Ask Questions

AlignGen: Boosting Personalized Image Generation with Cross-Modality Prior Alignment

May 28, 2025

Yiheng Lin, Shifang Zhao, Ting Liu, Xiaochao Qu, Luoqi Liu, Yao Zhao, Yunchao Wei

Figure 1 for AlignGen: Boosting Personalized Image Generation with Cross-Modality Prior Alignment

Figure 2 for AlignGen: Boosting Personalized Image Generation with Cross-Modality Prior Alignment

Figure 3 for AlignGen: Boosting Personalized Image Generation with Cross-Modality Prior Alignment

Figure 4 for AlignGen: Boosting Personalized Image Generation with Cross-Modality Prior Alignment

Abstract:Personalized image generation aims to integrate user-provided concepts into text-to-image models, enabling the generation of customized content based on a given prompt. Recent zero-shot approaches, particularly those leveraging diffusion transformers, incorporate reference image information through multi-modal attention mechanism. This integration allows the generated output to be influenced by both the textual prior from the prompt and the visual prior from the reference image. However, we observe that when the prompt and reference image are misaligned, the generated results exhibit a stronger bias toward the textual prior, leading to a significant loss of reference content. To address this issue, we propose AlignGen, a Cross-Modality Prior Alignment mechanism that enhances personalized image generation by: 1) introducing a learnable token to bridge the gap between the textual and visual priors, 2) incorporating a robust training strategy to ensure proper prior alignment, and 3) employing a selective cross-modal attention mask within the multi-modal attention mechanism to further align the priors. Experimental results demonstrate that AlignGen outperforms existing zero-shot methods and even surpasses popular test-time optimization approaches.

Via

Access Paper or Ask Questions

GlyphMastero: A Glyph Encoder for High-Fidelity Scene Text Editing

May 08, 2025

Tong Wang, Ting Liu, Xiaochao Qu, Chengjing Wu, Luoqi Liu, Xiaolin Hu

Figure 1 for GlyphMastero: A Glyph Encoder for High-Fidelity Scene Text Editing

Figure 2 for GlyphMastero: A Glyph Encoder for High-Fidelity Scene Text Editing

Figure 3 for GlyphMastero: A Glyph Encoder for High-Fidelity Scene Text Editing

Figure 4 for GlyphMastero: A Glyph Encoder for High-Fidelity Scene Text Editing

Abstract:Scene text editing, a subfield of image editing, requires modifying texts in images while preserving style consistency and visual coherence with the surrounding environment. While diffusion-based methods have shown promise in text generation, they still struggle to produce high-quality results. These methods often generate distorted or unrecognizable characters, particularly when dealing with complex characters like Chinese. In such systems, characters are composed of intricate stroke patterns and spatial relationships that must be precisely maintained. We present GlyphMastero, a specialized glyph encoder designed to guide the latent diffusion model for generating texts with stroke-level precision. Our key insight is that existing methods, despite using pretrained OCR models for feature extraction, fail to capture the hierarchical nature of text structures - from individual strokes to stroke-level interactions to overall character-level structure. To address this, our glyph encoder explicitly models and captures the cross-level interactions between local-level individual characters and global-level text lines through our novel glyph attention module. Meanwhile, our model implements a feature pyramid network to fuse the multi-scale OCR backbone features at the global-level. Through these cross-level and multi-scale fusions, we obtain more detailed glyph-aware guidance, enabling precise control over the scene text generation process. Our method achieves an 18.02\% improvement in sentence accuracy over the state-of-the-art multi-lingual scene text editing baseline, while simultaneously reducing the text-region Fr\'echet inception distance by 53.28\%.

* CVPR 2025

Via

Access Paper or Ask Questions

DCEdit: Dual-Level Controlled Image Editing via Precisely Localized Semantics

Mar 21, 2025

Yihan Hu, Jianing Peng, Yiheng Lin, Ting Liu, Xiaochao Qu, Luoqi Liu, Yao Zhao, Yunchao Wei

Abstract:This paper presents a novel approach to improving text-guided image editing using diffusion-based models. Text-guided image editing task poses key challenge of precisly locate and edit the target semantic, and previous methods fall shorts in this aspect. Our method introduces a Precise Semantic Localization strategy that leverages visual and textual self-attention to enhance the cross-attention map, which can serve as a regional cues to improve editing performance. Then we propose a Dual-Level Control mechanism for incorporating regional cues at both feature and latent levels, offering fine-grained control for more precise edits. To fully compare our methods with other DiT-based approaches, we construct the RW-800 benchmark, featuring high resolution images, long descriptive texts, real-world images, and a new text editing task. Experimental results on the popular PIE-Bench and RW-800 benchmarks demonstrate the superior performance of our approach in preserving background and providing accurate edits.

Via

Access Paper or Ask Questions

Memory Efficient Matting with Adaptive Token Routing

Dec 17, 2024

Yiheng Lin, Yihan Hu, Chenyi Zhang, Ting Liu, Xiaochao Qu, Luoqi Liu, Yao Zhao, Yunchao Wei

Figure 1 for Memory Efficient Matting with Adaptive Token Routing

Figure 2 for Memory Efficient Matting with Adaptive Token Routing

Figure 3 for Memory Efficient Matting with Adaptive Token Routing

Figure 4 for Memory Efficient Matting with Adaptive Token Routing

Abstract:Transformer-based models have recently achieved outstanding performance in image matting. However, their application to high-resolution images remains challenging due to the quadratic complexity of global self-attention. To address this issue, we propose MEMatte, a \textbf{m}emory-\textbf{e}fficient \textbf{m}atting framework for processing high-resolution images. MEMatte incorporates a router before each global attention block, directing informative tokens to the global attention while routing other tokens to a Lightweight Token Refinement Module (LTRM). Specifically, the router employs a local-global strategy to predict the routing probability of each token, and the LTRM utilizes efficient modules to simulate global attention. Additionally, we introduce a Batch-constrained Adaptive Token Routing (BATR) mechanism, which allows each router to dynamically route tokens based on image content and the stages of attention block in the network. Furthermore, we construct an ultra high-resolution image matting dataset, UHR-395, comprising 35,500 training images and 1,000 test images, with an average resolution of $4872\times6017$. This dataset is created by compositing 395 different alpha mattes across 11 categories onto various backgrounds, all with high-quality manual annotation. Extensive experiments demonstrate that MEMatte outperforms existing methods on both high-resolution and real-world datasets, significantly reducing memory usage by approximately 88% and latency by 50% on the Composition-1K benchmark. Our code is available at https://github.com/linyiheng123/MEMatte.

Via

Access Paper or Ask Questions

Draw Like an Artist: Complex Scene Generation with Diffusion Model via Composition, Painting, and Retouching

Aug 25, 2024

Minghao Liu, Le Zhang, Yingjie Tian, Xiaochao Qu, Luoqi Liu, Ting Liu

Figure 1 for Draw Like an Artist: Complex Scene Generation with Diffusion Model via Composition, Painting, and Retouching

Figure 2 for Draw Like an Artist: Complex Scene Generation with Diffusion Model via Composition, Painting, and Retouching

Figure 3 for Draw Like an Artist: Complex Scene Generation with Diffusion Model via Composition, Painting, and Retouching

Figure 4 for Draw Like an Artist: Complex Scene Generation with Diffusion Model via Composition, Painting, and Retouching

Abstract:Recent advances in text-to-image diffusion models have demonstrated impressive capabilities in image quality. However, complex scene generation remains relatively unexplored, and even the definition of `complex scene' itself remains unclear. In this paper, we address this gap by providing a precise definition of complex scenes and introducing a set of Complex Decomposition Criteria (CDC) based on this definition. Inspired by the artists painting process, we propose a training-free diffusion framework called Complex Diffusion (CxD), which divides the process into three stages: composition, painting, and retouching. Our method leverages the powerful chain-of-thought capabilities of large language models (LLMs) to decompose complex prompts based on CDC and to manage composition and layout. We then develop an attention modulation method that guides simple prompts to specific regions to complete the complex scene painting. Finally, we inject the detailed output of the LLM into a retouching model to enhance the image details, thus implementing the retouching stage. Extensive experiments demonstrate that our method outperforms previous SOTA approaches, significantly improving the generation of high-quality, semantically consistent, and visually diverse images for complex scenes, even with intricate prompts.

Via

Access Paper or Ask Questions

SAM-REF: Rethinking Image-Prompt Synergy for Refinement in Segment Anything

Aug 22, 2024

Chongkai Yu, Anqi Li, Xiaochao Qu, Luoqi Liu, Ting Liu

Figure 1 for SAM-REF: Rethinking Image-Prompt Synergy for Refinement in Segment Anything

Figure 2 for SAM-REF: Rethinking Image-Prompt Synergy for Refinement in Segment Anything

Figure 3 for SAM-REF: Rethinking Image-Prompt Synergy for Refinement in Segment Anything

Figure 4 for SAM-REF: Rethinking Image-Prompt Synergy for Refinement in Segment Anything

Abstract:The advent of the Segment Anything Model (SAM) marks a significant milestone for interactive segmentation using generalist models. As a late fusion model, SAM extracts image embeddings once and merges them with prompts in later interactions. This strategy limits the models ability to extract detailed information from the prompted target zone. Current specialist models utilize the early fusion strategy that encodes the combination of images and prompts to target the prompted objects, yet repetitive complex computations on the images result in high latency. The key to these issues is efficiently synergizing the images and prompts. We propose SAM-REF, a two-stage refinement framework that fully integrates images and prompts globally and locally while maintaining the accuracy of early fusion and the efficiency of late fusion. The first-stage GlobalDiff Refiner is a lightweight early fusion network that combines the whole image and prompts, focusing on capturing detailed information for the entire object. The second-stage PatchDiff Refiner locates the object detail window according to the mask and prompts, then refines the local details of the object. Experimentally, we demonstrated the high effectiveness and efficiency of our method in tackling complex cases with multiple interactions. Our SAM-REF model outperforms the current state-of-the-art method in most metrics on segmentation quality without compromising efficiency.

Via

Access Paper or Ask Questions

TextMastero: Mastering High-Quality Scene Text Editing in Diverse Languages and Styles

Aug 20, 2024

Tong Wang, Xiaochao Qu, Ting Liu

Abstract:Scene text editing aims to modify texts on images while maintaining the style of newly generated text similar to the original. Given an image, a target area, and target text, the task produces an output image with the target text in the selected area, replacing the original. This task has been studied extensively, with initial success using Generative Adversarial Networks (GANs) to balance text fidelity and style similarity. However, GAN-based methods struggled with complex backgrounds or text styles. Recent works leverage diffusion models, showing improved results, yet still face challenges, especially with non-Latin languages like CJK characters (Chinese, Japanese, Korean) that have complex glyphs, often producing inaccurate or unrecognizable characters. To address these issues, we present \emph{TextMastero} - a carefully designed multilingual scene text editing architecture based on latent diffusion models (LDMs). TextMastero introduces two key modules: a glyph conditioning module for fine-grained content control in generating accurate texts, and a latent guidance module for providing comprehensive style information to ensure similarity before and after editing. Both qualitative and quantitative experiments demonstrate that our method surpasses all known existing works in text fidelity and style similarity.

Via

Access Paper or Ask Questions

Rethinking Video Segmentation with Masked Video Consistency: Did the Model Learn as Intended?

Aug 20, 2024

Chen Liang, Qiang Guo, Xiaochao Qu, Luoqi Liu, Ting Liu

Abstract:Video segmentation aims at partitioning video sequences into meaningful segments based on objects or regions of interest within frames. Current video segmentation models are often derived from image segmentation techniques, which struggle to cope with small-scale or class-imbalanced video datasets. This leads to inconsistent segmentation results across frames. To address these issues, we propose a training strategy Masked Video Consistency, which enhances spatial and temporal feature aggregation. MVC introduces a training strategy that randomly masks image patches, compelling the network to predict the entire semantic segmentation, thus improving contextual information integration. Additionally, we introduce Object Masked Attention (OMA) to optimize the cross-attention mechanism by reducing the impact of irrelevant queries, thereby enhancing temporal modeling capabilities. Our approach, integrated into the latest decoupled universal video segmentation framework, achieves state-of-the-art performance across five datasets for three video segmentation tasks, demonstrating significant improvements over previous methods without increasing model parameters.

Via

Access Paper or Ask Questions