Yunlong Yu

CoAR: Concept Injection into Autoregressive Models for Personalized Text-to-Image Generation

Aug 10, 2025

Fangtai Wu, Mushui Liu, Weijie He, Wanggui He, Hao Jiang, Zhao Wang, Yunlong Yu

Abstract: The unified autoregressive (AR) model excels at multimodal understanding and generation, but its potential for customized image generation remains underexplored. Existing customized generation methods rely on full fine-tuning or adapters, making them costly and prone to overfitting or catastrophic forgetting. In this paper, we propose CoAR, a novel framework for injecting subject concepts into unified AR models while keeping all pre-trained parameters completely frozen. CoAR learns effective, specific subject representations with only a minimal number of parameters using a Layerwise Multimodal Context Learning strategy. To address overfitting and language drift, we further introduce regularization that preserves the pre-trained distribution and anchors context tokens to improve subject fidelity and re-contextualization. Additionally, CoAR supports training-free subject customization in a user-provided style. Experiments demonstrate that CoAR achieves superior performance on both subject-driven personalization and style personalization, while delivering significant gains in computational and memory efficiency. Notably, CoAR tunes less than 0.05% of the parameters while achieving competitive performance compared to recent Proxy-Tuning. Code: https://github.com/KZF-kzf/CoAR

Gradients of unitary optical neural networks using parameter-shift rule

Jun 13, 2025

Jinzhe Jiang, Yaqian Zhao, Xin Zhang, Chen Li, Yunlong Yu, Hailing Liu

Abstract: This paper explores the application of the parameter-shift rule (PSR) for computing gradients in unitary optical neural networks (UONNs). While backpropagation has been fundamental to training conventional neural networks, its implementation in optical neural networks faces significant challenges due to the physical constraints of optical systems. We demonstrate how PSR, which calculates gradients by evaluating functions at shifted parameter values, can be effectively adapted for training UONNs constructed from Mach-Zehnder interferometer meshes. The method leverages the inherent Fourier series nature of optical interference in these systems to compute exact analytical gradients directly from hardware measurements. This approach offers a promising alternative to traditional in silico training methods and circumvents the limitations of both finite difference approximations and all-optical backpropagation implementations. We present the theoretical framework and practical methodology for applying PSR to optimize phase parameters in optical neural networks, potentially advancing the development of efficient hardware-based training strategies for optical computing systems.

* 8 pages, 3 figures
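
The parameter-shift idea summarized in this abstract can be illustrated on a toy case. The sketch below is not the paper's implementation: it assumes a single idealized, lossless Mach-Zehnder interferometer whose output power is cos²(θ/2), and the function names `mzi_intensity` and `parameter_shift_grad` are hypothetical. Because that response contains a single Fourier frequency in θ, two shifted evaluations give the exact derivative.

```python
import math


def mzi_intensity(theta: float) -> float:
    """Assumed toy model: normalized output power of one ideal MZI, cos^2(theta/2)."""
    return math.cos(theta / 2.0) ** 2


def parameter_shift_grad(f, theta: float, shift: float = math.pi / 2) -> float:
    """Exact derivative of a single-frequency function a + b*cos(theta) + c*sin(theta).

    Uses two evaluations at theta + shift and theta - shift. The 2*sin(shift)
    denominator makes the result exact for any shift that is not a multiple of pi.
    """
    return (f(theta + shift) - f(theta - shift)) / (2.0 * math.sin(shift))


if __name__ == "__main__":
    theta = 0.7
    psr = parameter_shift_grad(mzi_intensity, theta)
    analytic = -0.5 * math.sin(theta)  # d/dtheta of (1 + cos(theta)) / 2
    print(f"PSR gradient:      {psr:.12f}")
    print(f"Analytic gradient: {analytic:.12f}")
```

Unlike a finite-difference estimate, the shift here is large (π/2), so on real hardware the two measurements would differ by a signal well above the noise floor; that robustness is the usual motivation for the rule.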

DyST-XL: Dynamic Layout Planning and Content Control for Compositional Text-to-Video Generation

Apr 21, 2025

Weijie He, Mushui Liu, Yunlong Yu, Zhao Wang, Chao Wu

Abstract: Compositional text-to-video generation, which requires synthesizing dynamic scenes with multiple interacting entities and precise spatial-temporal relationships, remains a critical challenge for diffusion-based models. Existing methods struggle with layout discontinuity, entity identity drift, and implausible interaction dynamics due to unconstrained cross-attention mechanisms and inadequate physics-aware reasoning. To address these limitations, we propose DyST-XL, a training-free framework that enhances off-the-shelf text-to-video models (e.g., CogVideoX-5B) through frame-aware control. DyST-XL integrates three key innovations: (1) A Dynamic Layout Planner that leverages large language models (LLMs) to parse input prompts into entity-attribute graphs and generates physics-aware keyframe layouts, with intermediate frames interpolated via trajectory optimization; (2) A Dual-Prompt Controlled Attention Mechanism that enforces localized text-video alignment through frame-aware attention masking, achieving precise control over individual entities; and (3) An Entity-Consistency Constraint strategy that propagates first-frame feature embeddings to subsequent frames during denoising, preserving object identity without manual annotation. Experiments demonstrate that DyST-XL excels in compositional text-to-video generation, significantly improving performance on complex prompts and bridging a crucial gap in training-free video synthesis.

* 9 pages, 6 figures
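
To make the keyframe-to-intermediate-frame step concrete, here is a minimal sketch that fills in per-frame bounding boxes between keyframe layouts. It deliberately swaps in plain linear interpolation rather than the trajectory optimization the abstract mentions, whose details are not given here. The box format `(x0, y0, x1, y1)` and the name `interpolate_layouts` are assumptions for illustration only.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x0, y0, x1, y1)


def interpolate_layouts(keyframes: Dict[int, Box], num_frames: int) -> List[Box]:
    """Linearly interpolate one entity's box between keyframe layouts.

    `keyframes` maps frame index -> box and must cover the first and last frame.
    This is a stand-in for trajectory optimization, not a reproduction of it.
    """
    indices = sorted(keyframes)
    if indices[0] != 0 or indices[-1] != num_frames - 1:
        raise ValueError("keyframes must include frame 0 and the last frame")

    boxes: List[Box] = []
    for frame in range(num_frames):
        prev_idx = max(i for i in indices if i <= frame)
        next_idx = min(i for i in indices if i >= frame)
        if prev_idx == next_idx:  # the frame is itself a keyframe
            boxes.append(keyframes[prev_idx])
            continue
        t = (frame - prev_idx) / (next_idx - prev_idx)
        start, end = keyframes[prev_idx], keyframes[next_idx]
        boxes.append(tuple(p + t * (q - p) for p, q in zip(start, end)))
    return boxes


if __name__ == "__main__":
    layout = interpolate_layouts(
        {0: (0.0, 0.0, 10.0, 10.0), 4: (8.0, 0.0, 18.0, 10.0)}, num_frames=5
    )
    for frame, box in enumerate(layout):
        print(frame, box)
```

Linear interpolation yields constant-velocity motion between keyframes; an optimized trajectory could additionally enforce smooth acceleration or collision constraints.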

CustomVideoX: 3D Reference Attention Driven Dynamic Adaptation for Zero-Shot Customized Video Diffusion Transformers

Feb 10, 2025

D. She, Mushui Liu, Jingxuan Pang, Jin Wang, Zhen Yang, Wanggui He, Guanghao Zhang, Yi Wang, Qihan Huang, Haobin Tang (+2 more)

Abstract: Customized generation has achieved significant progress in image synthesis, yet personalized video generation remains challenging due to temporal inconsistencies and quality degradation. In this paper, we introduce CustomVideoX, an innovative framework leveraging the video diffusion transformer for personalized video generation from a reference image. CustomVideoX capitalizes on pre-trained video networks by exclusively training the LoRA parameters to extract reference features, ensuring both efficiency and adaptability. To facilitate seamless interaction between the reference image and video content, we propose 3D Reference Attention, which enables direct and simultaneous engagement of reference image features with all video frames across spatial and temporal dimensions. To mitigate the excessive influence of reference image features and textual guidance on generated video content during inference, we implement the Time-Aware Reference Attention Bias (TAB) strategy, dynamically modulating reference bias over different time steps. Additionally, we introduce the Entity Region-Aware Enhancement (ERAE) module, aligning highly activated regions of key entity tokens with reference feature injection by adjusting attention bias. To thoroughly evaluate personalized video generation, we establish a new benchmark, VideoBench, comprising over 50 objects and 100 prompts for extensive assessment. Experimental results show that CustomVideoX significantly outperforms existing methods in terms of video consistency and quality.

* 13 pages, 10 figures

CA-MoE: Channel-Adapted MoE for Incremental Weather Forecasting

Dec 03, 2024

Hao Chen, Han Tao, Guo Song, Jie Zhang, Yunlong Yu, Yonghan Dong, Chuang Yang, Lei Bai

Abstract: Atmospheric science is intricately connected with other fields, e.g., geography and aerospace. Most existing approaches involve training a joint atmospheric and geographic model from scratch, which incurs significant computational costs and overlooks the potential for incremental learning of weather variables across different domains. In this paper, we introduce incremental learning to weather forecasting and propose a novel structure that allows for the flexible expansion of variables within the model. Specifically, our method presents a Channel-Adapted MoE (CA-MoE) that employs a divide-and-conquer strategy. This strategy assigns variable training tasks to different experts by index embedding and reduces computational complexity through a channel-wise Top-K strategy. Experiments conducted on the widely utilized ERA5 dataset reveal that our method, utilizing only approximately 15% of trainable parameters during the incremental stage, attains performance that is on par with state-of-the-art competitors. Notably, in the context of variable incremental experiments, our method demonstrates negligible issues with catastrophic forgetting.
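
The channel-wise Top-K idea can be sketched with generic sparse gating: for each channel (weather variable), only the k highest-scoring experts receive non-zero weight, so the remaining experts need not be evaluated. This is a generic top-k softmax gate under assumed inputs, not the CA-MoE routing itself; how the paper derives scores from index embeddings is not described in the abstract, and the names `top_k_gate` and `route_channels` are hypothetical.

```python
import math
from typing import List


def top_k_gate(scores: List[float], k: int) -> List[float]:
    """Keep the k largest expert scores and softmax over them; others get weight 0."""
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and the number of experts")
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(scores[i] - peak) for i in top}
    total = sum(exps.values())
    weights = [0.0] * len(scores)
    for i, value in exps.items():
        weights[i] = value / total
    return weights


def route_channels(channel_scores: List[List[float]], k: int) -> List[List[float]]:
    """Apply the top-k gate independently to each channel's expert scores."""
    return [top_k_gate(scores, k) for scores in channel_scores]


if __name__ == "__main__":
    # Two channels, four experts each; the scores are made-up illustration values.
    routing = route_channels([[1.0, 3.0, 2.0, 0.0], [0.5, 0.1, 0.2, 2.0]], k=2)
    for channel, weights in enumerate(routing):
        print(channel, [round(w, 3) for w in weights])
```

With k experts active out of E, the expert computation per channel scales with k rather than E, which is the source of the reduced complexity the abstract refers to.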

RestorerID: Towards Tuning-Free Face Restoration with ID Preservation

Nov 21, 2024

Jiacheng Ying, Mushui Liu, Zhe Wu, Runming Zhang, Zhu Yu, Siming Fu, Si-Yuan Cao, Chao Wu, Yunlong Yu, Hui-Liang Shen

Abstract: Blind face restoration has made great progress in producing high-quality and lifelike images. Yet it remains challenging to preserve the ID information, especially when the degradation is heavy. Current reference-guided face restoration approaches either require face alignment or personalized test-tuning, which are unfaithful or time-consuming. In this paper, we propose a tuning-free method named RestorerID that incorporates ID preservation during face restoration. RestorerID is a diffusion model-based method that restores low-quality images with varying levels of degradation by using a single reference image. To achieve this, we propose a unified framework to combine the ID injection with the base blind face restoration model. In addition, we design a novel Face ID Rebalancing Adapter (FIR-Adapter) to tackle the problems of content inconsistency and contour misalignment that are caused by information conflicts between the low-quality input and reference image. Furthermore, by employing an Adaptive ID-Scale Adjusting strategy, RestorerID can produce superior restored images across various levels of degradation. Experimental results on the Celeb-Ref dataset and real-world scenarios demonstrate that RestorerID effectively delivers high-quality face restoration with ID preservation, achieving superior performance compared to the test-tuning approaches and other reference-guided ones. The code of RestorerID is available at https://github.com/YingJiacheng/RestorerID.

* 10 pages, 10 figures

Hybrid Mask Generation for Infrared Small Target Detection with Single-Point Supervision

Sep 06, 2024

Weijie He, Mushui Liu, Yunlong Yu, Zheming Lu, Xi Li

Abstract: Single-frame infrared small target (SIRST) detection poses a significant challenge due to the requirement to discern minute targets amidst complex infrared background clutter. Recently, deep learning approaches have shown promising results in this domain. However, these methods heavily rely on extensive manual annotations, which are particularly cumbersome and resource-intensive for infrared small targets owing to their minute sizes. To address this limitation, we introduce a Hybrid Mask Generation (HMG) approach that recovers high-quality masks for each target from only a single-point label for network training. Specifically, our HMG approach consists of a handcrafted Points-to-Mask Generation strategy coupled with a pseudo mask updating strategy to recover and refine pseudo masks from point labels. The Points-to-Mask Generation strategy is divided into two distinct stages: Points-to-Box conversion, where individual point labels are transformed into bounding boxes, and subsequently, Box-to-Mask prediction, where these bounding boxes are elaborated into precise masks. The mask updating strategy integrates the complementary strengths of handcrafted and deep-learning algorithms to iteratively refine the initial pseudo masks. Experimental results across three datasets demonstrate that our method outperforms the existing methods for infrared small target detection with single-point supervision.

* 9 pages, 5 figures

Envisioning Class Entity Reasoning by Large Language Models for Few-shot Learning

Aug 22, 2024

Mushui Liu, Fangtai Wu, Bozheng Li, Ziqian Lu, Yunlong Yu, Xi Li

Abstract: Few-shot learning (FSL) aims to recognize new concepts using a limited number of visual samples. Existing approaches attempt to incorporate semantic information into the limited visual data for category understanding. However, these methods often enrich class-level feature representations with abstract category names, failing to capture the nuanced features essential for effective generalization. To address this issue, we propose a novel framework for FSL, which incorporates both the abstract class semantics and the concrete class entities extracted from Large Language Models (LLMs), to enhance the representation of the class prototypes. Specifically, our framework comprises a Semantic-guided Visual Pattern Extraction (SVPE) module and a Prototype-Calibration (PC) module, where the SVPE module extracts semantic-aware visual patterns across diverse scales, while the PC module integrates these patterns to refine the visual prototype, enhancing its representativeness. Extensive experiments on four few-shot classification benchmarks and the BSCD-FSL cross-domain benchmarks showcase remarkable advancements over the current state-of-the-art methods. Notably, for the challenging one-shot setting, our approach, utilizing the ResNet-12 backbone, achieves an impressive average improvement of 1.95% over the second-best competitor.

* 9 pages, 7 figures

Frame Order Matters: A Temporal Sequence-Aware Model for Few-Shot Action Recognition

Aug 22, 2024

Bozheng Li, Mushui Liu, Gaoang Wang, Yunlong Yu

Abstract: In this paper, we propose a novel Temporal Sequence-Aware Model (TSAM) for few-shot action recognition (FSAR), which incorporates a sequential perceiver adapter into the pre-training framework to integrate both the spatial information and the sequential temporal dynamics into the feature embeddings. Different from the existing fine-tuning approaches that capture temporal information by exploring the relationships among all the frames, our perceiver-based adapter recurrently captures the sequential dynamics along the timeline and can therefore perceive changes in frame order. To obtain discriminative representations for each class, we extend a textual corpus for each class derived from large language models (LLMs) and enrich the visual prototypes by integrating the contextual semantic information. In addition, we introduce an unbalanced optimal transport strategy for feature matching that mitigates the impact of class-unrelated features, thereby facilitating more effective decision-making. Experimental results on five FSAR datasets demonstrate that our method sets a new benchmark, beating the second-best competitors by large margins.

* 9 pages, 6 figures

OmniCLIP: Adapting CLIP for Video Recognition with Spatial-Temporal Omni-Scale Feature Learning

Aug 12, 2024

Mushui Liu, Bozheng Li, Yunlong Yu

Abstract: Recent Vision-Language Models (VLMs), e.g., CLIP, have made great progress in video recognition. Despite the improvement brought by the strong visual backbone in extracting spatial features, CLIP still falls short in capturing and integrating spatial-temporal features, which are essential for video recognition. In this paper, we propose OmniCLIP, a framework that adapts CLIP for video recognition by focusing on learning comprehensive features encompassing spatial, temporal, and dynamic spatial-temporal scales, which we refer to as omni-scale features. This is achieved through the design of spatial-temporal blocks that include parallel temporal adapters (PTA), enabling efficient temporal modeling. Additionally, we introduce a self-prompt generator (SPG) module to capture dynamic object spatial features. The synergy between PTA and SPG allows OmniCLIP to discern varying spatial information across frames and assess object scales over time. We have conducted extensive experiments in supervised video recognition, few-shot video recognition, and zero-shot recognition tasks. The results demonstrate the effectiveness of our method, especially with OmniCLIP achieving a top-1 accuracy of 74.30% on HMDB51 in a 16-shot setting, surpassing the recent MotionPrompt approach even with full training data. The code is available at https://github.com/XiaoBuL/OmniCLIP.

* ECAI-2024
