Chao Dong

TurboFill: Adapting Few-step Text-to-image Model for Fast Image Inpainting

Apr 01, 2025

Liangbin Xie, Daniil Pakhomov, Zhonghao Wang, Zongze Wu, Ziyan Chen, Yuqian Zhou, Haitian Zheng, Zhifei Zhang, Zhe Lin, Jiantao Zhou(+1 more)

Abstract:This paper introduces TurboFill, a fast image inpainting model that enhances a few-step text-to-image diffusion model with an inpainting adapter for high-quality and efficient inpainting. While standard diffusion models generate high-quality results, they incur high computational costs. We overcome this by training an inpainting adapter on a few-step distilled text-to-image model, DMD2, using a novel 3-step adversarial training scheme to ensure realistic, structurally consistent, and visually harmonious inpainted regions. To evaluate TurboFill, we propose two benchmarks: DilationBench, which tests performance across mask sizes, and HumanBench, based on human feedback for complex prompts. Experiments show that TurboFill outperforms both multi-step BrushNet and few-step inpainting methods, setting a new benchmark for high-performance inpainting tasks. Our project page: https://liangbinxie.github.io/projects/TurboFill/

* Project webpage available at https://liangbinxie.github.io/projects/TurboFill/

UniCon: Unidirectional Information Flow for Effective Control of Large-Scale Diffusion Models

Mar 21, 2025

Fanghua Yu, Jinjin Gu, Jinfan Hu, Zheyuan Li, Chao Dong

Abstract:We introduce UniCon, a novel architecture designed to enhance control and efficiency in training adapters for large-scale diffusion models. Unlike existing methods that rely on bidirectional interaction between the diffusion model and control adapter, UniCon implements a unidirectional flow from the diffusion network to the adapter, allowing the adapter alone to generate the final output. UniCon reduces computational demands by eliminating the need for the diffusion model to compute and store gradients during adapter training. Our results indicate that UniCon reduces GPU memory usage by one-third and increases training speed by 2.3 times, while maintaining the same adapter parameter size. Additionally, without requiring extra computational resources, UniCon enables the training of adapters with double the parameter volume of existing ControlNets. In a series of image conditional generation tasks, UniCon has demonstrated precise responsiveness to control inputs and exceptional generation capabilities.

* This work has been accepted for publication at the International Conference on Learning Representations (ICLR) 2025

Joint ADS-B in B5G for Hierarchical UAV Networks: Performance Analysis and MEC Based Optimization

Mar 18, 2025

Chao Dong, Yiyang Liao, Ziye Jia, Qihui Wu, Lei Zhang

Abstract:Unmanned aerial vehicles (UAVs) play significant roles in multiple fields, which poses major challenges for airspace safety. To achieve efficient surveillance and overcome the limited application scenarios of a single communication mode, we propose a collaborative surveillance model for hierarchical UAVs based on the cooperation of automatic dependent surveillance-broadcast (ADS-B) and 5G. Specifically, UAVs are hierarchically deployed, with the low-altitude central UAV equipped with the 5G module, and the high-altitude central UAV with ADS-B, which helps automatically broadcast the flight information to surrounding aircraft and ground stations. First, we build the framework, derive the analytic expression, and analyze the channel performance of both air-to-ground (A2G) and air-to-air (A2A) links. Then, since redundancy or information loss during transmission degrades monitoring performance, a mobile edge computing (MEC) based on-board processing algorithm is proposed. Finally, the performance of the proposed model and algorithm is verified through both simulations and experiments. In detail, the redundant data filtered out by the proposed algorithm accounts for 53.48%, and the supplementary data accounts for 16.42% of the optimized data. This work designs a UAV monitoring framework and proposes an algorithm to enhance the observability of trajectory surveillance, which helps improve airspace safety and enhance air traffic flow management.

Revisiting the Generalization Problem of Low-level Vision Models Through the Lens of Image Deraining

Feb 18, 2025

Jinfan Hu, Zhiyuan You, Jinjin Gu, Kaiwen Zhu, Tianfan Xue, Chao Dong

Abstract:Generalization remains a significant challenge for low-level vision models, which often struggle with unseen degradations in real-world scenarios despite their success in controlled benchmarks. In this paper, we revisit the generalization problem in low-level vision models. Image deraining is selected as a case study due to its well-defined and easily decoupled structure, allowing for more effective observation and analysis. Through comprehensive experiments, we reveal that the generalization issue is not primarily due to limited network capacity but rather the failure of existing training strategies, which leads networks to overfit specific degradation patterns. Our findings show that guiding networks to focus on learning the underlying image content, rather than the degradation patterns, is key to improving generalization. We demonstrate that balancing the complexity of background images and degradations in the training data helps networks better fit the image distribution. Furthermore, incorporating content priors from pre-trained generative models significantly enhances generalization. Experiments on both image deraining and image denoising validate the proposed strategies. We believe the insights and solutions will inspire further research and improve the generalization of low-level vision models.

* arXiv admin note: substantial text overlap with arXiv:2305.15134

Generative AI-Enhanced Cooperative MEC of UAVs and Ground Stations for Unmanned Surface Vehicles

Feb 12, 2025

Jiahao You, Ziye Jia, Chao Dong, Qihui Wu, Zhu Han

Abstract:The increasing deployment of unmanned surface vehicles (USVs) requires computational support and coverage in applications such as maritime search and rescue. Unmanned aerial vehicles (UAVs) can offer low-cost, flexible aerial services, and ground stations (GSs) can provide powerful support, so the two can cooperate to help USVs in complex scenarios. However, the collaboration between UAVs and GSs for USVs faces challenges of task uncertainties, USV trajectory uncertainties, heterogeneities, and limited computational resources. To address these issues, we propose a cooperative UAV and GS based robust multi-access edge computing framework to assist USVs in completing computational tasks. Specifically, we formulate the optimization problem of joint task offloading and UAV trajectory to minimize the total execution time, which takes the form of a mixed integer nonlinear program and is NP-hard. Therefore, we propose the algorithm of generative artificial intelligence-enhanced heterogeneous agent proximal policy optimization (GAI-HAPPO). The proposed algorithm integrates GAI models to enhance the actor network's ability to model complex environments and extract high-level features, thereby allowing the algorithm to predict uncertainties and adapt to dynamic conditions. Additionally, GAI stabilizes the critic network, addressing the instability of multi-agent reinforcement learning approaches. Finally, extensive simulations demonstrate that the proposed algorithm outperforms the existing benchmark methods, highlighting its potential for tackling intricate, cross-domain issues in the considered scenarios.

Teaching Large Language Models to Regress Accurate Image Quality Scores using Score Distribution

Jan 20, 2025

Zhiyuan You, Xin Cai, Jinjin Gu, Tianfan Xue, Chao Dong

Abstract:With the rapid advancement of Multi-modal Large Language Models (MLLMs), MLLM-based Image Quality Assessment (IQA) methods have shown promising performance in linguistic quality description. However, current methods still fall short in accurately scoring image quality. In this work, we aim to leverage MLLMs to regress accurate quality scores. A key challenge is that the quality score is inherently continuous, typically modeled as a Gaussian distribution, whereas MLLMs generate discrete token outputs. This mismatch necessitates score discretization. Previous approaches discretize the mean score into a one-hot label, resulting in information loss and failing to capture inter-image relationships. We propose a distribution-based approach that discretizes the score distribution into a soft label. This method preserves the characteristics of the score distribution, achieving high accuracy and maintaining inter-image relationships. Moreover, to address dataset variation, where different IQA datasets exhibit various distributions, we introduce a fidelity loss based on Thurstone's model. This loss captures intra-dataset relationships, facilitating co-training across multiple IQA datasets. With these designs, we develop the distribution-based Depicted image Quality Assessment model for Score regression (DeQA-Score). Experiments across multiple benchmarks show that DeQA-Score consistently outperforms baselines in score regression. DeQA-Score can also predict a score distribution that closely aligns with human annotations. Code and model weights are available at https://depictqa.github.io/deqa-score/.
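
The discretization idea in this abstract can be illustrated with a minimal sketch. It is not the released DeQA-Score code: the five levels centered at 1 to 5, the unit-width bins, the renormalization step, and every function name below are assumptions made for illustration. The sketch integrates a Gaussian score distribution over each level's bin to form a soft label, contrasts it with a one-hot label built from the mean alone, and shows one common form of a Thurstone-style pairwise preference with a fidelity loss.

```python
import math

# Assumed setup: five quality levels centered at 1..5, each covering a bin of width 1.
LEVELS = (1.0, 2.0, 3.0, 4.0, 5.0)


def normal_cdf(x, mean, std):
    """Cumulative distribution function of a Gaussian N(mean, std^2)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def one_hot_label(mean, centers=LEVELS):
    """Baseline: keep only the level nearest to the mean score."""
    nearest = min(range(len(centers)), key=lambda i: abs(centers[i] - mean))
    return [1.0 if i == nearest else 0.0 for i in range(len(centers))]


def soft_label(mean, std, centers=LEVELS, half_width=0.5):
    """Soft label: Gaussian probability mass inside each level's bin, renormalized."""
    if std <= 0:
        raise ValueError("std must be positive")
    masses = [
        normal_cdf(c + half_width, mean, std) - normal_cdf(c - half_width, mean, std)
        for c in centers
    ]
    total = sum(masses)
    if total == 0.0:
        # Distribution lies entirely outside the bins; fall back to the one-hot label.
        return one_hot_label(mean, centers)
    return [m / total for m in masses]


def expected_score(probs, centers=LEVELS):
    """Recover a continuous score as the expectation over the levels."""
    return sum(p * c for p, c in zip(probs, centers))


def thurstone_preference(mean_a, std_a, mean_b, std_b):
    """Probability that image A is rated above image B for independent Gaussian scores."""
    return normal_cdf(mean_a - mean_b, 0.0, math.sqrt(std_a ** 2 + std_b ** 2))


def fidelity_loss(p_true, p_pred):
    """Fidelity loss between two binary preference probabilities (0 when they match)."""
    return 1.0 - math.sqrt(p_true * p_pred) - math.sqrt((1.0 - p_true) * (1.0 - p_pred))


if __name__ == "__main__":
    probs = soft_label(3.4, 0.6)
    print("soft label:", [round(p, 4) for p in probs])
    print("score recovered from soft label:", round(expected_score(probs), 4))
    print("score recovered from one-hot label:", expected_score(one_hot_label(3.4)))
```

In this toy setting the one-hot label collapses a mean of 3.4 to level 3, while the expectation over the soft label stays close to the original mean, which is the information-loss argument the abstract makes. Because the pairwise preference depends only on score differences within one dataset, a loss of this kind does not require the datasets to share a common score scale.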

An Intelligent Agentic System for Complex Image Restoration Problems

Oct 23, 2024

Kaiwen Zhu, Jinjin Gu, Zhiyuan You, Yu Qiao, Chao Dong

Abstract:Real-world image restoration (IR) is inherently complex and often requires combining multiple specialized models to address diverse degradations. Inspired by human problem-solving, we propose AgenticIR, an agentic system that mimics the human approach to image processing by following five key stages: Perception, Scheduling, Execution, Reflection, and Rescheduling. AgenticIR leverages large language models (LLMs) and vision-language models (VLMs) that interact via text generation to dynamically operate a toolbox of IR models. We fine-tune VLMs for image quality analysis and employ LLMs for reasoning, guiding the system step by step. To compensate for LLMs' lack of specific IR knowledge and experience, we introduce a self-exploration method, allowing the LLM to observe and summarize restoration results into referenceable documents. Experiments demonstrate AgenticIR's potential in handling complex IR tasks, representing a promising path toward achieving general intelligence in visual processing.
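
The five stages named in this abstract can be sketched as a toy control loop. This is only an illustration under strong assumptions, not the AgenticIR system: the "image" is modeled as a set of degradation names, the toolbox functions stand in for real IR models, and the rule that deblurring fails while noise is present is invented so that the reflection and rescheduling steps have something to do. All names are hypothetical.

```python
# Toy stand-ins for IR models. An "image" is a frozenset of degradation names.
def denoise(image):
    return image - {"noise"}


def deblur(image):
    # Invented rule: this toy deblurring tool has no effect while noise is present.
    return image if "noise" in image else image - {"blur"}


def derain(image):
    return image - {"rain"}


TOOLBOX = {"noise": denoise, "blur": deblur, "rain": derain}


def perceive(image):
    """Perception: list the degradations found (a VLM would do this in practice)."""
    return sorted(image)


def schedule(degradations):
    """Scheduling: propose an initial order (an LLM would reason about it in practice)."""
    return list(degradations)


def execute(image, task):
    """Execution: run the tool associated with the task."""
    return TOOLBOX[task](image)


def reflect(result, task):
    """Reflection: check whether the targeted degradation is gone."""
    return task not in result


def reschedule(plan, failed_task):
    """Rescheduling: defer the failed task to the end of the remaining plan."""
    return plan + [failed_task]


def restore(image, max_steps=10):
    """Run the five-stage loop and return the final image with a step log."""
    plan = schedule(perceive(image))
    log = []
    steps = 0
    while plan and steps < max_steps:
        task = plan.pop(0)
        result = execute(image, task)
        if reflect(result, task):
            image = result  # accept the successful result
            log.append((task, "ok"))
        else:
            plan = reschedule(plan, task)  # discard the result and try later
            log.append((task, "rescheduled"))
        steps += 1
    return image, log


if __name__ == "__main__":
    final_image, log = restore(frozenset({"blur", "noise", "rain"}))
    print("remaining degradations:", sorted(final_image))
    for entry in log:
        print(entry)
```

The `max_steps` cap is a design choice of the sketch that keeps the loop finite when a task can never succeed; the abstract does not describe how the actual system bounds its retries.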

A Preliminary Exploration Towards General Image Restoration

Aug 27, 2024

Xiangtao Kong, Jinjin Gu, Yihao Liu, Wenlong Zhang, Xiangyu Chen, Yu Qiao, Chao Dong

Abstract:Despite the tremendous success of deep models in various individual image restoration tasks, there are at least two major technical challenges preventing these works from being applied to real-world use: (1) the lack of generalization ability and (2) the complex and unknown degradations in real-world scenarios. Existing deep models, tailored for specific individual image restoration tasks, often fall short in effectively addressing these challenges. In this paper, we present a new problem called general image restoration (GIR), which aims to address these challenges within a unified model. GIR covers most individual image restoration tasks (e.g., image denoising, deblurring, deraining and super-resolution) and their combinations for general purposes. This paper proceeds to delineate the essential aspects of GIR, including problem definition and the overarching significance of generalization performance. Moreover, the establishment of new datasets and a thorough evaluation framework for GIR models is discussed. We conduct a comprehensive evaluation of existing approaches for tackling the GIR challenge, illuminating their strengths and pragmatic challenges. By analyzing these approaches, we not only underscore the effectiveness of GIR but also highlight the difficulties in its practical implementation. Finally, we also try to understand and interpret these models' behaviors to inspire future directions. Our work can open up new valuable research directions and contribute to the research of general vision.

Learning A Low-Level Vision Generalist via Visual Task Prompt

Aug 16, 2024

Xiangyu Chen, Yihao Liu, Yuandong Pu, Wenlong Zhang, Jiantao Zhou, Yu Qiao, Chao Dong

Abstract:Building a unified model for general low-level vision tasks holds significant research and practical value. Current methods encounter several critical issues. Multi-task restoration approaches can address multiple degradation-to-clean restoration tasks, while their applicability to tasks with different target domains (e.g., image stylization) is limited. Methods like PromptGIP can handle multiple input-target domains but rely on the Masked Autoencoder (MAE) paradigm. Consequently, they are tied to the ViT architecture, resulting in suboptimal image reconstruction quality. In addition, these methods are sensitive to prompt image content and often struggle with low-frequency information processing. In this paper, we propose a Visual task Prompt-based Image Processing (VPIP) framework to overcome these challenges. VPIP employs visual task prompts to manage tasks with different input-target domains and allows flexible selection of backbone network suitable for general tasks. Besides, a new prompt cross-attention is introduced to facilitate interaction between the input and prompt information. Based on the VPIP framework, we train a low-level vision generalist model, namely GenLV, on 30 diverse tasks. Experimental results show that GenLV can successfully address a variety of low-level tasks, significantly outperforming existing methods both quantitatively and qualitatively. Codes are available at https://github.com/chxy95/GenLV.

* Accepted to ACMMM24

Interpreting Low-level Vision Models with Causal Effect Maps

Jul 29, 2024

Jinfan Hu, Jinjin Gu, Shiyao Yu, Fanghua Yu, Zheyuan Li, Zhiyuan You, Chaochao Lu, Chao Dong

Abstract:Deep neural networks have significantly improved the performance of low-level vision tasks but also increased the difficulty of interpretability. A deep understanding of deep models is beneficial for both network design and practical reliability. To take up this challenge, we introduce causality theory to interpret low-level vision models and propose a model-/task-agnostic method called Causal Effect Map (CEM). With CEM, we can visualize and quantify the input-output relationships on either positive or negative effects. After analyzing various low-level vision tasks with CEM, we have reached several interesting insights, such as: (1) Using more information of input images (e.g., larger receptive field) does NOT always yield positive outcomes. (2) Attempting to incorporate mechanisms with a global receptive field (e.g., channel attention) into image denoising may prove futile. (3) Integrating multiple tasks to train a general model could encourage the network to prioritize local information over global context. Based on the causal effect theory, the proposed diagnostic tool can refresh our common knowledge and bring a deeper understanding of low-level vision models. Codes are available at https://github.com/J-FHu/CEM.
