Jiaqi Xu

Dual Prompting Image Restoration with Diffusion Transformers

Apr 24, 2025

Dehong Kong, Fan Li, Zhixin Wang, Jiaqi Xu, Renjing Pei, Wenbo Li, WenQi Ren

Abstract:Recent state-of-the-art image restoration methods mostly adopt latent diffusion models with U-Net backbones, yet they still face challenges in achieving high-quality restoration due to their limited capabilities. Diffusion transformers (DiTs), like SD3, are emerging as a promising alternative because of their better quality with scalability. In this paper, we introduce DPIR (Dual Prompting Image Restoration), a novel image restoration method that effectively extracts conditional information of low-quality images from multiple perspectives. Specifically, DPIR consists of two branches: a low-quality image conditioning branch and a dual prompting control branch. The first branch utilizes a lightweight module to incorporate image priors into the DiT with high efficiency. More importantly, we believe that in image restoration, textual description alone cannot fully capture an image's rich visual characteristics. Therefore, a dual prompting module is designed to provide DiT with additional visual cues, capturing both global context and local appearance. The extracted global-local visual prompts serve as extra conditional control and, together with textual prompts, form dual prompts that greatly enhance the quality of the restoration. Extensive experimental results demonstrate that DPIR delivers superior image restoration performance.

* CVPR 2025

PsyCounAssist: A Full-Cycle AI-Powered Psychological Counseling Assistant System

Apr 23, 2025

Xianghe Liu, Jiaqi Xu, Tao Sun

Abstract:Psychological counseling is a highly personalized and dynamic process that requires therapists to continuously monitor emotional changes, document session insights, and maintain therapeutic continuity. In this paper, we introduce PsyCounAssist, a comprehensive AI-powered counseling assistant system specifically designed to augment psychological counseling practices. PsyCounAssist integrates multimodal emotion recognition combining speech and photoplethysmography (PPG) signals for accurate real-time affective analysis, automated structured session reporting using large language models (LLMs), and personalized AI-generated follow-up support. Deployed on Android-based tablet devices, the system demonstrates practical applicability and flexibility in real-world counseling scenarios. Experimental evaluation confirms the reliability of PPG-based emotional classification and highlights the system's potential for non-intrusive, privacy-aware emotional support. PsyCounAssist represents a novel approach to ethically and effectively integrating AI into psychological counseling workflows.

Turbo2K: Towards Ultra-Efficient and High-Quality 2K Video Synthesis

Apr 20, 2025

Jingjing Ren, Wenbo Li, Zhongdao Wang, Haoze Sun, Bangzhen Liu, Haoyu Chen, Jiaqi Xu, Aoxue Li, Shifeng Zhang, Bin Shao (+2 more)

Abstract:Demand for 2K video synthesis is rising with increasing consumer expectations for ultra-clear visuals. While diffusion transformers (DiTs) have demonstrated remarkable capabilities in high-quality video generation, scaling them to 2K resolution remains computationally prohibitive due to quadratic growth in memory and processing costs. In this work, we propose Turbo2K, an efficient and practical framework for generating detail-rich 2K videos while significantly improving training and inference efficiency. First, Turbo2K operates in a highly compressed latent space, reducing computational complexity and memory footprint, making high-resolution video synthesis feasible. However, the high compression ratio of the VAE and limited model size impose constraints on generative quality. To mitigate this, we introduce a knowledge distillation strategy that enables a smaller student model to inherit the generative capacity of a larger, more powerful teacher model. Our analysis reveals that, despite differences in latent spaces and architectures, DiTs exhibit structural similarities in their internal representations, facilitating effective knowledge transfer. Second, we design a hierarchical two-stage synthesis framework that first generates multi-level features at lower resolutions before guiding high-resolution video generation. This approach ensures structural coherence and fine-grained detail refinement while eliminating redundant encoding-decoding overhead, further enhancing computational efficiency. Turbo2K achieves state-of-the-art efficiency, generating 5-second, 24fps, 2K videos with significantly reduced computational cost. Compared to existing methods, Turbo2K is up to 20$\times$ faster for inference, making high-resolution video generation more scalable and practical for real-world applications.

* Webpage at https://jingjingrenabc.github.io/turbo2k/
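
The quadratic scaling mentioned in the abstract can be illustrated with a back-of-the-envelope sketch. It assumes the quadratic term comes from full self-attention over latent tokens, takes "2K" to mean 2560x1440 with a 1280x720 baseline, and uses hypothetical VAE compression factors; none of these numbers come from the paper. Only the clip length (5 seconds at 24fps, i.e. 120 frames) is taken from the abstract.

```python
def latent_tokens(width, height, frames, spatial_down, temporal_down):
    """Token count seen by the transformer for one clip.

    spatial_down and temporal_down are hypothetical VAE compression
    factors (assumptions for illustration, not values from the paper).
    """
    w = width // spatial_down
    h = height // spatial_down
    t = max(1, frames // temporal_down)
    return w * h * t


def attention_pairs(n_tokens):
    """Full self-attention compares every token with every token."""
    return n_tokens * n_tokens


FRAMES = 5 * 24  # 5-second clip at 24fps, as stated in the abstract

# Assumed resolutions and an assumed 8x spatial / 4x temporal compression.
tokens_720p = latent_tokens(1280, 720, FRAMES, spatial_down=8, temporal_down=4)
tokens_2k = latent_tokens(2560, 1440, FRAMES, spatial_down=8, temporal_down=4)

# A more aggressive (again hypothetical) 32x spatial compression at 2K.
tokens_2k_compressed = latent_tokens(2560, 1440, FRAMES, spatial_down=32, temporal_down=4)

if __name__ == "__main__":
    print("720p tokens:", tokens_720p)
    print("2K tokens:", tokens_2k)
    print("pairwise cost ratio 2K / 720p:",
          attention_pairs(tokens_2k) // attention_pairs(tokens_720p))
    print("2K tokens with 32x compression:", tokens_2k_compressed)
```

Under these assumptions, doubling each spatial dimension multiplies the token count by 4 and the pairwise attention cost by 16, which is why a more heavily compressed latent space makes such a difference.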

MedHallTune: An Instruction-Tuning Benchmark for Mitigating Medical Hallucination in Vision-Language Models

Feb 28, 2025

Qiao Yan, Yuchen Yuan, Xiaowei Hu, Yihan Wang, Jiaqi Xu, Jinpeng Li, Chi-Wing Fu, Pheng-Ann Heng

Abstract:The increasing use of vision-language models (VLMs) in healthcare applications presents great challenges related to hallucinations, in which the models may generate seemingly plausible results that are in fact incorrect. Such hallucinations can jeopardize clinical decision making, potentially harming the diagnosis and treatments. In this work, we propose MedHallTune, a large-scale benchmark designed specifically to evaluate and mitigate hallucinations in medical VLMs. Comprising over 100,000 images and 1,000,000 instruction pairs, MedHallTune includes both hallucination and non-hallucination samples, each with ground-truth annotations. We conduct a comprehensive evaluation of current medical and general VLMs using MedHallTune, assessing their performance across key metrics, including clinical accuracy, relevance, detail level, and risk level. The experimental results show that fine-tuning with MedHallTune successfully improves the ability of several existing models to manage hallucinations and boosts their zero-shot performance on downstream visual-question-answering (VQA) tasks, making them more reliable for practical medical applications. Our work contributes to the development of more trustworthy VLMs. Codes and dataset will be available at https://github.com/russellyq/MedHallTune.

CoMA: Compositional Human Motion Generation with Multi-modal Agents

Dec 10, 2024

Shanlin Sun, Gabriel De Araujo, Jiaqi Xu, Shenghan Zhou, Hanwen Zhang, Ziheng Huang, Chenyu You, Xiaohui Xie

Abstract:3D human motion generation has seen substantial advancement in recent years. While state-of-the-art approaches have improved performance significantly, they still struggle with complex and detailed motions unseen in training data, largely due to the scarcity of motion datasets and the prohibitive cost of generating new training examples. To address these challenges, we introduce CoMA, an agent-based solution for complex human motion generation, editing, and comprehension. CoMA leverages multiple collaborative agents powered by large language and vision models, alongside a mask transformer-based motion generator featuring body part-specific encoders and codebooks for fine-grained control. Our framework enables generation of both short and long motion sequences with detailed instructions, text-guided motion editing, and self-correction for improved quality. Evaluations on the HumanML3D dataset demonstrate competitive performance against state-of-the-art methods. Additionally, we create a set of context-rich, compositional, and long text prompts, where user studies show our method significantly outperforms existing approaches.

* Project Page: https://gabrie-l.github.io/coma-page/

CAPA: Continuous-Aperture Arrays for Revolutionizing 6G Wireless Communications

Dec 01, 2024

Yuanwei Liu, Chongjun Ouyang, Zhaolin Wang, Jiaqi Xu, Xidong Mu, Zhiguo Ding

Abstract:In this paper, a novel continuous-aperture array (CAPA)-based wireless communication architecture is proposed, which relies on an electrically large aperture with a continuous current distribution. First, an existing prototype of CAPA is reviewed, followed by the potential benefits and key motivations for employing CAPAs in wireless communications. Then, three practical hardware implementation approaches for CAPAs are introduced based on electronic, optical, and acoustic materials. Furthermore, several beamforming approaches are proposed to optimize the continuous current distributions of CAPAs, which are fundamentally different from those used for conventional spatially discrete arrays (SPDAs). Numerical results are provided to demonstrate their key features in low complexity and near-optimality. Based on these proposed approaches, the performance gains of CAPAs over SPDAs are revealed in terms of channel capacity as well as diversity-multiplexing gains. Finally, several open research problems in CAPA are highlighted.

* 8 pages, 4 figures, 2 tables

Non-Reciprocal Reconfigurable Intelligent Surfaces

Nov 23, 2024

Jiaqi Xu, Haoyu Wang, Rang Liu, Josef A. Nossek, A. Lee Swindlehurst

Abstract:In contrast to conventional RIS, the scattering matrix of a non-reciprocal RIS (NR-RIS) is non-symmetric, leading to differences in the uplink and the downlink components of NR-RIS cascaded channels. In this paper, a physically-consistent device model is proposed in which an NR-RIS is composed of multiple groups of two-port elements inter-connected by non-reciprocal devices. The resulting non-reciprocal scattering matrix is derived for various cases including two-element groups connected with isolators or gyrators, and general three-element groups connected via circulators. Signal models are given for NR-RIS operating in either reflecting-only or simultaneously transmitting and reflecting modes. The problem of NR-RIS design for non-reciprocal beamsteering is formulated for three-element circulator implementations, and numerical results confirm that non-reciprocal beamsteering can be achieved with minimal sidelobe power. We also show that our physically consistent NR-RIS architecture is effective in implementing channel reciprocity attacks, achieving similar performance to that with idealized NR-RIS models.

* 13 Pages single Column
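
The non-symmetric scattering matrix described in the abstract can be illustrated with the textbook ideal three-port circulator. This is a minimal sketch, not the paper's device model: the matrix below is the standard lossless, perfectly matched idealization, and the helper names are hypothetical.

```python
def transpose(m):
    """Plain transpose of a square matrix given as nested lists."""
    return [list(row) for row in zip(*m)]


def conj_transpose(m):
    """Hermitian (conjugate) transpose."""
    return [[x.conjugate() for x in row] for row in transpose(m)]


def matmul(a, b):
    """Matrix product of two square matrices of equal size."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def is_reciprocal(s):
    """A network is reciprocal when its scattering matrix is symmetric."""
    return s == transpose(s)


def is_lossless(s, tol=1e-12):
    """A lossless network has a unitary scattering matrix (S^H S = I)."""
    p = matmul(conj_transpose(s), s)
    n = len(s)
    return all(abs(p[i][j] - (1 if i == j else 0)) < tol
               for i in range(n) for j in range(n))


# Ideal circulator: power entering port 1 leaves port 2, port 2 -> 3, port 3 -> 1.
# S[i][j] is the wave leaving port i+1 caused by a wave entering port j+1.
S_CIRCULATOR = [
    [0j, 0j, 1 + 0j],
    [1 + 0j, 0j, 0j],
    [0j, 1 + 0j, 0j],
]

# A reciprocal reference: an ideal two-port through connection.
S_THROUGH = [
    [0j, 1 + 0j],
    [1 + 0j, 0j],
]

if __name__ == "__main__":
    print("circulator reciprocal:", is_reciprocal(S_CIRCULATOR))
    print("circulator lossless:", is_lossless(S_CIRCULATOR))
    print("through line reciprocal:", is_reciprocal(S_THROUGH))
```

The circulator matrix is unitary but not equal to its transpose, so the forward and reverse paths through the device differ, which is the property that makes uplink and downlink cascaded channels unequal.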

InterFormer: Towards Effective Heterogeneous Interaction Learning for Click-Through Rate Prediction

Nov 15, 2024

Zhichen Zeng, Xiaolong Liu, Mengyue Hang, Xiaoyi Liu, Qinghai Zhou, Chaofei Yang, Yiqun Liu, Yichen Ruan, Laming Chen, Yuxin Chen (+16 more)

Abstract:Click-through rate (CTR) prediction, which predicts the probability of a user clicking an ad, is a fundamental task in recommender systems. The emergence of heterogeneous information, such as user profile and behavior sequences, depicts user interests from different aspects. A mutually beneficial integration of heterogeneous information is the cornerstone towards the success of CTR prediction. However, most of the existing methods suffer from two fundamental limitations, including (1) insufficient inter-mode interaction due to the unidirectional information flow between modes, and (2) aggressive information aggregation caused by early summarization, resulting in excessive information loss. To address the above limitations, we propose a novel module named InterFormer to learn heterogeneous information interaction in an interleaving style. To achieve better interaction learning, InterFormer enables bidirectional information flow for mutually beneficial learning across different modes. To avoid aggressive information aggregation, we retain complete information in each data mode and use a separate bridging arch for effective information selection and summarization. Our proposed InterFormer achieves state-of-the-art performance on three public datasets and a large-scale industrial dataset.

* 10 pages, 6 figures

Meta-Learning-Driven Adaptive Codebook Design for Near-Field Communications

Oct 10, 2024

Mianyi Zhang, Yunlong Cai, Jiaqi Xu, A. Lee Swindlehurst

Abstract:Extremely large-scale arrays (XL-arrays) and ultra-high frequencies are two key technologies for sixth-generation (6G) networks, offering higher system capacity and expanded bandwidth resources. To effectively combine these technologies, it is necessary to consider the near-field spherical-wave propagation model, rather than the traditional far-field planar-wave model. In this paper, we explore a near-field communication system comprising a base station (BS) with hybrid analog-digital beamforming and multiple mobile users. Our goal is to maximize the system's sum-rate by optimizing the near-field codebook design for hybrid precoding. To enable fast adaptation to varying user distributions, we propose a meta-learning-based framework that integrates the model-agnostic meta-learning (MAML) algorithm with a codebook learning network. Specifically, we first design a deep neural network (DNN) to learn the near-field codebook. Then, we combine the MAML algorithm with the DNN to allow rapid adaptation to different channel conditions by leveraging a well-initialized model from the outer network. Simulation results demonstrate that our proposed framework outperforms conventional algorithms, offering improved generalization and better overall performance.
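
The MAML idea referenced in the abstract, learning an initialization that adapts quickly to a new task, can be sketched on a toy problem. This is an illustration under strong assumptions: each "task" is a scalar quadratic loss with a different target (standing in for a user distribution), the model is a single parameter rather than a codebook network, and the function names and learning rates are hypothetical.

```python
def task_grad(theta, target):
    """Gradient of the toy task loss (theta - target)**2."""
    return 2.0 * (theta - target)


def adapt(theta, target, inner_lr=0.1):
    """Inner loop: one gradient step on a single task."""
    return theta - inner_lr * task_grad(theta, target)


def maml_train(task_targets, theta=0.0, inner_lr=0.1, outer_lr=0.05, steps=500):
    """Outer loop: update the shared initialization theta.

    The meta-gradient differentiates the post-adaptation loss with respect
    to theta. For this toy loss, d(adapted)/d(theta) = 1 - 2 * inner_lr,
    so the full (second-order) MAML gradient is available in closed form.
    """
    for _ in range(steps):
        meta_grad = 0.0
        for target in task_targets:
            adapted = adapt(theta, target, inner_lr)
            meta_grad += task_grad(adapted, target) * (1.0 - 2.0 * inner_lr)
        theta -= outer_lr * meta_grad / len(task_targets)
    return theta


# Hypothetical task targets, chosen only for illustration.
TARGETS = [-1.0, 2.0, 5.0]
theta_init = maml_train(TARGETS)

if __name__ == "__main__":
    print("learned initialization:", theta_init)
    new_target = 5.0
    adapted = adapt(theta_init, new_target)
    print("loss before adaptation:", (theta_init - new_target) ** 2)
    print("loss after one step:", (adapted - new_target) ** 2)
```

In this toy setting the learned initialization settles at the mean of the task targets, the point from which one inner step reduces the average post-adaptation loss the most.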

Towards Real-World Adverse Weather Image Restoration: Enhancing Clearness and Semantics with Vision-Language Models

Sep 03, 2024

Jiaqi Xu, Mengyang Wu, Xiaowei Hu, Chi-Wing Fu, Qi Dou, Pheng-Ann Heng

Abstract:This paper addresses the limitations of adverse weather image restoration approaches trained on synthetic data when applied to real-world scenarios. We formulate a semi-supervised learning framework employing vision-language models to enhance restoration performance across diverse adverse weather conditions in real-world settings. Our approach involves assessing image clearness and providing semantics using vision-language models on real data, serving as supervision signals for training restoration models. For clearness enhancement, we use real-world data, utilizing a dual-step strategy with pseudo-labels assessed by vision-language models and weather prompt learning. For semantic enhancement, we integrate real-world data by adjusting weather conditions in vision-language model descriptions while preserving semantic meaning. Additionally, we introduce an effective training strategy to bootstrap restoration performance. Our approach achieves superior results in real-world adverse weather image restoration, demonstrated through qualitative and quantitative comparisons with state-of-the-art works.

* Accepted by ECCV 2024
