Kai Wu

BaseReward: A Strong Baseline for Multimodal Reward Model

Sep 19, 2025

Yi-Fan Zhang, Haihua Yang, Huanyu Zhang, Yang Shi, Zezhou Chen, Haochen Tian, Chaoyou Fu, Haotian Wang, Kai Wu, Bo Cui(+4 more)

Abstract: The rapid advancement of Multimodal Large Language Models (MLLMs) has made aligning them with human preferences a critical challenge. Reward Models (RMs) are a core technology for achieving this goal, but a systematic guide for building state-of-the-art Multimodal Reward Models (MRMs) is currently lacking in both academia and industry. Through exhaustive experimental analysis, this paper aims to provide a clear "recipe" for constructing high-performance MRMs. We systematically investigate every crucial component in the MRM development pipeline, including reward modeling paradigms (e.g., Naive-RM, Critic-based RM, and Generative RM), reward head architecture, training strategies, data curation (covering over ten multimodal and text-only preference datasets), backbone model and model scale, and ensemble methods. Based on these experimental insights, we introduce BaseReward, a powerful and efficient baseline for multimodal reward modeling. BaseReward adopts a simple yet effective architecture, built upon a Qwen2.5-VL backbone, featuring an optimized two-layer reward head, and is trained on a carefully curated mixture of high-quality multimodal and text-only preference data. Our results show that BaseReward establishes a new SOTA on major benchmarks such as MM-RLHF-Reward Bench, VL-Reward Bench, and Multimodal Reward Bench, outperforming previous models. Furthermore, to validate its practical utility beyond static benchmarks, we integrate BaseReward into a real-world reinforcement learning pipeline, successfully enhancing an MLLM's performance across various perception, reasoning, and conversational tasks. This work not only delivers a top-tier MRM but, more importantly, provides the community with a clear, empirically-backed guide for developing robust reward models for the next generation of MLLMs.
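
As a rough illustration of what a two-layer reward head trained on preference pairs can look like, the sketch below maps a feature vector to a scalar reward and scores a (chosen, rejected) pair. The ReLU activation, the Bradley-Terry style loss, and all function and parameter names are assumptions made for illustration; the abstract states only that the head has two layers and does not specify the activation or the training objective.

```python
import math


def two_layer_reward_head(features, w1, b1, w2, b2):
    """Map a feature vector to a scalar reward via one hidden ReLU layer.

    Hypothetical shapes: w1 is a list of rows (hidden x input), b1 has one
    bias per hidden unit, w2 has one weight per hidden unit, b2 is a scalar.
    """
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, features)) + bias)
        for row, bias in zip(w1, b1)
    ]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry style loss (assumed): -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward_chosen - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss shrinks as the chosen response is scored further above the rejected one, which is the behavior a preference-trained reward model needs regardless of the exact head design.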

Towards SISO Bistatic Sensing for ISAC

Aug 18, 2025

Zhongqin Wang, J. Andrew Zhang, Kai Wu, Min Xu, Y. Jay Guo

Abstract: Integrated Sensing and Communication (ISAC) is a key enabler for next-generation wireless systems. However, real-world deployment is often limited to low-cost, single-antenna transceivers. In such a bistatic Single-Input Single-Output (SISO) setup, clock asynchrony introduces random phase offsets in Channel State Information (CSI), which cannot be mitigated using conventional multi-antenna methods. This work proposes WiDFS 3.0, a lightweight bistatic SISO sensing framework that enables accurate delay and Doppler estimation from distorted CSI by effectively suppressing Doppler mirroring ambiguity. It operates with only a single antenna at both the transmitter and receiver, making it suitable for low-complexity deployments. We propose a self-referencing cross-correlation (SRCC) method for SISO random phase removal and employ delay-domain beamforming to resolve Doppler ambiguity. The resulting unambiguous delay-Doppler-time features enable robust sensing with compact neural networks. Extensive experiments show that WiDFS 3.0 achieves accurate parameter estimation, with performance comparable to or even surpassing that of prior multi-antenna methods, especially in delay estimation. Validated under single- and multi-target scenarios, the extracted ambiguity-resolved features show strong sensing accuracy and generalization. For example, when deployed on the embedded-friendly MobileViT-XXS with only 1.3M parameters, WiDFS 3.0 consistently outperforms conventional features such as CSI amplitude, mirrored Doppler, and multi-receiver aggregated Doppler.

Water Level Sensing via Communication Signals in a Bi-Static System

May 26, 2025

Zhongqin Wang, J. Andrew Zhang, Kai Wu, Y. Jay Guo

Abstract: Accurate water level sensing is essential for flood monitoring, agricultural irrigation, and water resource optimization. Traditional methods require dedicated sensor deployments, leading to high installation costs, vulnerability to interference, and limited resolution. This work proposes PMNs-WaterSense, a novel scheme leveraging Channel State Information (CSI) from existing mobile networks for water level sensing. Our scheme begins with a CSI-power method to eliminate phase offsets caused by clock asynchrony in bi-static systems. We then apply multi-domain filtering across the time (Doppler), frequency (delay), and spatial (Angle-of-Arrival, AoA) domains to extract phase features that finely capture variations in path length over water. To resolve the $2\pi$ phase ambiguity, we introduce a Kalman filter-based unwrapping technique. Additionally, we exploit transceiver geometry to convert path length variations into water level height changes, even with limited antenna configurations. We validate our framework through controlled experiments with 28 GHz mmWave and 3.1 GHz LTE signals in real time, achieving average height estimation errors of 0.025 cm and 0.198 cm, respectively. Moreover, real-world river monitoring with 2.6 GHz LTE signals achieves an average error of 4.8 cm for a 1-meter water level change, demonstrating its effectiveness in practical deployments.
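
The link between phase and path length can be sketched in a few lines. The example below uses plain difference-based unwrapping rather than the Kalman filter-based technique the abstract describes, and it assumes the sign convention that phase grows by $2\pi$ for every wavelength of added path length. The function names and the sample-to-sample limit of less than $\pi$ are assumptions for illustration only.

```python
import math


def unwrap_phase(wrapped):
    """Remove 2*pi jumps, assuming consecutive samples differ by less than pi.

    This is simple difference-based unwrapping, not a Kalman filter.
    """
    unwrapped = [wrapped[0]]
    for current, previous in zip(wrapped[1:], wrapped[:-1]):
        step = current - previous
        # Fold the raw step into the interval [-pi, pi).
        step = (step + math.pi) % (2 * math.pi) - math.pi
        unwrapped.append(unwrapped[-1] + step)
    return unwrapped


def path_length_change(unwrapped, wavelength_m):
    """Convert phase change relative to the first sample into metres.

    Assumed convention: delta_d = delta_phi * wavelength / (2 * pi).
    """
    return [(p - unwrapped[0]) * wavelength_m / (2 * math.pi) for p in unwrapped]
```

At 28 GHz the wavelength is roughly 1.07 cm, so a full $2\pi$ of phase corresponds to about a centimetre of path length change, which is why unwrapping matters for tracking larger level changes.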

Bayesian Sensing for Time-Varying Channels in ISAC Systems

Apr 21, 2025

Xueyang Wang, Kai Wu, J. Andrew Zhang, Shiqi Gong, Chengwen Xing

Abstract: Future mobile networks are projected to support integrated sensing and communications (ISAC) in high-speed communication scenarios. Nevertheless, large Doppler shifts induced by time-varying channels may cause severe inter-carrier interference (ICI). The frequency domain shows potential for reducing ISAC complexity compared with other domains. However, the parameter mismatch issue still exists for such sensing. In this paper, we develop a novel sensing scheme based on a sparse Bayesian framework, where the delay and Doppler estimation problem in time-varying channels is formulated as a 3D multiple measurement-sparse signal recovery (MM-SSR) problem. We then propose a novel two-layer variational Bayesian inference (VBI) method to decompose the 3D MM-SSR problem into two layers and estimate the Doppler in the first layer and the delay in the second layer alternately. Subsequently, benefiting from a newly unveiled signal construction, a simplified two-stage multiple signal classification (MUSIC)-based VBI method is proposed, where the delay and the Doppler are estimated by MUSIC and VBI, respectively. Additionally, the Cramér-Rao bound (CRB) of the considered sensing parameters is derived to characterize the lower bound for the proposed estimators. Corroborated by extensive simulation results, our proposed method achieves a lower mean square error (MSE) than its conventional counterparts and is robust to the number and speed of targets, thereby validating its wide applicability and advantages over prior art.

* 14 pages, 8 figures, manuscript submitted to IEEE Transactions on Communications (TCOM)

TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection

Apr 01, 2025

Zhiming Ma, Peidong Wang, Minhua Huang, Jingpeng Wang, Kai Wu, Xiangzhao Lv, Yachun Pang, Yin Yang, Wenjie Tang, Yuchen Kang

Abstract: The detection of telecom fraud faces significant challenges due to the lack of high-quality multimodal training data that integrates audio signals with reasoning-oriented textual analysis. To address this gap, we present TeleAntiFraud-28k, the first open-source audio-text slow-thinking dataset specifically designed for automated telecom fraud analysis. Our dataset is constructed through three strategies: (1) Privacy-preserved text-truth sample generation using automatic speech recognition (ASR)-transcribed call recordings (with anonymized original audio), ensuring real-world consistency through text-to-speech (TTS) model regeneration; (2) Semantic enhancement via large language model (LLM)-based self-instruction sampling on authentic ASR outputs to expand scenario coverage; (3) Multi-agent adversarial synthesis that simulates emerging fraud tactics through predefined communication scenarios and fraud typologies. The generated dataset contains 28,511 rigorously processed speech-text pairs, complete with detailed annotations for fraud reasoning. The dataset is divided into three tasks: scenario classification, fraud detection, and fraud type classification. Furthermore, we construct TeleAntiFraud-Bench, a standardized evaluation benchmark comprising proportionally sampled instances from the dataset, to facilitate systematic testing of model performance on telecom fraud detection tasks. We also contribute a production-optimized supervised fine-tuning (SFT) model trained on hybrid real/synthetic data, while open-sourcing the data processing framework to enable community-driven dataset expansion. This work establishes a foundational framework for multimodal anti-fraud research while addressing critical challenges in data privacy and scenario diversity. The project will be released at https://github.com/JimmyMa99/TeleAntiFraud.

Mitigating Data Scarcity in Time Series Analysis: A Foundation Model with Series-Symbol Data Generation

Feb 21, 2025

Wenxuan Wang, Kai Wu, Yujian Betterest Li, Dan Wang, Xiaoyu Zhang, Jing Liu

Abstract: Foundation models for time series analysis (TSA) have attracted significant attention. However, challenges such as data scarcity and data imbalance continue to hinder their development. To address this, we consider modeling complex systems through symbolic expressions that serve as semantic descriptors of time series. Building on this concept, we introduce a series-symbol (S2) dual-modality data generation mechanism, enabling the unrestricted creation of high-quality time series data paired with corresponding symbolic representations. Leveraging the S2 dataset, we develop SymTime, a pre-trained foundation model for TSA. SymTime demonstrates competitive performance across five major TSA tasks when fine-tuned on downstream tasks, rivaling foundation models pre-trained on real-world datasets. This approach underscores the potential of dual-modality data generation and pretraining mechanisms in overcoming data scarcity and enhancing task performance.

Enhancing Brain Tumor Segmentation Using Channel Attention and Transfer learning

Jan 19, 2025

Majid Behzadpour, Ebrahim Azizi, Kai Wu, Bengie L. Ortiz

Abstract: Accurate and efficient segmentation of brain tumors is critical for diagnosis, treatment planning, and monitoring in clinical practice. In this study, we present an enhanced ResUNet architecture for automatic brain tumor segmentation, integrating an EfficientNetB0 encoder, a channel attention mechanism, and an Atrous Spatial Pyramid Pooling (ASPP) module. The EfficientNetB0 encoder leverages pre-trained features to improve feature extraction efficiency, while the channel attention mechanism enhances the model's focus on tumor-relevant features. ASPP enables multiscale contextual learning, crucial for handling tumors of varying sizes and shapes. The proposed model was evaluated on two benchmark datasets: TCGA LGG and BraTS 2020. Experimental results demonstrate that our method consistently outperforms the baseline ResUNet and its EfficientNet variant, achieving Dice coefficients of 0.903 and 0.851 and HD95 scores of 9.43 and 3.54 for whole tumor and tumor core regions on the BraTS 2020 dataset, respectively. Compared with state-of-the-art methods, our approach shows competitive performance, particularly in whole tumor and tumor core segmentation. These results indicate that combining a powerful encoder with attention mechanisms and ASPP can significantly enhance brain tumor segmentation performance. The proposed approach holds promise for further optimization and application in other medical image segmentation tasks.

* 13 pages, 1 figure
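
For readers unfamiliar with the Dice coefficient reported above, the following is a minimal sketch of how it is computed for a pair of binary masks. The function name and the nested-list mask representation are illustrative assumptions, and treating two empty masks as a perfect match is a common convention rather than something the abstract specifies.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for 2D binary masks."""
    pred_flat = [int(bool(v)) for row in pred for v in row]
    target_flat = [int(bool(v)) for row in target for v in row]
    intersection = sum(p * t for p, t in zip(pred_flat, target_flat))
    total = sum(pred_flat) + sum(target_flat)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total
```

A score of 1.0 means the predicted and reference masks overlap exactly, and 0.0 means they share no foreground pixels.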

AutoSGNN: Automatic Propagation Mechanism Discovery for Spectral Graph Neural Networks

Dec 17, 2024

Shibing Mo, Kai Wu, Qixuan Gao, Xiangyi Teng, Jing Liu

Abstract: In real-world applications, spectral Graph Neural Networks (GNNs) are powerful tools for processing diverse types of graphs. However, a single GNN often struggles to handle different graph types, such as homogeneous and heterogeneous graphs, simultaneously. This challenge has led to the manual design of GNNs tailored to specific graph types, but these approaches are limited by the high cost of labor and the constraints of expert knowledge, which cannot keep up with the rapid growth of graph data. To overcome these challenges, we propose AutoSGNN, an automated framework for discovering propagation mechanisms in spectral GNNs. AutoSGNN unifies the search space for spectral GNNs by integrating large language models with evolutionary strategies to automatically generate architectures that adapt to various graph types. Extensive experiments on nine widely-used datasets, encompassing both homophilic and heterophilic graphs, demonstrate that AutoSGNN outperforms state-of-the-art spectral GNNs and graph neural architecture search methods in both performance and efficiency.

Exploring Real&Synthetic Dataset and Linear Attention in Image Restoration

Dec 05, 2024

Yuzhen Du, Teng Hu, Jiangning Zhang, Ran Yi, Chengming Xu, Xiaobin Hu, Kai Wu, Donghao Luo, Yabiao Wang, Lizhuang Ma

Abstract: Image Restoration (IR) aims to restore degraded images, with deep learning, especially CNNs and Transformers, enhancing performance. However, there is no unified training benchmark for IR. We identified a bias in image complexity between training and testing datasets, affecting restoration quality. To address this, we created ReSyn, a large-scale IR dataset with balanced complexity, including real and synthetic images. We also established a unified training standard for IR models. Our RWKV-IR model integrates linear complexity RWKV into transformers for global and local receptive fields. It replaces Q-Shift with Depth-wise Convolution for local dependencies and combines Bi-directional attention for global-local awareness. The Cross-Bi-WKV module balances horizontal and vertical attention. Experiments show RWKV-IR's effectiveness in image restoration.

Breast Tumor Classification Using EfficientNet Deep Learning Model

Nov 26, 2024

Majid Behzadpour, Bengie L. Ortiz, Ebrahim Azizi, Kai Wu

Abstract: Precise breast cancer classification on histopathological images has the potential to greatly improve diagnosis and patient outcomes in oncology. The data imbalance problem largely stems from the inherent imbalance within medical image datasets, where certain tumor subtypes may appear much less frequently. This is a considerable limitation, as it leads to biased model predictions that can overlook critical but rare classes. In this work, we adopted EfficientNet, a state-of-the-art convolutional neural network (CNN) model that balances high accuracy with computational cost efficiency. To address data imbalance, we introduce an intensive data augmentation pipeline and cost-sensitive learning, improving representation and ensuring that the model does not overly favor majority classes. This approach enables the model to learn effectively from rare tumor types, improving its robustness. Additionally, we fine-tuned the model using transfer learning, where weights initially trained on a binary classification task were adapted to multi-class classification, improving the capability to detect complex patterns within the BreakHis dataset. Our results underscore significant improvements in binary classification performance, with recall for benign cases increasing from 0.92 to 0.95 and accuracy rising from 97.35% to 98.23%. Our approach improved the accuracy of multi-class tasks from 91.27% with regular augmentation to 94.54% with intensive augmentation, reaching 95.04% with transfer learning. This framework demonstrated substantial gains in precision in the minority classes, such as Mucinous carcinoma and Papillary carcinoma, while maintaining high recall consistently across these critical subtypes, as further confirmed by confusion matrix analysis.

* 19 pages, 7 figures
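
One common way to make a loss cost-sensitive is to weight each class inversely to its frequency. The sketch below shows that idea under stated assumptions: the abstract does not say which weighting scheme the authors used, so the "balanced" formula here, along with the function names, is purely illustrative.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    This "balanced" heuristic is an assumed choice, not the paper's stated one.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def weighted_nll(prob_true_class, label, weights):
    """Negative log-likelihood of the true class, scaled by its class weight."""
    return -weights[label] * math.log(prob_true_class)
```

With this weighting, a misclassified sample from a rare subtype contributes more to the loss than one from a majority class, which discourages the model from ignoring minority classes.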
