Abstract:Vision-language models (VLMs) have great potential to enhance the quality of pseudo labels in semi-supervised spine segmentation by leveraging textual class prompts to generate segmentation maps, but this direction has not yet been studied. Although promising, a VLM lacks explicit constraints to ensure consistency between spine class prompts and spine unit regions, resulting in unsatisfactory multi-class segmentation map generation. In this paper, we propose CPS4, the first text-guided semi-supervised spine segmentation network that uses class prompts to enhance the quality of spine pseudo labels. Specifically, CPS4 is trained in two stages. (i) Class-specific consistency-constrained VLM pretraining stage: we propose a token- and pixel-level attention loss to optimize the consistency between class prompts and spine units, forcing each textual class prompt to be closely coupled with its target spine unit in the semantic space. (ii) Class-prompt-driven semi-supervised spine segmentation stage: using the pretrained vision-text encoder, we derive a class-specific binary segmentation map for each class in the unlabeled spine image and integrate these maps into a unified multi-class segmentation map, improving the quality of the spine pseudo labels generated by the semi-supervised spine segmentation network. Experimental results show that CPS4 achieves superior spine segmentation performance, with a Dice of 80.44% using only 5% labeled data on the public spine segmentation dataset, surpassing popular semi-supervised learning and VLM methods. Our code will be made available.
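The integration of class-specific binary maps into one multi-class map, mentioned in stage (ii) above, can be illustrated with a minimal sketch. The abstract does not state the fusion rule, so the per-pixel argmax with a background threshold used here is an assumption, and the function name `fuse_binary_maps` and the `threshold` parameter are hypothetical.

```python
import numpy as np

def fuse_binary_maps(class_probs, threshold=0.5):
    """Fuse per-class foreground probabilities into one label map.

    class_probs: array of shape (C, H, W) with values in [0, 1],
                 one map per class prompt (assumed input format).
    Returns an (H, W) integer map: 0 = background, k = class k (1-based).
    """
    class_probs = np.asarray(class_probs, dtype=float)
    best_class = class_probs.argmax(axis=0)   # most confident class per pixel
    best_score = class_probs.max(axis=0)      # its probability
    # Pixels where no class is confident enough stay background (assumption).
    labels = np.where(best_score >= threshold, best_class + 1, 0)
    return labels.astype(np.int64)

# Two classes on a 2x2 image.
probs = np.array([
    [[0.9, 0.2], [0.1, 0.4]],   # class 1
    [[0.3, 0.8], [0.2, 0.6]],   # class 2
])
fused = fuse_binary_maps(probs)
print(fused)
```

Taking the argmax guarantees that each pixel receives exactly one label, which avoids overlaps when two class prompts both respond to the same region.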
Abstract:Medical visual grounding serves as a crucial foundation for fine-grained multimodal reasoning and interpretable clinical decision support. Despite recent advances in reinforcement learning (RL) for grounding tasks, existing approaches such as Group Relative Policy Optimization (GRPO) suffer from severe reward sparsity when directly applied to medical images, primarily due to the inherent difficulty of localizing small or ambiguous regions of interest, which is further exacerbated by the rigid and suboptimal nature of fixed IoU-based reward schemes in RL. This leads to vanishing policy gradients and stagnated optimization, particularly during early training. To address this challenge, we propose MedLoc-R1, a performance-aware reward scheduling framework that progressively tightens the reward criterion in accordance with model readiness. MedLoc-R1 introduces a sliding-window performance tracker and a multi-condition update rule that automatically adjust the reward schedule from dense, easily obtainable signals to stricter, fine-grained localization requirements, while preserving the favorable properties of GRPO without introducing auxiliary networks or additional gradient paths. Experiments on three medical visual grounding benchmarks demonstrate that MedLoc-R1 consistently improves both localization accuracy and training stability over GRPO-based baselines. Our framework offers a general, lightweight, and effective solution for RL-based grounding in high-stakes medical applications. Code and checkpoints are available at https://github.com/MembrAI/MedLoc-R1.
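To make the idea of performance-aware reward scheduling concrete, here is a minimal sketch of an IoU threshold that tightens once a sliding window of recent rewards shows the model is ready. This is not the MedLoc-R1 implementation: the class name `IoUThresholdScheduler`, all default values, and the specific update conditions are assumptions chosen for illustration.

```python
from collections import deque

class IoUThresholdScheduler:
    """Hypothetical scheduler: binary IoU reward with a rising threshold."""

    def __init__(self, start=0.3, end=0.75, step=0.05, window=4, promote_at=0.6):
        self.threshold = start        # current IoU needed for a reward
        self.end = end                # strictest threshold allowed
        self.step = step              # increment per promotion
        self.window = window          # sliding-window length
        self.promote_at = promote_at  # success rate that triggers tightening
        self.history = deque(maxlen=window)

    def reward(self, iou):
        """Binary reward under the current threshold."""
        return 1.0 if iou >= self.threshold else 0.0

    def update(self, iou):
        """Record one rollout and tighten the threshold when ready."""
        hit = self.reward(iou)
        self.history.append(hit)
        window_full = len(self.history) == self.window
        ready = window_full and sum(self.history) / self.window >= self.promote_at
        if ready and self.threshold < self.end:
            self.threshold = min(self.end, round(self.threshold + self.step, 10))
            self.history.clear()      # re-measure under the new criterion
        return hit

sched = IoUThresholdScheduler()
rewards = [sched.update(0.5) for _ in range(4)]
print(rewards, sched.threshold)
```

Starting from a loose threshold keeps rewards dense early in training, and clearing the window after each promotion prevents stale successes from triggering a second tightening immediately.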
Abstract:While Vision-Language Models (VLMs) have garnered increasing attention in the AI community due to their promising practical applications, they exhibit persistent hallucination issues, generating outputs misaligned with visual inputs. Recent studies attribute these hallucinations to VLMs' over-reliance on linguistic priors and insufficient visual feature integration, and propose heuristic decoding calibration strategies to mitigate them. However, the non-trainable nature of these strategies inherently limits their optimization potential. To this end, we propose ALEAHallu, an adversarial parametric editing framework for hallucination mitigation in VLMs, which follows an Activate-Locate-Edit Adversarially (ALEA) paradigm. Specifically, we first construct an activation dataset that comprises grounded responses (positive samples attentively anchored in visual features) and hallucinatory responses (negative samples reflecting LLM prior bias and internal knowledge artifacts). Next, we identify critical hallucination-prone parameter clusters by analyzing the differential hidden states of response pairs. Then, these clusters are fine-tuned using prompts injected with adversarially tuned prefixes that are optimized to maximize visual neglect, thereby forcing the model to prioritize visual evidence over inherent parametric biases. Evaluations on both generative and discriminative VLM tasks demonstrate the significant effectiveness of ALEAHallu in alleviating hallucinations. Our code is available at https://github.com/hujiayu1223/ALEAHallu.
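The "locate" step compares hidden states of grounded and hallucinatory responses. The abstract does not specify the scoring rule, so the sketch below simply assumes that hidden units are ranked by their mean absolute difference across response pairs. The function name `top_k_differential_units` and the toy data are hypothetical.

```python
import numpy as np

def top_k_differential_units(grounded, hallucinated, k):
    """Rank hidden units by how much they differ between paired responses.

    grounded, hallucinated: arrays of shape (N, D), one hidden-state vector
    per response pair (assumed to be pooled over tokens beforehand).
    Returns (sorted indices of the k most different units, per-unit scores).
    """
    grounded = np.asarray(grounded, dtype=float)
    hallucinated = np.asarray(hallucinated, dtype=float)
    score = np.abs(grounded - hallucinated).mean(axis=0)   # shape (D,)
    top = np.argsort(score)[::-1][:k]                      # largest first
    return np.sort(top), score

# Two response pairs, four hidden units.
grounded = [[1.0, 0.0, 2.0, 0.0], [1.0, 0.0, 3.0, 0.0]]
hallucinated = [[1.0, 1.0, 0.0, 0.0], [1.0, 2.0, 0.0, 0.0]]
units, scores = top_k_differential_units(grounded, hallucinated, k=2)
print(units, scores)
```

In an actual model, the selected indices would have to be mapped back to the parameters that produce those units before any fine-tuning; that mapping is outside the scope of this sketch.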




Abstract:In this paper, we investigate integrated sensing and communication (ISAC) in a cell-free (CF) multiple-input multiple-output (MIMO) network with single-antenna access points (APs), where each AP functions either as a transmitter for both sensing and communication or as a receiver for target-reflected signals. We derive closed-form Cramer-Rao lower bounds (CRLBs) for location and velocity estimation under arbitrary power allocation ratios, assuming the radar cross-section (RCS) is deterministic and unknown over the observation interval. A power allocation optimization problem is formulated to maximize the communication signal-to-interference-plus-noise ratio (SINR), subject to CRLB-based sensing constraints and per-transmitter power limits. To solve the resulting nonlinear and non-convex problem, we propose a penalty function and projection-based modified conjugate gradient algorithm with inexact line search (PP-MCG-ILS), and an alternative method based on a modified steepest descent approach (PP-MSD-ILS). Additionally, for power minimization in pure sensing scenarios, we introduce a penalty function-based normalized conjugate gradient algorithm (P-NCG-ILS). We analyze the convergence behavior and qualitatively compare the computational complexity of the proposed algorithms. Simulation results confirm the accuracy of the derived CRLBs and demonstrate the effectiveness of the proposed power allocation strategies in enhancing both sensing and overall ISAC performance.
Abstract:This paper investigates joint location and velocity estimation, along with their fundamental performance bounds analysis, in a cell-free multi-input multi-output (MIMO) integrated sensing and communication (ISAC) system. First, unlike existing studies that derive likelihood functions for target parameter estimation using continuous received signals, we formulate the maximum likelihood estimation (MLE) for radar sensing based on discrete received signals at a given sampling rate. Second, leveraging the proposed MLEs, we derive closed-form Cramer-Rao lower bounds (CRLBs) for joint location and velocity estimation in both single-target and multiple-target scenarios. Third, to enhance computational efficiency, we propose approximate CRLBs and conduct an in-depth accuracy analysis. Additionally, we thoroughly examine the impact of sampling rate, squared effective bandwidth, and time width on CRLB performance. For multiple-target scenarios, the concepts of safety distance and safety velocity are introduced to characterize conditions under which the CRLBs for multiple targets converge to their single target counterparts. Finally, extensive simulations are conducted to verify the accuracy of the proposed CRLBs and the theoretical results using state-of-the-art waveforms, namely orthogonal frequency division multiplexing (OFDM) and orthogonal chirp division multiplexing (OCDM).
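For readers unfamiliar with the bound, the general form of the CRLB that such derivations start from can be stated as follows. This is the standard textbook definition, not the paper's closed-form result. The parameter vector $\boldsymbol{\theta}$ (2D location and velocity) and the symbol $\mathbf{r}$ for the discrete received samples are assumed notation.

```latex
% Assumed parameter vector: 2D target location and velocity.
\boldsymbol{\theta} = [x,\; y,\; v_x,\; v_y]^{\mathsf{T}}

% Fisher information matrix from the log-likelihood of the received samples r.
\mathbf{J}(\boldsymbol{\theta})
  = -\,\mathbb{E}\!\left[
      \frac{\partial^{2} \ln p(\mathbf{r};\boldsymbol{\theta})}
           {\partial \boldsymbol{\theta}\, \partial \boldsymbol{\theta}^{\mathsf{T}}}
    \right]

% Any unbiased estimator satisfies the matrix inequality below, so the
% variance of the i-th estimate is bounded by the i-th diagonal entry.
\operatorname{cov}(\hat{\boldsymbol{\theta}}) \succeq \mathbf{J}^{-1}(\boldsymbol{\theta}),
\qquad
\operatorname{var}(\hat{\theta}_i) \ge \left[\mathbf{J}^{-1}(\boldsymbol{\theta})\right]_{ii}
```

Because the bound involves the inverse of the full matrix, the bound on one parameter generally depends on how strongly it is coupled with the others.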




Abstract:Semi-supervised medical image segmentation (SSMIS) uses consistency learning to regularize model training, which alleviates the burden of pixel-wise manual annotation. However, it often suffers from erroneous supervision caused by low-quality pseudo labels. Vision-Language Models (VLMs) have great potential to enhance pseudo labels by introducing text-prompt-guided multimodal supervision information. They nevertheless face a cross-modal problem: the obtained messages tend to correspond to multiple targets. To address these problems, we propose a Dual Semantic Similarity-Supervised VLM (DuSSS) for SSMIS. Specifically, 1) a Dual Contrastive Learning (DCL) scheme is designed to improve cross-modal semantic consistency by capturing intrinsic representations within each modality and semantic correlations across modalities. 2) To encourage the learning of multiple semantic correspondences, a Semantic Similarity-Supervision (SSS) strategy is proposed and injected into each contrastive learning process in DCL, supervising semantic similarity via distribution-based uncertainty levels. Furthermore, a novel VLM-based SSMIS network is designed to compensate for the quality deficiencies of pseudo labels. It uses the pretrained VLM to generate text-prompt-guided supervision information, refining the pseudo labels for better consistency regularization. Experimental results demonstrate that DuSSS achieves outstanding performance, with Dice scores of 82.52%, 74.61% and 78.03% on three public datasets (QaTa-COV19, BM-Seg and MoNuSeg).
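The Dice scores reported above follow the standard overlap definition, Dice = 2|P ∩ T| / (|P| + |T|), for a predicted mask P and a ground-truth mask T. A minimal sketch for binary masks is shown below. The function name `dice_score` and the convention of returning 1.0 when both masks are empty are assumptions, not details taken from the paper.

```python
import numpy as np

def dice_score(pred, target):
    """Dice coefficient between two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0   # both masks empty: treated as perfect agreement (assumption)
    return float(2.0 * intersection / denom)

pred = [[1, 1], [0, 0]]
target = [[1, 0], [0, 0]]
d = dice_score(pred, target)
print(round(d, 4))
```

A multi-class result is usually obtained by computing this per class and averaging, although the exact averaging protocol varies between benchmarks.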