Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Fengying Xie

Global Context or Local Detail? Adaptive Visual Grounding for Hallucination Mitigation

Apr 27, 2026

Yubo Jiang, Xin Yang, Abudukelimu Wuerkaixi, Zheming Yuan, Xuxin Cheng, Fengying Xie, Zhiguo Jiang, Cao Liu, Ke Zeng, Haopeng Zhang

Abstract:Vision-Language Models (VLMs) are frequently undermined by object hallucination--generating content that contradicts visual reality--due to an over-reliance on linguistic priors. We introduce Positive-and-Negative Decoding (PND), a training-free inference framework that intervenes directly in the decoding process to enforce visual fidelity. PND is motivated by our key finding of a critical attention deficit in VLMs, where visual features are empirically under-weighted. Our framework corrects this via a dual-path contrast: The positive path amplifies salient visual evidence using multi-layer attention to encourage faithful descriptions, directly counteracting the attention deficit. Simultaneously, the negative path identifies and degrades the core object's features to create a strong counterfactual, which penalizes ungrounded, prior-dominant generation. By contrasting the model's outputs from these two perspectives at each step, PND steers generation towards text that is not just linguistically probable, but visually factual. Extensive experiments on benchmarks like POPE, MME, and CHAIR show that PND achieves state-of-the-art performance with up to 6.5% accuracy improvement, substantially reducing object hallucination while also enhancing descriptive detail--all without requiring any model retraining. The method generalizes effectively across diverse VLM architectures including LLaVA, InstructBLIP, InternVL, and Qwen-VL.

* 9 pages, 8 figures, Findings of ACL 2025

Via

Access Paper or Ask Questions

V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization

Apr 22, 2026

Yubo Jiang, Yitong An, Xin Yang, Abudukelimu Wuerkaixi, Xuxin Cheng, Fengying Xie, Zhiguo Jiang, Cao Liu, Ke Zeng, Haopeng Zhang

Abstract:We introduce V-tableR1, a process-supervised reinforcement learning framework that elicits rigorous, verifiable reasoning from multimodal large language models (MLLMs). Current MLLMs trained solely on final outcomes often treat visual reasoning as a black box, relying on superficial pattern matching rather than performing rigorous multi-step inference. While Reinforcement Learning with Verifiable Rewards could enforce transparent reasoning trajectories, extending it to visual domains remains severely hindered by the ambiguity of grounding abstract logic into continuous pixel space. We solve this by leveraging the deterministic grid structure of tables as an ideal visual testbed. V-tableR1 employs a specialized critic VLM to provide dense, step-level feedback on the explicit visual chain-of-thought generated by a policy VLM. To optimize this system, we propose Process-Guided Direct Alignment Policy Optimization (PGPO), a novel RL algorithm integrating process rewards, decoupled policy constraints, and length-aware dynamic sampling. Extensive evaluations demonstrate that V-tableR1 explicitly penalizes visual hallucinations and shortcut guessing. By fundamentally shifting multimodal inference from black-box pattern matching to verifiable logical derivation, V-tableR1 4B establishes state-of-the-art accuracy among open-source models on complex tabular benchmarks, outperforming models up to 18x its size and improving over its SFT baseline

* 15 pages, 4 figures, 4 tables

Via

Access Paper or Ask Questions

Rectification Reimagined: A Unified Mamba Model for Image Correction and Rectangling with Prompts

Dec 21, 2025

Linwei Qiu, Gongzhe Li, Xiaozhe Zhang, Qinlin Sun, Fengying Xie

Figure 1 for Rectification Reimagined: A Unified Mamba Model for Image Correction and Rectangling with Prompts

Figure 2 for Rectification Reimagined: A Unified Mamba Model for Image Correction and Rectangling with Prompts

Figure 3 for Rectification Reimagined: A Unified Mamba Model for Image Correction and Rectangling with Prompts

Figure 4 for Rectification Reimagined: A Unified Mamba Model for Image Correction and Rectangling with Prompts

Abstract:Image correction and rectangling are valuable tasks in practical photography systems such as smartphones. Recent remarkable advancements in deep learning have undeniably brought about substantial performance improvements in these fields. Nevertheless, existing methods mainly rely on task-specific architectures. This significantly restricts their generalization ability and effective application across a wide range of different tasks. In this paper, we introduce the Unified Rectification Framework (UniRect), a comprehensive approach that addresses these practical tasks from a consistent distortion rectification perspective. Our approach incorporates various task-specific inverse problems into a general distortion model by simulating different types of lenses. To handle diverse distortions, UniRect adopts one task-agnostic rectification framework with a dual-component structure: a {Deformation Module}, which utilizes a novel Residual Progressive Thin-Plate Spline (RP-TPS) model to address complex geometric deformations, and a subsequent Restoration Module, which employs Residual Mamba Blocks (RMBs) to counteract the degradation caused by the deformation process and enhance the fidelity of the output image. Moreover, a Sparse Mixture-of-Experts (SMoEs) structure is designed to circumvent heavy task competition in multi-task learning due to varying distortions. Extensive experiments demonstrate that our models have achieved state-of-the-art performance compared with other up-to-date methods.

* AAAI 2026

Via

Access Paper or Ask Questions

Adapt CLIP as Aggregation Instructor for Image Dehazing

Aug 22, 2024

Xiaozhe Zhang, Fengying Xie, Haidong Ding, Linpeng Pan, Zhenwei Shi

Figure 1 for Adapt CLIP as Aggregation Instructor for Image Dehazing

Figure 2 for Adapt CLIP as Aggregation Instructor for Image Dehazing

Figure 3 for Adapt CLIP as Aggregation Instructor for Image Dehazing

Figure 4 for Adapt CLIP as Aggregation Instructor for Image Dehazing

Abstract:Most dehazing methods suffer from limited receptive field and do not explore the rich semantic prior encapsulated in vision-language models, which have proven effective in downstream tasks. In this paper, we introduce CLIPHaze, a pioneering hybrid framework that synergizes the efficient global modeling of Mamba with the prior knowledge and zero-shot capabilities of CLIP to address both issues simultaneously. Specifically, our method employs parallel state space model and window-based self-attention to obtain global contextual dependency and local fine-grained perception, respectively. To seamlessly aggregate information from both paths, we introduce CLIP-instructed Aggregation Module (CAM). For non-homogeneous and homogeneous haze, CAM leverages zero-shot estimated haze density map and high-quality image embedding without degradation information to explicitly and implicitly determine the optimal neural operation range for each pixel, thereby adaptively fusing two paths with different receptive fields. Extensive experiments on various benchmarks demonstrate that CLIPHaze achieves state-of-the-art (SOTA) performance, particularly in non-homogeneous haze. Code will be publicly after acceptance.

* 12 pages, 6 figures

Via

Access Paper or Ask Questions

Pan-cancer Histopathology WSI Pre-training with Position-aware Masked Autoencoder

Jul 10, 2024

Kun Wu, Zhiguo Jiang, Kunming Tang, Jun Shi, Fengying Xie, Wei Wang, Haibo Wu, Yushan Zheng

Figure 1 for Pan-cancer Histopathology WSI Pre-training with Position-aware Masked Autoencoder

Figure 2 for Pan-cancer Histopathology WSI Pre-training with Position-aware Masked Autoencoder

Figure 3 for Pan-cancer Histopathology WSI Pre-training with Position-aware Masked Autoencoder

Figure 4 for Pan-cancer Histopathology WSI Pre-training with Position-aware Masked Autoencoder

Abstract:Large-scale pre-training models have promoted the development of histopathology image analysis. However, existing self-supervised methods for histopathology images focus on learning patch features, while there is still a lack of available pre-training models for WSI-level feature learning. In this paper, we propose a novel self-supervised learning framework for pan-cancer WSI-level representation pre-training with the designed position-aware masked autoencoder (PAMA). Meanwhile, we propose the position-aware cross-attention (PACA) module with a kernel reorientation (KRO) strategy and an anchor dropout (AD) mechanism. The KRO strategy can capture the complete semantic structure and eliminate ambiguity in WSIs, and the AD contributes to enhancing the robustness and generalization of the model. We evaluated our method on 6 large-scale datasets from multiple organs for pan-cancer classification tasks. The results have demonstrated the effectiveness of PAMA in generalized and discriminative WSI representation learning and pan-cancer WSI pre-training. The proposed method was also compared with \R{7} WSI analysis methods. The experimental results have indicated that our proposed PAMA is superior to the state-of-the-art methods.The code and checkpoints are available at https://github.com/WkEEn/PAMA.

Via

Access Paper or Ask Questions

Prototypical Information Bottlenecking and Disentangling for Multimodal Cancer Survival Prediction

Jan 03, 2024

Yilan Zhang, Yingxue Xu, Jianqi Chen, Fengying Xie, Hao Chen

Figure 1 for Prototypical Information Bottlenecking and Disentangling for Multimodal Cancer Survival Prediction

Figure 2 for Prototypical Information Bottlenecking and Disentangling for Multimodal Cancer Survival Prediction

Figure 3 for Prototypical Information Bottlenecking and Disentangling for Multimodal Cancer Survival Prediction

Figure 4 for Prototypical Information Bottlenecking and Disentangling for Multimodal Cancer Survival Prediction

Abstract:Multimodal learning significantly benefits cancer survival prediction, especially the integration of pathological images and genomic data. Despite advantages of multimodal learning for cancer survival prediction, massive redundancy in multimodal data prevents it from extracting discriminative and compact information: (1) An extensive amount of intra-modal task-unrelated information blurs discriminability, especially for gigapixel whole slide images (WSIs) with many patches in pathology and thousands of pathways in genomic data, leading to an ``intra-modal redundancy" issue. (2) Duplicated information among modalities dominates the representation of multimodal data, which makes modality-specific information prone to being ignored, resulting in an ``inter-modal redundancy" issue. To address these, we propose a new framework, Prototypical Information Bottlenecking and Disentangling (PIBD), consisting of Prototypical Information Bottleneck (PIB) module for intra-modal redundancy and Prototypical Information Disentanglement (PID) module for inter-modal redundancy. Specifically, a variant of information bottleneck, PIB, is proposed to model prototypes approximating a bunch of instances for different risk levels, which can be used for selection of discriminative instances within modality. PID module decouples entangled multimodal data into compact distinct components: modality-common and modality-specific knowledge, under the guidance of the joint prototypical distribution. Extensive experiments on five cancer benchmark datasets demonstrated our superiority over other methods.

Via

Access Paper or Ask Questions

ECL: Class-Enhancement Contrastive Learning for Long-tailed Skin Lesion Classification

Jul 09, 2023

Yilan Zhang, Jianqi Chen, Ke Wang, Fengying Xie

Figure 1 for ECL: Class-Enhancement Contrastive Learning for Long-tailed Skin Lesion Classification

Figure 2 for ECL: Class-Enhancement Contrastive Learning for Long-tailed Skin Lesion Classification

Figure 3 for ECL: Class-Enhancement Contrastive Learning for Long-tailed Skin Lesion Classification

Figure 4 for ECL: Class-Enhancement Contrastive Learning for Long-tailed Skin Lesion Classification

Abstract:Skin image datasets often suffer from imbalanced data distribution, exacerbating the difficulty of computer-aided skin disease diagnosis. Some recent works exploit supervised contrastive learning (SCL) for this long-tailed challenge. Despite achieving significant performance, these SCL-based methods focus more on head classes, yet ignoring the utilization of information in tail classes. In this paper, we propose class-Enhancement Contrastive Learning (ECL), which enriches the information of minority classes and treats different classes equally. For information enhancement, we design a hybrid-proxy model to generate class-dependent proxies and propose a cycle update strategy for parameters optimization. A balanced-hybrid-proxy loss is designed to exploit relations between samples and proxies with different classes treated equally. Taking both "imbalanced data" and "imbalanced diagnosis difficulty" into account, we further present a balanced-weighted cross-entropy loss following curriculum learning schedule. Experimental results on the classification of imbalanced skin lesion data have demonstrated the superiority and effectiveness of our method.

Via

Access Paper or Ask Questions

TFormer: A throughout fusion transformer for multi-modal skin lesion diagnosis

Nov 21, 2022

Yilan Zhang, Fengying Xie, Jianqi Chen, Jie Liu

Figure 1 for TFormer: A throughout fusion transformer for multi-modal skin lesion diagnosis

Figure 2 for TFormer: A throughout fusion transformer for multi-modal skin lesion diagnosis

Figure 3 for TFormer: A throughout fusion transformer for multi-modal skin lesion diagnosis

Figure 4 for TFormer: A throughout fusion transformer for multi-modal skin lesion diagnosis

Abstract:Multi-modal skin lesion diagnosis (MSLD) has achieved remarkable success by modern computer-aided diagnosis technology based on deep convolutions. However, the information aggregation across modalities in MSLD remains challenging due to severity unaligned spatial resolution (dermoscopic image and clinical image) and heterogeneous data (dermoscopic image and patients' meta-data). Limited by the intrinsic local attention, most recent MSLD pipelines using pure convolutions struggle to capture representative features in shallow layers, thus the fusion across different modalities is usually done at the end of the pipelines, even at the last layer, leading to an insufficient information aggregation. To tackle the issue, we introduce a pure transformer-based method, which we refer to as ``Throughout Fusion Transformer (TFormer)", for sufficient information intergration in MSLD. Different from the existing approaches with convolutions, the proposed network leverages transformer as feature extraction backbone, bringing more representative shallow features. We then carefully design a stack of dual-branch hierarchical multi-modal transformer (HMT) blocks to fuse information across different image modalities in a stage-by-stage way. With the aggregated information of image modalities, a multi-modal transformer post-fusion (MTP) block is designed to integrate features across image and non-image data. Such a strategy that information of the image modalities is firstly fused then the heterogeneous ones enables us to better divide and conquer the two major challenges while ensuring inter-modality dynamics are effectively modeled. Experiments conducted on the public Derm7pt dataset validate the superiority of the proposed method. Our TFormer outperforms other state-of-the-art methods. Ablation experiments also suggest the effectiveness of our designs.

Via

Access Paper or Ask Questions

A Rotation Meanout Network with Invariance for Dermoscopy Image Classification and Retrieval

Aug 01, 2022

Yilan Zhang, Fengying Xie, Xuedong Song, Hangning Zhou, Yiguang Yang, Haopeng Zhang, Jie Liu

Figure 1 for A Rotation Meanout Network with Invariance for Dermoscopy Image Classification and Retrieval

Figure 2 for A Rotation Meanout Network with Invariance for Dermoscopy Image Classification and Retrieval

Figure 3 for A Rotation Meanout Network with Invariance for Dermoscopy Image Classification and Retrieval

Figure 4 for A Rotation Meanout Network with Invariance for Dermoscopy Image Classification and Retrieval

Abstract:The computer-aided diagnosis (CAD) system can provide a reference basis for the clinical diagnosis of skin diseases. Convolutional neural networks (CNNs) can not only extract visual elements such as colors and shapes but also semantic features. As such they have made great improvements in many tasks of dermoscopy images. The imaging of dermoscopy has no main direction, indicating that there are a large number of skin lesion target rotations in the datasets. However, CNNs lack anti-rotation ability, which is bound to affect the feature extraction ability of CNNs. We propose a rotation meanout (RM) network to extract rotation invariance features from dermoscopy images. In RM, each set of rotated feature maps corresponds to a set of weight-sharing convolution outputs and they are fused using meanout operation to obtain the final feature maps. Through theoretical derivation, the proposed RM network is rotation-equivariant and can extract rotation-invariant features when being followed by the global average pooling (GAP) operation. The extracted rotation-invariant features can better represent the original data in classification and retrieval tasks for dermoscopy images. The proposed RM is a general operation, which does not change the network structure or increase any parameter, and can be flexibly embedded in any part of CNNs. Extensive experiments are conducted on a dermoscopy image dataset. The results show our method outperforms other anti-rotation methods and achieves great improvements in dermoscopy image classification and retrieval tasks, indicating the potential of rotation invariance in the field of dermoscopy images.

Via

Access Paper or Ask Questions

Kernel Attention Transformer (KAT) for Histopathology Whole Slide Image Classification

Jun 27, 2022

Yushan Zheng, Jun Li, Jun Shi, Fengying Xie, Zhiguo Jiang

Figure 1 for Kernel Attention Transformer (KAT) for Histopathology Whole Slide Image Classification

Figure 2 for Kernel Attention Transformer (KAT) for Histopathology Whole Slide Image Classification

Figure 3 for Kernel Attention Transformer (KAT) for Histopathology Whole Slide Image Classification

Figure 4 for Kernel Attention Transformer (KAT) for Histopathology Whole Slide Image Classification

Abstract:Transformer has been widely used in histopathology whole slide image (WSI) classification for the purpose of tumor grading, prognosis analysis, etc. However, the design of token-wise self-attention and positional embedding strategy in the common Transformer limits the effectiveness and efficiency in the application to gigapixel histopathology images. In this paper, we propose a kernel attention Transformer (KAT) for histopathology WSI classification. The information transmission of the tokens is achieved by cross-attention between the tokens and a set of kernels related to a set of positional anchors on the WSI. Compared to the common Transformer structure, the proposed KAT can better describe the hierarchical context information of the local regions of the WSI and meanwhile maintains a lower computational complexity. The proposed method was evaluated on a gastric dataset with 2040 WSIs and an endometrial dataset with 2560 WSIs, and was compared with 6 state-of-the-art methods. The experimental results have demonstrated the proposed KAT is effective and efficient in the task of histopathology WSI classification and is superior to the state-of-the-art methods. The code is available at https://github.com/zhengyushan/kat.

* accepted for MICCAI 2022

Via

Access Paper or Ask Questions