Jiaxiang Liu

Capability Localization: Capabilities Can be Localized rather than Individual Knowledge

Feb 28, 2025

Xiusheng Huang, Jiaxiang Liu, Yequan Wang, Jun Zhao, Kang Liu

Abstract: Large-scale language models have achieved superior performance on natural language processing tasks; however, it is still unclear how model parameters affect performance improvement. Previous studies assumed that individual knowledge is stored in local parameters, but the proposed storage forms (dispersed parameters, parameter layers, or parameter chains) are not unified. Through fidelity and reliability evaluation experiments, we found that individual knowledge cannot be localized. We then constructed a dataset for decoupling experiments and discovered the potential for localizing data commonalities. To further reveal this phenomenon, this paper proposes a Commonality Neuron Localization (CNL) method, which successfully locates commonality neurons and achieves a neuron overlap rate of 96.42% on the GSM8K dataset. Finally, we demonstrate through cross-data experiments that commonality neurons are a collection of capability neurons that can enhance performance. Our code is available at https://github.com/nlpkeg/Capability-Neuron-Localization.
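
The abstract above reports a neuron overlap rate but does not define it. As a minimal sketch, assuming the rate is the share of located neurons common to two localization runs (intersection size divided by the smaller set size), it could be computed as follows. The function name `overlap_rate` and the neuron index sets are hypothetical and are not taken from the paper or its repository.

```python
def overlap_rate(neurons_a, neurons_b):
    """Fraction of the smaller neuron set that also appears in the other set.

    Assumption: this intersection-over-minimum definition is illustrative;
    the paper may define its overlap rate differently.
    """
    a, b = set(neurons_a), set(neurons_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


# Hypothetical (layer, index) pairs located from two different data samples.
run_1 = {(3, 17), (3, 42), (7, 5), (9, 128)}
run_2 = {(3, 17), (3, 42), (7, 5), (11, 64)}
print(overlap_rate(run_1, run_2))
```

A Jaccard index (intersection over union) would be an equally plausible choice; it gives a lower value whenever the two sets differ in size.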

Fair-MoE: Fairness-Oriented Mixture of Experts in Vision-Language Models

Feb 10, 2025

Peiran Wang, Linjie Tong, Jiaxiang Liu, Zuozhu Liu

Abstract: Fairness is a fundamental principle in medical ethics. Vision Language Models (VLMs) have shown significant potential in the medical field due to their ability to leverage both visual and linguistic contexts, reducing the need for large datasets and enabling the performance of complex tasks. However, the exploration of fairness within VLM applications remains limited. Applying VLMs without a comprehensive analysis of fairness could lead to concerns about equal treatment opportunities and diminish public trust in medical deep learning models. To build trust in medical VLMs, we propose Fair-MoE, a model specifically designed to ensure both fairness and effectiveness. Fair-MoE comprises two key components: the Fairness-Oriented Mixture of Experts (FO-MoE) and the Fairness-Oriented Loss (FOL). FO-MoE is designed to leverage the expertise of various specialists to filter out biased patch embeddings and use an ensemble approach to extract more equitable information relevant to specific tasks. FOL is a novel fairness-oriented loss function that not only minimizes the distances between different attributes but also optimizes the differences in the dispersion of various attributes' distributions. Extensive experiments demonstrate the effectiveness and fairness of Fair-MoE. Tested on the Harvard-FairVLMed dataset, Fair-MoE showed improvements in both fairness and accuracy across all four attributes. Code will be publicly available.

KPL: Training-Free Medical Knowledge Mining of Vision-Language Models

Jan 20, 2025

Jiaxiang Liu, Tianxiang Hu, Jiawei Du, Ruiyuan Zhang, Joey Tianyi Zhou, Zuozhu Liu

Abstract: Visual Language Models such as CLIP excel in image recognition due to extensive image-text pre-training. However, applying CLIP inference to zero-shot classification, particularly for medical image diagnosis, faces challenges due to: 1) the inadequacy of representing image classes solely with single category names; 2) the modal gap between the visual and text spaces generated by CLIP encoders. Despite attempts to enrich disease descriptions with large language models, the lack of class-specific knowledge often leads to poor performance. In addition, empirical evidence suggests that existing proxy learning methods for zero-shot image classification on natural image datasets exhibit instability when applied to medical datasets. To tackle these challenges, we introduce Knowledge Proxy Learning (KPL) to mine knowledge from CLIP. KPL is designed to leverage CLIP's multimodal understanding for medical image classification through Text Proxy Optimization and Multimodal Proxy Learning. Specifically, KPL retrieves image-relevant knowledge descriptions from the constructed knowledge-enhanced base to enrich semantic text proxies. It then harnesses input images and these descriptions, encoded via CLIP, to stably generate multimodal proxies that boost the zero-shot classification performance. Extensive experiments conducted on both medical and natural image datasets demonstrate that KPL enables effective zero-shot image classification, outperforming all baselines. These findings highlight the great potential of this paradigm of mining knowledge from CLIP for medical image classification and broader areas.

* AAAI(Oral)
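
The KPL abstract above contrasts single category names with richer text proxies. The sketch below illustrates only the generic idea of zero-shot classification with a per-class proxy built from several description embeddings; it is not KPL's Text Proxy Optimization or Multimodal Proxy Learning. The helper names (`normalize`, `text_proxy`, `classify`) are hypothetical, and the 3-dimensional vectors are made-up stand-ins for CLIP encoder outputs.

```python
import math


def normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)


def text_proxy(description_embeddings):
    """Average the unit-normalized description embeddings of one class."""
    unit = [normalize(e) for e in description_embeddings]
    dim = len(unit[0])
    mean = [sum(e[i] for e in unit) / len(unit) for i in range(dim)]
    return normalize(mean)


def classify(image_embedding, proxies):
    """Return the class whose proxy has the highest cosine similarity."""
    img = normalize(image_embedding)
    scores = {c: sum(a * b for a, b in zip(img, p)) for c, p in proxies.items()}
    return max(scores, key=scores.get), scores


# Toy embeddings: two descriptions per class (assumed, not real CLIP outputs).
proxies = {
    "pneumonia": text_proxy([[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]),
    "normal": text_proxy([[0.1, 0.9, 0.0], [0.0, 0.8, 0.3]]),
}
label, scores = classify([0.7, 0.2, 0.1], proxies)
print(label)
```

Averaging several descriptions gives each class a proxy that depends less on the wording of any single category name, which is the first challenge the abstract lists.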

MedCoT: Medical Chain of Thought via Hierarchical Expert

Dec 18, 2024

Jiaxiang Liu, Yuan Wang, Jiawei Du, Joey Tianyi Zhou, Zuozhu Liu

Abstract: Artificial intelligence has advanced in Medical Visual Question Answering (Med-VQA), but prevalent research tends to focus on the accuracy of the answers, often overlooking the reasoning paths and interpretability, which are crucial in clinical settings. In addition, current Med-VQA algorithms, which typically rely on single models, lack the robustness needed for real-world medical diagnostics, which usually require collaborative expert evaluation. To address these shortcomings, this paper presents MedCoT, a novel hierarchical expert verification reasoning chain method designed to enhance interpretability and accuracy in biomedical imaging inquiries. MedCoT is predicated on two principles: the necessity for explicit reasoning paths in Med-VQA and the requirement for multi-expert review to formulate accurate conclusions. The methodology involves an Initial Specialist proposing diagnostic rationales, followed by a Follow-up Specialist who validates these rationales; finally, a consensus is reached through a vote among a sparse Mixture of Experts within the locally deployed Diagnostic Specialist, which then provides the definitive diagnosis. Experimental evaluations on four standard Med-VQA datasets demonstrate that MedCoT surpasses existing state-of-the-art approaches, providing significant improvements in performance and interpretability.

* EMNLP 2024
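
The MedCoT abstract above describes a final vote among a sparse Mixture of Experts. As a rough sketch of what such a vote could look like, assuming top-k gating with gate-weighted votes (the abstract does not specify the mechanism), consider the following. The function `sparse_vote`, the gate scores, and the answers are all hypothetical.

```python
from collections import defaultdict


def sparse_vote(gate_scores, expert_answers, k=2):
    """Let the k highest-scoring experts vote, weighting votes by gate score."""
    if len(gate_scores) != len(expert_answers):
        raise ValueError("one gate score per expert is required")
    # Keep only the k experts the gate trusts most (the "sparse" part).
    top = sorted(range(len(gate_scores)), key=lambda i: gate_scores[i], reverse=True)[:k]
    totals = defaultdict(float)
    for i in top:
        totals[expert_answers[i]] += gate_scores[i]
    return max(totals, key=totals.get)


# Four hypothetical experts answering a closed-ended Med-VQA question.
gate_scores = [0.10, 0.35, 0.30, 0.25]
expert_answers = ["no", "yes", "no", "no"]
print(sparse_vote(gate_scores, expert_answers, k=3))
```

With `k=1` the single most confident expert decides; larger `k` lets agreeing experts outvote it.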

PVP: Polar Representation Boost for 3D Semantic Occupancy Prediction

Dec 10, 2024

Yujing Xue, Jiaxiang Liu, Jiawei Du, Joey Tianyi Zhou

Abstract: Recently, polar coordinate-based representations have shown promise for 3D perceptual tasks. Compared to Cartesian methods, polar grids provide a viable alternative, offering better detail preservation in nearby spaces while covering larger areas. However, they face feature distortion due to non-uniform division. To address these issues, we introduce the Polar Voxel Occupancy Predictor (PVP), a novel 3D multi-modal predictor that operates in polar coordinates. PVP features two key design elements to overcome distortion: a Global Represent Propagation (GRP) module that integrates global spatial data into 3D volumes, and a Plane Decomposed Convolution (PD-Conv) that simplifies 3D distortions into 2D convolutions. These innovations enable PVP to outperform existing methods, achieving significant improvements in mIoU and IoU metrics on the OpenOccupancy dataset.
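
The non-uniform division mentioned in the abstract above can be seen in a small sketch of polar voxelization. It assumes a cylindrical grid centred on the sensor with fixed radial, angular, and height steps; the function names and bin sizes are hypothetical and unrelated to PVP's actual configuration.

```python
import math


def polar_voxel_index(x, y, z, r_step=0.5, n_theta=8, z_step=0.5):
    """Map a Cartesian point to (radial, angular, height) bin indices.

    Assumes z >= 0 and a grid centred on the sensor at the origin.
    """
    r = math.hypot(x, y)
    theta = math.atan2(y, x) % (2 * math.pi)  # angle in [0, 2*pi)
    theta_bin = int(theta // (2 * math.pi / n_theta)) % n_theta
    return (int(r // r_step), theta_bin, int(z // z_step))


def arc_width(r, n_theta=8):
    """Arc length of one angular bin at radius r; it grows linearly with r."""
    return r * 2 * math.pi / n_theta


print(polar_voxel_index(1.0, 0.5, 0.2))
print(arc_width(1.0), arc_width(10.0))
```

Because the arc width scales with the radius, cells near the sensor are small (fine detail) while distant cells are wide, so the same object occupies a different number of cells depending on its range.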

Reasons and Solutions for the Decline in Model Performance after Editing

Oct 31, 2024

Xiusheng Huang, Jiaxiang Liu, Yequan Wang, Kang Liu

Abstract: Knowledge editing technology has received widespread attention for low-cost updates of incorrect or outdated knowledge in large-scale language models. However, recent research has found that edited models often exhibit varying degrees of performance degradation. The reasons behind this phenomenon and potential solutions have not yet been provided. To investigate the reasons for the performance decline of the edited model and optimize the editing method, this work explores the underlying reasons from both data and model perspectives. Specifically, 1) from a data perspective, this paper first constructs a Multi-Question Dataset (MQD) to evaluate the impact of different types of editing data on model performance. Experiments show that the performance of the edited model is mainly affected by the diversity of editing targets and sequence length. 2) From a model perspective, this paper explores the factors that affect the performance of edited models. The results indicate a strong correlation between the L1-norm of the edited layer and the editing accuracy, and clarify that this is an important factor leading to the bottleneck of editing performance. Finally, to improve the performance of the edited model, this paper further proposes a Dump for Sequence (D4S) method, which successfully overcomes the previous editing bottleneck by reducing the L1-norm of the edited layer, allowing users to perform multiple effective edits while minimizing model damage. Our code is available at https://github.com/nlpkeg/D4S.

* NeurIPS 2024
* 14 pages, 8 figures
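
The abstract above ties editing performance to the L1-norm of the edited layer. The sketch below shows only how that quantity could be tracked across sequential edits; it does not implement D4S. It assumes the entry-wise L1 norm (sum of absolute weights) and additive weight updates, and the matrices and function names are hypothetical.

```python
def l1_norm(matrix):
    """Entry-wise L1 norm: the sum of absolute values of all weights."""
    return sum(abs(w) for row in matrix for w in row)


def apply_update(matrix, delta):
    """Return matrix + delta (an assumed additive form of a weight edit)."""
    return [[w + d for w, d in zip(row, drow)] for row, drow in zip(matrix, delta)]


# A toy 2x2 "layer" and two sequential edits (made-up numbers).
weights = [[0.5, -0.25], [0.125, 0.0]]
deltas = [
    [[0.5, 0.0], [0.0, 0.25]],
    [[0.25, -0.5], [0.0, 0.25]],
]

history = [l1_norm(weights)]
for delta in deltas:
    weights = apply_update(weights, delta)
    history.append(l1_norm(weights))
print(history)
```

In this toy case the norm rises with every edit, which is the kind of trend one would monitor when checking whether sequential edits are inflating a layer.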

R-LLaVA: Improving Med-VQA Understanding through Visual Region of Interest

Oct 27, 2024

Xupeng Chen, Zhixin Lai, Kangrui Ruan, Shichu Chen, Jiaxiang Liu, Zuozhu Liu

Abstract: Artificial intelligence has made significant strides in medical visual question answering (Med-VQA), yet prevalent studies often interpret images holistically, overlooking the visual regions of interest that may contain crucial information. Such regions potentially align with a doctor's prior knowledge, which can be incorporated with minimal annotations (e.g., bounding boxes). To address this gap, this paper introduces R-LLaVA, designed to enhance biomedical VQA understanding by integrating simple medical annotations as prior knowledge directly into the image space through CLIP. These annotated visual regions of interest are then fed into the LLaVA model during training, aiming to enrich the model's understanding of biomedical queries. Experimental evaluation on four standard Med-VQA datasets demonstrates R-LLaVA's superiority over existing state-of-the-art (SoTA) methods. Additionally, to verify the model's capability in visual comprehension, a novel multiple-choice medical visual understanding dataset is introduced, confirming the positive impact of focusing on visual regions of interest in advancing biomedical VQA understanding.

* 11 pages, 7 figures, submitted to NAACL 2025

SGW-based Multi-Task Learning in Vision Tasks

Oct 03, 2024

Ruiyuan Zhang, Yuyao Chen, Yuchi Huo, Jiaxiang Liu, Dianbing Xi, Jie Liu, Chao Wu

Abstract: Multi-task learning (MTL) is a multi-target optimization task. Neural networks try to realize each target using a shared interpretative space within MTL. However, as the scale of datasets expands and the complexity of tasks increases, knowledge sharing becomes increasingly challenging. In this paper, we first re-examine previous cross-attention MTL methods from the perspective of noise. We theoretically analyze this issue and identify it as a flaw in the cross-attention mechanism. To address this issue, we propose an information bottleneck knowledge extraction module (KEM). This module aims to reduce inter-task interference by constraining the flow of information, thereby reducing computational complexity. Furthermore, we employ neural collapse to stabilize the knowledge-selection process: before the features are input to KEM, we project them into ETF space. This mapping makes our method more robust. We implemented this method and conducted comparative experiments on multiple datasets. The results demonstrate that our approach significantly outperforms existing methods in multi-task learning.

* ACCV2024
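
The abstract above mentions projecting features into ETF space. In the neural collapse literature, ETF usually refers to a simplex equiangular tight frame: k unit vectors whose pairwise cosine similarity is exactly -1/(k-1). The sketch below constructs such a frame in plain Python; how the paper actually uses it is not stated in the abstract, and the function names here are hypothetical.

```python
import math


def simplex_etf(k):
    """Return k unit vectors in R^k forming a simplex equiangular tight frame."""
    if k < 2:
        raise ValueError("k must be at least 2")
    scale = math.sqrt(k / (k - 1))
    # Row i is scale * (e_i - mean of all basis vectors).
    return [[scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
            for i in range(k)]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


vectors = simplex_etf(4)
print(round(dot(vectors[0], vectors[0]), 6), round(dot(vectors[0], vectors[1]), 6))
```

Every pair of distinct vectors is separated by the same angle, so anchoring features to such a frame spreads classes or tasks as far apart as possible.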

MedThink: Explaining Medical Visual Question Answering via Multimodal Decision-Making Rationale

Apr 18, 2024

Xiaotang Gai, Chenyi Zhou, Jiaxiang Liu, Yang Feng, Jian Wu, Zuozhu Liu

Abstract: Medical Visual Question Answering (MedVQA), which offers language responses to image-based medical inquiries, represents a challenging task and a significant advancement in healthcare. It helps medical experts swiftly interpret medical images, thereby enabling faster and more accurate diagnoses. However, the model interpretability and transparency of existing MedVQA solutions are often limited, posing challenges in understanding their decision-making processes. To address this issue, we devise a semi-automated annotation process to streamline data preparation and build new benchmark MedVQA datasets, R-RAD and R-SLAKE. The R-RAD and R-SLAKE datasets provide intermediate medical decision-making rationales generated by multimodal large language models and human annotations for question-answering pairs in existing MedVQA datasets, i.e., VQA-RAD and SLAKE. Moreover, we design a novel framework that finetunes lightweight pretrained generative models by incorporating medical decision-making rationales into the training process. The framework includes three distinct strategies to generate decision outcomes and corresponding rationales, thereby clearly showcasing the medical decision-making process during reasoning. Extensive experiments demonstrate that our method can achieve an accuracy of 83.5% on R-RAD and 86.3% on R-SLAKE, significantly outperforming existing state-of-the-art baselines. Dataset and code will be released.

Enhancing Large Language Models with Pseudo- and Multisource- Knowledge Graphs for Open-ended Question Answering

Feb 15, 2024

Jiaxiang Liu, Tong Zhou, Yubo Chen, Kang Liu, Jun Zhao

Abstract: Mitigating the hallucinations of Large Language Models (LLMs) and enhancing them is a crucial task. Although some existing methods employ model self-enhancement techniques, they fall short of effectively addressing unknown factual hallucinations. Knowledge Graph (KG) enhancement approaches fail to simultaneously address generalization across different KG sources and the enhancement of open-ended question answering. To tackle these limitations, a framework that combines Pseudo-Graph Generation and Atomic Knowledge Verification is proposed. Pseudo-Graph Generation enables KG-based enhancement of LLMs in an open-ended question-answering setting. Atomic Knowledge Verification utilizes atomic-level knowledge querying and verification to achieve generalizability under different KG sources. Compared to the baseline, this approach yields a minimum improvement of 11.5 in the ROUGE-L score for open-ended questions. For precise questions, we observe a minimum accuracy improvement of 7.5. Moreover, the framework is shown to generalize across different KG sources. In summary, our results pave the way for enhancing LLMs by incorporating Pseudo- and Multisource-KGs, particularly in the context of open-ended questions.
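
The abstract above reports gains in ROUGE-L, a metric based on the longest common subsequence (LCS) between a generated answer and a reference. The sketch below computes a sentence-level ROUGE-L F1 on whitespace tokens; it assumes lowercasing and no stemming, so its values may differ from those of the evaluation toolkit the authors used. The function names and example sentences are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, else carry the best neighbour.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 on whitespace tokens (no stemming or other preprocessing)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f1("the cat sat on the mat", "the cat is on the mat"))
```

Because the LCS only requires tokens to appear in the same order, not contiguously, the score rewards answers that preserve the reference's sentence-level structure even when extra words are interleaved.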
