Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Jinlong He

RAFM: Retrieval-Augmented Flow Matching for Unpaired CBCT-to-CT Translation

Feb 28, 2026

Xianhao Zhou, Jianghao Wu, Lanfeng Zhong, Ku Zhao, Jinlong He, Shaoting Zhang, Guotai Wang

Abstract:Cone-beam CT (CBCT) is routinely acquired in radiotherapy but suffers from severe artifacts and unreliable Hounsfield Unit (HU) values, limiting its direct use for dose calculation. Synthetic CT (sCT) generation from CBCT is therefore an important task, yet paired CBCT--CT data are often unavailable or unreliable due to temporal gaps, anatomical variation, and registration errors. In this work, we introduce rectified flow (RF) into unpaired CBCT-to-CT translation in medical imaging. Although RF is theoretically compatible with unpaired learning through distribution-level coupling and deterministic transport, its practical effectiveness under small medical datasets and limited batch sizes remains underexplored. Direct application with random or batch-local pseudo pairing can produce unstable supervision due to semantically mismatched endpoint samples. To address this challenge, we propose Retrieval-Augmented Flow Matching (RAFM), which adapts RF to the medical setting by constructing retrieval-guided pseudo pairs using a frozen DINOv3 encoder and a global CT memory bank. This strategy improves empirical coupling quality and stabilizes unpaired flow-based training. Experiments on SynthRAD2023 under a strict subject-level true-unpaired protocol show that RAFM outperforms existing methods across FID, MAE, SSIM, PSNR, and SegScore. The code is available at https://github.com/HiLab-git/RAFM.git.

Via

Access Paper or Ask Questions

DINOv3-Guided Cross Fusion Framework for Semantic-aware CT generation from MRI and CBCT

Nov 15, 2025

Xianhao Zhou, Jianghao Wu, Ku Zhao, Jinlong He, Huangxuan Zhao, Lei Chen, Shaoting Zhang, Guotai Wang

Figure 1 for DINOv3-Guided Cross Fusion Framework for Semantic-aware CT generation from MRI and CBCT

Figure 2 for DINOv3-Guided Cross Fusion Framework for Semantic-aware CT generation from MRI and CBCT

Figure 3 for DINOv3-Guided Cross Fusion Framework for Semantic-aware CT generation from MRI and CBCT

Figure 4 for DINOv3-Guided Cross Fusion Framework for Semantic-aware CT generation from MRI and CBCT

Abstract:Generating synthetic CT images from CBCT or MRI has a potential for efficient radiation dose planning and adaptive radiotherapy. However, existing CNN-based models lack global semantic understanding, while Transformers often overfit small medical datasets due to high model capacity and weak inductive bias. To address these limitations, we propose a DINOv3-Guided Cross Fusion (DGCF) framework that integrates a frozen self-supervised DINOv3 Transformer with a trainable CNN encoder-decoder. It hierarchically fuses global representation of Transformer and local features of CNN via a learnable cross fusion module, achieving balanced local appearance and contextual representation. Furthermore, we introduce a Multi-Level DINOv3 Perceptual (MLDP) loss that encourages semantic similarity between synthetic CT and the ground truth in DINOv3's feature space. Experiments on the SynthRAD2023 pelvic dataset demonstrate that DGCF achieved state-of-the-art performance in terms of MS-SSIM, PSNR and segmentation-based metrics on both MRI$\rightarrow$CT and CBCT$\rightarrow$CT translation tasks. To the best of our knowledge, this is the first work to employ DINOv3 representations for medical image translation, highlighting the potential of self-supervised Transformer guidance for semantic-aware CT synthesis. The code is available at https://github.com/HiLab-git/DGCF.

Via

Access Paper or Ask Questions

Parameter-Efficient Fine-Tuning Medical Multimodal Large Language Models for Medical Visual Grounding

Oct 31, 2024

Jinlong He, Pengfei Li, Gang Liu, Shenjun Zhong

Figure 1 for Parameter-Efficient Fine-Tuning Medical Multimodal Large Language Models for Medical Visual Grounding

Figure 2 for Parameter-Efficient Fine-Tuning Medical Multimodal Large Language Models for Medical Visual Grounding

Figure 3 for Parameter-Efficient Fine-Tuning Medical Multimodal Large Language Models for Medical Visual Grounding

Figure 4 for Parameter-Efficient Fine-Tuning Medical Multimodal Large Language Models for Medical Visual Grounding

Abstract:Multimodal Large Language Models (MLLMs) inherit the superior text understanding capabilities of LLMs and extend these capabilities to multimodal scenarios. These models achieve excellent results in the general domain of multimodal tasks. However, in the medical domain, the substantial training costs and the requirement for extensive medical data pose challenges to the development of medical MLLMs. Furthermore, due to the free-text form of answers, tasks such as visual grounding that need to produce output in a prescribed form become difficult for MLLMs. So far, there have been no medical MLLMs works in medical visual grounding area. For the medical vision grounding task, which involves identifying locations in medical images based on short text descriptions, we propose Parameter-efficient Fine-tuning medical multimodal large language models for Medcial Visual Grounding (PFMVG). To validate the performance of the model, we evaluate it on a public benchmark dataset for medical visual grounding, where it achieves competitive results, and significantly outperforming GPT-4v. Our code will be open sourced after peer review.

Via

Access Paper or Ask Questions

PeFoMed: Parameter Efficient Fine-tuning on Multimodal Large Language Models for Medical Visual Question Answering

Jan 05, 2024

Jinlong He, Pengfei Li, Gang Liu, Zixu Zhao, Shenjun Zhong

Figure 1 for PeFoMed: Parameter Efficient Fine-tuning on Multimodal Large Language Models for Medical Visual Question Answering

Figure 2 for PeFoMed: Parameter Efficient Fine-tuning on Multimodal Large Language Models for Medical Visual Question Answering

Figure 3 for PeFoMed: Parameter Efficient Fine-tuning on Multimodal Large Language Models for Medical Visual Question Answering

Figure 4 for PeFoMed: Parameter Efficient Fine-tuning on Multimodal Large Language Models for Medical Visual Question Answering

Abstract:Multimodal large language models (MLLMs) represent an evolutionary expansion in the capabilities of traditional large language models, enabling them to tackle challenges that surpass the scope of purely text-based applications. It leverages the knowledge previously encoded within these language models, thereby enhancing their applicability and functionality in the reign of multimodal contexts. Recent works investigate the adaptation of MLLMs to predict free-form answers as a generative task to solve medical visual question answering (Med-VQA) tasks. In this paper, we propose a parameter efficient framework for fine-tuning MLLM specifically tailored to Med-VQA applications, and empirically validate it on a public benchmark dataset. To accurately measure the performance, we employ human evaluation and the results reveal that our model achieves an overall accuracy of 81.9%, and outperforms the GPT-4v model by a significant margin of 26% absolute accuracy on closed-ended questions. The code will be available here: https://github.com/jinlHe/PeFoMed.

Via

Access Paper or Ask Questions

Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering

Jul 11, 2023

Pengfei Li, Gang Liu, Jinlong He, Zixu Zhao, Shenjun Zhong

Figure 1 for Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering

Figure 2 for Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering

Figure 3 for Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering

Figure 4 for Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering

Abstract:Medical visual question answering (VQA) is a challenging task that requires answering clinical questions of a given medical image, by taking consider of both visual and language information. However, due to the small scale of training data for medical VQA, pre-training fine-tuning paradigms have been a commonly used solution to improve model generalization performance. In this paper, we present a novel self-supervised approach that learns unimodal and multimodal feature representations of input images and text using medical image caption datasets, by leveraging both unimodal and multimodal contrastive losses, along with masked language modeling and image text matching as pretraining objectives. The pre-trained model is then transferred to downstream medical VQA tasks. The proposed approach achieves state-of-the-art (SOTA) performance on three publicly available medical VQA datasets with significant accuracy improvements of 2.2%, 14.7%, and 1.7% respectively. Besides, we conduct a comprehensive analysis to validate the effectiveness of different components of the approach and study different pre-training settings. Our codes and models are available at https://github.com/pengfeiliHEU/MUMC.

* accepted by MICCAI2023

Via

Access Paper or Ask Questions