Abstract:The rise of multimodal large language models (MLLMs) has sparked an unprecedented wave of applications in the field of medical imaging analysis. However, as one of the earliest and most fundamental tasks integrated into this paradigm, medical image classification reveals a sobering reality: state-of-the-art medical MLLMs consistently underperform compared to traditional deep learning models, despite their overwhelming advantages in pre-training data and model parameters. This paradox prompts a critical rethinking: where exactly does the performance degradation originate? In this paper, we conduct extensive experiments on 14 open-source medical MLLMs across three representative image classification datasets. Moving beyond superficial performance benchmarking, we employ feature probing to track the information flow of visual features module-by-module and layer-by-layer throughout the entire MLLM pipeline, enabling explicit visualization of where and how classification signals are distorted, diluted, or overridden. As the first attempt to dissect classification performance degradation in medical MLLMs, our findings reveal four failure modes: 1) quality limitation in visual representation, 2) fidelity loss in connector projection, 3) comprehension deficit in LLM reasoning, and 4) misalignment of semantic mapping. Meanwhile, we introduce quantitative scores that characterize the healthiness of feature evolution, enabling principled comparisons across diverse MLLMs and datasets. Furthermore, we provide insightful discussions centered on the critical barriers that prevent current medical MLLMs from fulfilling their promised clinical potential. We hope that our work provokes rethinking within the community-highlighting that the road from high expectations to clinically deployable MLLMs remains long and winding.
Abstract:Medical image analysis is a fundamental component. As deep learning progresses, the focus has shifted from single-task applications, such as classification and segmentation, to more complex multimodal tasks, including medical visual question answering and report generation. Traditional shallow and task-specific models are increasingly limited in addressing the complexity and scalability required in clinical practice. The emergence of large language models (LLMs) has driven the development of medical Large Vision-Language Models (LVLMs), offering a unified solution for diverse vision-language tasks. In this study, we investigate various architectural designs for medical LVLMs based on the widely adopted LLaVA framework, which follows an encoder-connector-LLM paradigm. We construct two distinct models targeting 2D and 3D modalities, respectively. These models are designed to support both general-purpose medical tasks and domain-specific fine-tuning, thereby serving as effective foundation models. To facilitate reproducibility and further research, we develop a modular and extensible codebase, MedM-VL, and release two LVLM variants: MedM-VL-2D for 2D medical image analysis and MedM-VL-CT-Chest for 3D CT-based applications. The code and models are available at: https://github.com/MSIIP/MedM-VL