Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Xiaomin Wu

MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data

Mar 10, 2026

Zongxia Li, Hongyang Du, Chengsong Huang, Xiyang Wu, Lantao Yu, Yicheng He, Jing Xie, Xiaomin Wu, Zhichao Liu, Jiarui Zhang(+1 more)

Abstract:Self-evolving has emerged as a key paradigm for improving foundational models such as Large Language Models (LLMs) and Vision Language Models (VLMs) with minimal human intervention. While recent approaches have demonstrated that LLM agents can self-evolve from scratch with little to no data, VLMs introduce an additional visual modality that typically requires at least some seed data, such as images, to bootstrap the self-evolution process. In this work, we present Multi-model Multimodal Zero (MM-Zero), the first RL-based framework to achieve zero-data self-evolution for VLM reasoning. Moving beyond prior dual-role (Proposer and Solver) setups, MM-Zero introduces a multi-role self-evolving training framework comprising three specialized roles: a Proposer that generates abstract visual concepts and formulates questions; a Coder that translates these concepts into executable code (e.g., Python, SVG) to render visual images; and a Solver that performs multimodal reasoning over the generated visual content. All three roles are initialized from the same base model and trained using Group Relative Policy Optimization (GRPO), with carefully designed reward mechanisms that integrate execution feedback, visual verification, and difficulty balancing. Our experiments show that MM-Zero improves VLM reasoning performance across a wide range of multimodal benchmarks. MM-Zero establishes a scalable path toward self-evolving multi-model systems for multimodal models, extending the frontier of self-improvement beyond the conventional two-model paradigm.

Via

Access Paper or Ask Questions

PathInsight: Instruction Tuning of Multimodal Datasets and Models for Intelligence Assisted Diagnosis in Histopathology

Aug 13, 2024

Xiaomin Wu, Rui Xu, Pengchen Wei, Wenkang Qin, Peixiang Huang, Ziheng Li, Lin Luo

Figure 1 for PathInsight: Instruction Tuning of Multimodal Datasets and Models for Intelligence Assisted Diagnosis in Histopathology

Figure 2 for PathInsight: Instruction Tuning of Multimodal Datasets and Models for Intelligence Assisted Diagnosis in Histopathology

Figure 3 for PathInsight: Instruction Tuning of Multimodal Datasets and Models for Intelligence Assisted Diagnosis in Histopathology

Figure 4 for PathInsight: Instruction Tuning of Multimodal Datasets and Models for Intelligence Assisted Diagnosis in Histopathology

Abstract:Pathological diagnosis remains the definitive standard for identifying tumors. The rise of multimodal large models has simplified the process of integrating image analysis with textual descriptions. Despite this advancement, the substantial costs associated with training and deploying these complex multimodal models, together with a scarcity of high-quality training datasets, create a significant divide between cutting-edge technology and its application in the clinical setting. We had meticulously compiled a dataset of approximately 45,000 cases, covering over 6 different tasks, including the classification of organ tissues, generating pathology report descriptions, and addressing pathology-related questions and answers. We have fine-tuned multimodal large models, specifically LLaVA, Qwen-VL, InternLM, with this dataset to enhance instruction-based performance. We conducted a qualitative assessment of the capabilities of the base model and the fine-tuned model in performing image captioning and classification tasks on the specific dataset. The evaluation results demonstrate that the fine-tuned model exhibits proficiency in addressing typical pathological questions. We hope that by making both our models and datasets publicly available, they can be valuable to the medical and research communities.

* 10 pages, 2 figures

Via

Access Paper or Ask Questions

What a Whole Slide Image Can Tell? Subtype-guided Masked Transformer for Pathological Image Captioning

Oct 31, 2023

Wenkang Qin, Rui Xu, Peixiang Huang, Xiaomin Wu, Heyu Zhang, Lin Luo

Figure 1 for What a Whole Slide Image Can Tell? Subtype-guided Masked Transformer for Pathological Image Captioning

Figure 2 for What a Whole Slide Image Can Tell? Subtype-guided Masked Transformer for Pathological Image Captioning

Figure 3 for What a Whole Slide Image Can Tell? Subtype-guided Masked Transformer for Pathological Image Captioning

Figure 4 for What a Whole Slide Image Can Tell? Subtype-guided Masked Transformer for Pathological Image Captioning

Abstract:Pathological captioning of Whole Slide Images (WSIs), though is essential in computer-aided pathological diagnosis, has rarely been studied due to the limitations in datasets and model training efficacy. In this paper, we propose a new paradigm Subtype-guided Masked Transformer (SGMT) for pathological captioning based on Transformers, which treats a WSI as a sequence of sparse patches and generates an overall caption sentence from the sequence. An accompanying subtype prediction is introduced into SGMT to guide the training process and enhance the captioning accuracy. We also present an Asymmetric Masked Mechansim approach to tackle the large size constraint of pathological image captioning, where the numbers of sequencing patches in SGMT are sampled differently in the training and inferring phases, respectively. Experiments on the PatchGastricADC22 dataset demonstrate that our approach effectively adapts to the task with a transformer-based model and achieves superior performance than traditional RNN-based methods. Our codes are to be made available for further research and development.

Via

Access Paper or Ask Questions

HTEC: Human Transcription Error Correction

Sep 18, 2023

Hanbo Sun, Jian Gao, Xiaomin Wu, Anjie Fang, Cheng Cao, Zheng Du

Figure 1 for HTEC: Human Transcription Error Correction

Figure 2 for HTEC: Human Transcription Error Correction

Figure 3 for HTEC: Human Transcription Error Correction

Figure 4 for HTEC: Human Transcription Error Correction

Abstract:High-quality human transcription is essential for training and improving Automatic Speech Recognition (ASR) models. Recent study~\cite{libricrowd} has found that every 1% worse transcription Word Error Rate (WER) increases approximately 2% ASR WER by using the transcriptions to train ASR models. Transcription errors are inevitable for even highly-trained annotators. However, few studies have explored human transcription correction. Error correction methods for other problems, such as ASR error correction and grammatical error correction, do not perform sufficiently for this problem. Therefore, we propose HTEC for Human Transcription Error Correction. HTEC consists of two stages: Trans-Checker, an error detection model that predicts and masks erroneous words, and Trans-Filler, a sequence-to-sequence generative model that fills masked positions. We propose a holistic list of correction operations, including four novel operations handling deletion errors. We further propose a variant of embeddings that incorporates phoneme information into the input of the transformer. HTEC outperforms other methods by a large margin and surpasses human annotators by 2.2% to 4.5% in WER. Finally, we deployed HTEC to assist human annotators and showed HTEC is particularly effective as a co-pilot, which improves transcription quality by 15.1% without sacrificing transcription velocity.

* 13 pages, 4 figures, 11 tables, AMLC 2023

Via

Access Paper or Ask Questions