Ming Li

School of Integrated Circuits, Peking University

CLS-RL: Image Classification with Rule-Based Reinforcement Learning

Mar 20, 2025

Ming Li, Shitian Zhao, Jike Zhong, Yuxiang Lai, Kaipeng Zhang

Abstract:Classification is a core task in machine learning. Recent research has shown that although Multimodal Large Language Models (MLLMs) are initially poor at image classification, fine-tuning them with an adequate amount of data can significantly enhance their performance, making them comparable to SOTA classification models. However, acquiring large-scale labeled data is expensive. In this paper, we explore few-shot MLLM classification fine-tuning. We found that SFT can cause severe overfitting and may even degrade performance relative to the zero-shot approach. To address this challenge, inspired by the recent successes in rule-based reinforcement learning, we propose CLS-RL, which uses verifiable signals as rewards to fine-tune MLLMs. We discovered that CLS-RL outperforms SFT on most datasets and achieves much higher average accuracy in both the base-to-new and few-shot learning settings. Moreover, we observed a free-lunch phenomenon for CLS-RL: when models are fine-tuned on a particular dataset, their performance on other distinct datasets may also improve over zero-shot models, even if those datasets differ in distribution and class names. This suggests that RL-based methods effectively teach models the fundamentals of classification. Lastly, inspired by recent works in inference-time thinking, we re-examine the 'thinking process' during fine-tuning, a critical aspect of RL-based methods, in the context of visual classification. We question whether such tasks require an extensive thinking process during fine-tuning, proposing that this may actually detract from performance. Based on this premise, we introduce the No-Thinking-CLS-RL method, which minimizes thinking processes during training by setting an equality accuracy reward. Our findings indicate that, with much less fine-tuning time, the No-Thinking-CLS-RL method achieves better in-domain performance and generalization than CLS-RL.
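The two reward styles mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the reward definitions, so everything below is an assumption for illustration: the `<answer>...</answer>` tag format, the text normalization, and the function names `tagged_accuracy_reward` and `equality_accuracy_reward` are all hypothetical.

```python
import re


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not matter."""
    return " ".join(text.strip().lower().split())


def tagged_accuracy_reward(completion: str, label: str) -> float:
    """Hypothetical verifiable reward: 1.0 if the text inside <answer> tags matches the label.

    The tag format is an assumption; any reasoning text outside the tags is ignored.
    """
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if _normalize(match.group(1)) == _normalize(label) else 0.0


def equality_accuracy_reward(completion: str, label: str) -> float:
    """Hypothetical 'equality' reward: 1.0 only if the whole output equals the label.

    Any extra reasoning text makes the output unequal to the label, so it earns 0.0.
    """
    return 1.0 if _normalize(completion) == _normalize(label) else 0.0
```

Under these assumptions, a completion that contains reasoning followed by a correct tagged answer earns the tagged reward but not the equality reward, which is one way an equality-style reward could discourage long thinking traces.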

* Preprint, work in progress

Med-R1: Reinforcement Learning for Generalizable Medical Reasoning in Vision-Language Models

Mar 19, 2025

Yuxiang Lai, Jike Zhong, Ming Li, Shitian Zhao, Xiaofeng Yang

Abstract:Vision-language models (VLMs) have advanced reasoning in natural scenes, but their role in medical imaging remains underexplored. Medical reasoning tasks demand robust image analysis and well-justified answers, posing challenges due to the complexity of medical images. Transparency and trustworthiness are essential for clinical adoption and regulatory compliance. We introduce Med-R1, a framework exploring reinforcement learning (RL) to enhance VLMs' generalizability and trustworthiness in medical reasoning. Leveraging the DeepSeek strategy, we employ Group Relative Policy Optimization (GRPO) to guide reasoning paths via reward signals. Unlike supervised fine-tuning (SFT), which often overfits and lacks generalization, RL fosters robust and diverse reasoning. Med-R1 is evaluated across eight medical imaging modalities: CT, MRI, Ultrasound, Dermoscopy, Fundus Photography, Optical Coherence Tomography (OCT), Microscopy, and X-ray Imaging. Compared to its base model, Qwen2-VL-2B, Med-R1 achieves a 29.94% accuracy improvement and outperforms Qwen2-VL-72B, which has 36 times more parameters. Tested across five question types (modality recognition, anatomy identification, disease diagnosis, lesion grading, and biological attribute analysis), Med-R1 demonstrates superior generalization, exceeding Qwen2-VL-2B by 32.06% and surpassing Qwen2-VL-72B in question-type generalization. These findings show that RL improves medical reasoning and enables parameter-efficient models to outperform significantly larger ones. With interpretable reasoning outputs, Med-R1 represents a promising step toward generalizable, trustworthy, and clinically viable medical VLMs.
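The abstract names GRPO without giving formulas. As a rough sketch of the group-relative idea as it is commonly described, each sampled response's reward is normalized against the other responses sampled for the same prompt. The function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions made here, not details taken from Med-R1.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one group of responses sampled for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps). Using the
    population std and this eps value are illustrative choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled answers to one question; two were judged correct (reward 1.0).
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

One consequence of this normalization is that a group whose responses all receive the same reward yields zero advantages, so such a group contributes no learning signal in this sketch.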

Where do Large Vision-Language Models Look at when Answering Questions?

Mar 18, 2025

Xiaoying Xing, Chia-Wen Kuo, Li Fuxin, Yulei Niu, Fan Chen, Ming Li, Ying Wu, Longyin Wen, Sijie Zhu

Abstract:Large Vision-Language Models (LVLMs) have shown promising performance in vision-language understanding and reasoning tasks. However, their visual understanding behaviors remain underexplored. A fundamental question arises: to what extent do LVLMs rely on visual input, and which image regions contribute to their responses? It is non-trivial to interpret the free-form generation of LVLMs due to their complicated visual architecture (e.g., multiple encoders and multi-resolution) and variable-length outputs. In this paper, we extend existing heatmap visualization methods (e.g., iGOS++) to support LVLMs for open-ended visual question answering. We propose a method to select visually relevant tokens that reflect the relevance between the generated answers and the input image. Furthermore, we conduct a comprehensive analysis of state-of-the-art LVLMs on benchmarks designed to require visual information to answer. Our findings offer several insights into LVLM behavior, including the relationship between focus region and answer correctness, differences in visual attention across architectures, and the impact of LLM scale on visual understanding. The code and data are available at https://github.com/bytedance/LVLM_Interpretation.

Joint Array Partitioning and Beamforming Designs in ISAC Systems: A Bayesian CRB Perspective

Mar 18, 2025

Rang Liu, Ming Li, A. Lee Swindlehurst

Abstract:Integrated sensing and communication (ISAC) has emerged as a promising paradigm for next-generation (6G) wireless networks, unifying radar sensing and communication on a shared hardware platform. This paper proposes a dynamic array partitioning framework for monostatic ISAC systems to fully exploit available spatial degrees of freedom (DoFs) and reconfigurable antenna topologies, enhancing sensing performance in complex scenarios. We first establish a theoretical foundation for our work by deriving Bayesian Cramér-Rao bounds (BCRBs) under prior distribution constraints for heterogeneous target models, encompassing both point-like and extended targets. Building on this, we formulate a joint optimization framework for transmit beamforming and dynamic array partitioning to minimize the derived BCRBs for direction-of-arrival (DOA) estimation. The optimization problem incorporates practical constraints, including multi-user communication signal-to-interference-plus-noise ratio (SINR) requirements, transmit power budgets, and array partitioning feasibility conditions. To address the non-convexity of the problem, we develop an efficient alternating optimization algorithm combining the alternating direction method of multipliers (ADMM) with semi-definite relaxation (SDR). We also design novel maximum a posteriori (MAP) DOA estimation algorithms specifically adapted to the statistical characteristics of each target model. Extensive simulations illustrate the superiority of the proposed dynamic partitioning strategy over conventional fixed-array architectures across diverse system configurations.

* 13 pages, 10 figures, submitted to IEEE journal

Low Range-Doppler Sidelobe ISAC Waveform Design: A Low-Complexity Approach

Mar 15, 2025

Peishi Li, Ming Li, Rang Liu, Qian Liu, A. Lee Swindlehurst

Abstract:Integrated sensing and communication (ISAC) is a pivotal enabler for next-generation wireless networks. A key challenge in ISAC systems lies in designing dual-functional waveforms that can achieve satisfactory radar sensing accuracy by effectively suppressing range-Doppler sidelobes. However, existing solutions are often computationally intensive, limiting their practicality in multi-input multi-output (MIMO) orthogonal frequency division multiplexing (OFDM) ISAC deployments. This paper presents a novel low-complexity algorithm leveraging the augmented Lagrangian method (ALM) and Riemannian conjugate gradient (RCG) optimization techniques to address these challenges. The proposed algorithm achieves superior sidelobe suppression compared to state-of-the-art methods while dramatically reducing computational complexity, making it highly suitable for real-world MIMO-OFDM ISAC systems. Simulation results demonstrate that the proposed approach not only outperforms existing benchmarks in sidelobe reduction but also accelerates convergence, ensuring efficient performance across communication and sensing tasks.

* submitted to IEEE TVT

Safe-VAR: Safe Visual Autoregressive Model for Text-to-Image Generative Watermarking

Mar 14, 2025

Ziyi Wang, Songbai Tan, Gang Xu, Xuerui Qiu, Hongbin Xu, Xin Meng, Ming Li, Fei Richard Yu

Abstract:With the success of autoregressive learning in large language models, it has become a dominant approach for text-to-image generation, offering high efficiency and visual quality. However, invisible watermarking for visual autoregressive (VAR) models remains underexplored, despite its importance in misuse prevention. Existing watermarking methods, designed for diffusion models, often struggle to adapt to the sequential nature of VAR models. To bridge this gap, we propose Safe-VAR, the first watermarking framework specifically designed for autoregressive text-to-image generation. Our study reveals that the timing of watermark injection significantly impacts generation quality, and watermarks of different complexities exhibit varying optimal injection times. Motivated by this observation, we propose an Adaptive Scale Interaction Module, which dynamically determines the optimal watermark embedding strategy based on the watermark information and the visual characteristics of the generated image. This ensures watermark robustness while minimizing its impact on image quality. Furthermore, we introduce a Cross-Scale Fusion mechanism, which integrates mixtures of both heads and experts to effectively fuse multi-resolution features and handle complex interactions between image content and watermark patterns. Experimental results demonstrate that Safe-VAR achieves state-of-the-art performance, significantly surpassing existing counterparts in image quality, watermarking fidelity, and robustness against perturbations. Moreover, our method exhibits strong generalization to an out-of-domain watermark dataset, QR Codes.

FaVChat: Unlocking Fine-Grained Facial Video Understanding with Multimodal Large Language Models

Mar 13, 2025

Fufangchen Zhao, Ming Li, Linrui Xu, Wenhao Jiang, Jian Gao, Danfeng Yan

Abstract:Video-based multimodal large language models (VMLLMs) have demonstrated remarkable potential in cross-modal video understanding. However, their abilities in fine-grained face comprehension remain largely underexplored. Given its pivotal role in human-centric intelligence, developing VMLLMs for facial understanding is a fundamental problem. To address this gap, we propose FaVChat, the first VMLLM specifically designed for fine-grained facial video understanding. To facilitate its training, we construct a large-scale facial video dataset comprising over 60k videos, with the majority annotated with 83 fine-grained facial attributes. These attributes are incorporated to enrich GPT-4o-generated captions, yielding 60k high-quality video-summary pairs and an additional 170k fine-grained question-answering (QA) pairs. To effectively capture rich facial clues, we propose a hybrid model architecture composed of a general visual encoder, a dedicated facial encoder, and a mixture-of-experts-enhanced adapter for adaptive fusion of multi-source visual features. To mitigate information loss during feature transformation, we extract multi-granularity representations from the facial encoder and integrate them into the subsequent LLM. This design enhances the model's ability to comprehend and respond to questions involving diverse levels of visual detail. We employ a progressive training paradigm, transitioning from video summarization to a high-quality subset of video QA, gradually increasing task complexity to enhance the model's fine-grained visual perception. We conduct extensive zero-shot evaluation on a couple of public benchmarks, demonstrating that FaVChat consistently surpasses existing VMLLMs across multiple tasks.

Uni$\textbf{F}^2$ace: Fine-grained Face Understanding and Generation with Unified Multimodal Models

Mar 11, 2025

Junzhe Li, Xuerui Qiu, Linrui Xu, Liya Guo, Delin Qu, Tingting Long, Chun Fan, Ming Li

Abstract:Unified multimodal models (UMMs) have emerged as a powerful paradigm in foundational computer vision research, demonstrating significant potential in both image understanding and generation. However, existing research in the face domain primarily focuses on $\textbf{coarse}$ facial attribute understanding, with limited capacity to handle $\textbf{fine-grained}$ facial attributes and without addressing generation capabilities. To overcome these limitations, we propose Uni$\textbf{F}^2$ace, the first UMM tailored specifically for fine-grained face understanding and generation. In general, we train Uni$\textbf{F}^2$ace on a self-constructed, specialized dataset utilizing two mutually beneficial diffusion techniques and a two-level mixture-of-experts architecture. Specifically, we first build a large-scale facial dataset, Uni$\textbf{F}^2$ace-130K, which contains 130K image-text pairs with one million question-answering pairs that span a wide range of facial attributes. Second, we establish a theoretical connection between discrete diffusion score matching and masked generative models, optimizing both evidence lower bounds simultaneously, which significantly improves the model's ability to synthesize facial details. Finally, we introduce both token-level and sequence-level mixture-of-experts, enabling efficient fine-grained representation learning for both understanding and generation tasks. Extensive experiments on Uni$\textbf{F}^2$ace-130K demonstrate that Uni$\textbf{F}^2$ace outperforms existing UMMs and generative models, achieving superior performance across both understanding and generation tasks.

AnomalyPainter: Vision-Language-Diffusion Synergy for Zero-Shot Realistic and Diverse Industrial Anomaly Synthesis

Mar 11, 2025

Zhangyu Lai, Yilin Lu, Xinyang Li, Jianghang Lin, Yansong Qu, Liujuan Cao, Ming Li, Rongrong Ji

Abstract:While existing anomaly synthesis methods have made remarkable progress, achieving both realism and diversity in synthesis remains a major obstacle. To address this, we propose AnomalyPainter, a zero-shot framework that breaks the diversity-realism trade-off dilemma through synergizing Vision Language Large Model (VLLM), Latent Diffusion Model (LDM), and our newly introduced texture library Tex-9K. Tex-9K is a professional texture library containing 75 categories and 8,792 texture assets crafted for diverse anomaly synthesis. Leveraging VLLM's general knowledge, reasonable anomaly text descriptions are generated for each industrial object and matched with relevant diverse textures from Tex-9K. These textures then guide the LDM via ControlNet to paint on normal images. Furthermore, we introduce Texture-Aware Latent Init to stabilize the natural-image-trained ControlNet for industrial images. Extensive experiments show that AnomalyPainter outperforms existing methods in realism, diversity, and generalization, achieving superior downstream performance.

* anomaly synthesis, anomaly detection

Convergence Dynamics and Stabilization Strategies of Co-Evolving Generative Models

Mar 11, 2025

Weiguo Gao, Ming Li

Abstract:The increasing prevalence of synthetic data in training loops has raised concerns about model collapse, where generative models degrade when trained on their own outputs. While prior work focuses on this self-consuming process, we study an underexplored yet prevalent phenomenon: co-evolving generative models that shape each other's training through iterative feedback. This is common in multimodal AI ecosystems, such as social media platforms, where text models generate captions that guide image models, and the resulting images influence the future adaptation of the text model. We take a first step by analyzing such a system, modeling the text model as a multinomial distribution and the image model as a conditional multi-dimensional Gaussian distribution. Our analysis uncovers three key results. First, when one model remains fixed, the other collapses: a frozen image model causes the text model to lose diversity, while a frozen text model leads to an exponential contraction of image diversity, though fidelity remains bounded. Second, in fully interactive systems, mutual reinforcement accelerates collapse, with image contraction amplifying text homogenization and vice versa, leading to a Matthew effect where dominant texts sustain higher image diversity while rarer texts collapse faster. Third, we analyze stabilization strategies implicitly introduced by real-world external influences. Random corpus injections for text models and user-content injections for image models prevent collapse while preserving both diversity and fidelity. Our theoretical findings are further validated through experiments.
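As a toy illustration of the kind of diversity contraction the abstract describes, the sketch below repeatedly refits a one-dimensional Gaussian to samples drawn from the previous fit. This is not the paper's co-evolving text-image model; it is a simplified self-consuming loop chosen here for illustration, and the function names are hypothetical. It relies on the standard fact that the maximum-likelihood variance estimate from n samples has expectation (n - 1)/n times the true variance, so the expected variance shrinks geometrically across generations.

```python
import random


def expected_variance(var0: float, n: int, generations: int) -> float:
    """Expected variance after repeated MLE refits on n samples per generation.

    Each refit multiplies the expected variance by (n - 1) / n.
    """
    return var0 * ((n - 1) / n) ** generations


def simulate_refits(var0: float, n: int, generations: int, seed: int = 0) -> list[float]:
    """Simulate one trajectory: sample from the current Gaussian, refit by MLE, repeat."""
    rng = random.Random(seed)
    mu, var = 0.0, var0
    history = [var]
    for _ in range(generations):
        samples = [rng.gauss(mu, var ** 0.5) for _ in range(n)]
        mu = sum(samples) / n
        # MLE variance divides by n, which is biased downward.
        var = sum((x - mu) ** 2 for x in samples) / n
        history.append(var)
    return history


if __name__ == "__main__":
    print(expected_variance(1.0, 20, 50))
    print(simulate_refits(1.0, 20, 50)[-1])
```

A single simulated trajectory is noisy and will not match the expected value exactly; the geometric decay holds only in expectation under these assumptions.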

* 37 pages, 11 figures
