Wataru Kawakami

Stabilizing Reasoning in Medical LLMs with Continued Pretraining and Reasoning Preference Optimization

Apr 25, 2025

Wataru Kawakami, Keita Suzuki, Junichiro Iwasawa

Abstract: Large Language Models (LLMs) show potential in medicine, yet clinical adoption is hindered by concerns over factual accuracy, language-specific limitations (e.g., Japanese), and, critically, their reliability when required to generate reasoning explanations, a prerequisite for trust. This paper introduces Preferred-MedLLM-Qwen-72B, a 72B-parameter model optimized for the Japanese medical domain to achieve both high accuracy and stable reasoning. We employ a two-stage fine-tuning process on the Qwen2.5-72B base model. First, Continued Pretraining (CPT) on a comprehensive Japanese medical corpus instills deep domain knowledge. Second, Reasoning Preference Optimization (RPO), a preference-based method, enhances the generation of reliable reasoning pathways while preserving high answer accuracy. Evaluations on the Japanese Medical Licensing Exam benchmark (IgakuQA) show that Preferred-MedLLM-Qwen-72B achieves state-of-the-art performance (0.868 accuracy), surpassing strong proprietary models like GPT-4o (0.866). Crucially, unlike baseline or CPT-only models, which exhibit significant accuracy degradation (up to 11.5% and 3.8% respectively on IgakuQA) when prompted for explanations, our model maintains its high accuracy (0.868) under such conditions. This highlights RPO's effectiveness in stabilizing reasoning generation. This work underscores the importance of optimizing for reliable explanations alongside accuracy. We release the Preferred-MedLLM-Qwen-72B model weights to foster research into trustworthy LLMs for specialized, high-stakes applications.
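The stability claim in the abstract compares accuracy under two prompting conditions: answer only, and answer with an explanation. The Python sketch below shows one way such a comparison could be computed. It is illustrative only and is not the authors' evaluation code. The names `Item`, `accuracy`, and `explanation_stability` are hypothetical, the toy records are made up, and the drop is reported as an absolute difference in accuracy, which is an assumption: the abstract does not say whether its percentages are absolute or relative.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One multiple-choice question with predictions under both prompt styles."""
    gold: str              # reference answer label
    answer_only: str       # prediction when the model is asked for the answer only
    with_explanation: str  # prediction when the model must also explain itself


def accuracy(preds, golds):
    """Fraction of predictions that exactly match the reference labels."""
    if not golds:
        raise ValueError("need at least one item")
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def explanation_stability(items):
    """Compare accuracy with and without an explanation request.

    Returns both accuracies and their absolute difference
    (answer-only minus with-explanation); a positive drop means
    the explanation prompt hurt accuracy.
    """
    golds = [it.gold for it in items]
    acc_plain = accuracy([it.answer_only for it in items], golds)
    acc_explained = accuracy([it.with_explanation for it in items], golds)
    return {
        "answer_only": acc_plain,
        "with_explanation": acc_explained,
        "drop": acc_plain - acc_explained,
    }


# Toy data (made up for illustration, not from IgakuQA).
toy_items = [
    Item(gold="a", answer_only="a", with_explanation="a"),
    Item(gold="b", answer_only="b", with_explanation="c"),
    Item(gold="c", answer_only="c", with_explanation="c"),
    Item(gold="d", answer_only="e", with_explanation="e"),
]

report = explanation_stability(toy_items)
print(report)
```

In these terms, a model with stable reasoning would show a drop near zero, which is the behavior the abstract reports for Preferred-MedLLM-Qwen-72B (0.868 in both conditions).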
