Xinwei Peng

MedBench v5: A Dynamic, Process-Oriented, and Hallucination-Aware Benchmark for Clinical Multimodal Models

Jun 25, 2026

MedBench v4: A Robust and Scalable Benchmark for Evaluating Chinese Medical Language Models, Multimodal Models, and Intelligent Agents

Nov 19, 2025

MedCalc-Eval and MedCalc-Env: Advancing Medical Calculation Capabilities of Large Language Models

Oct 31, 2025

Benchmarking Ethical and Safety Risks of Healthcare LLMs in China-Toward Systemic Governance under Healthy China 2030

May 12, 2025

Benchmarking Chinese Medical LLMs: A Medbench-based Analysis of Performance Gaps and Hierarchical Optimization Strategies

Mar 10, 2025

MedGPTEval: A Dataset and Benchmark to Evaluate Responses of Large Language Models in Medicine

May 12, 2023