Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yu Sun

Sherman

ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning

Jun 11, 2025

Yu Sun, Xingyu Qian, Weiwen Xu, Hao Zhang, Chenghao Xiao, Long Li, Yu Rong, Wenbing Huang, Qifeng Bai, Tingyang Xu

Abstract:Though reasoning-based large language models (LLMs) have excelled in mathematics and programming, their capabilities in knowledge-intensive medical question answering remain underexplored. To address this, we introduce ReasonMed, the largest medical reasoning dataset, comprising 370k high-quality examples distilled from 1.7 million initial reasoning paths generated by various LLMs. ReasonMed is constructed through a \textit{multi-agent verification and refinement process}, where we design an \textit{Error Refiner} to enhance the reasoning paths by identifying and correcting error-prone steps flagged by a verifier. Leveraging ReasonMed, we systematically investigate best practices for training medical reasoning models and find that combining detailed Chain-of-Thought (CoT) reasoning with concise answer summaries yields the most effective fine-tuning strategy. Based on this strategy, we train ReasonMed-7B, which sets a new benchmark for sub-10B models, outperforming the prior best by 4.17\% and even exceeding LLaMA3.1-70B on PubMedQA by 4.60\%.

* 24 pages, 6 figures, 7 tables

Via

Access Paper or Ask Questions

Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning

Jun 08, 2025

LASA Team, Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu(+9 more)

Figure 1 for Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning

Figure 2 for Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning

Figure 3 for Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning

Figure 4 for Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in understanding common visual elements, largely due to their large-scale datasets and advanced training strategies. However, their effectiveness in medical applications remains limited due to the inherent discrepancies between data and tasks in medical scenarios and those in the general domain. Concretely, existing medical MLLMs face the following critical limitations: (1) limited coverage of medical knowledge beyond imaging, (2) heightened susceptibility to hallucinations due to suboptimal data curation processes, (3) lack of reasoning capabilities tailored for complex medical scenarios. To address these challenges, we first propose a comprehensive data curation procedure that (1) efficiently acquires rich medical knowledge data not only from medical imaging but also from extensive medical texts and general-domain data; and (2) synthesizes accurate medical captions, visual question answering (VQA), and reasoning samples. As a result, we build a multimodal dataset enriched with extensive medical knowledge. Building on the curated data, we introduce our medical-specialized MLLM: Lingshu. Lingshu undergoes multi-stage training to embed medical expertise and enhance its task-solving capabilities progressively. Besides, we preliminarily explore the potential of applying reinforcement learning with verifiable rewards paradigm to enhance Lingshu's medical reasoning ability. Additionally, we develop MedEvalKit, a unified evaluation framework that consolidates leading multimodal and textual medical benchmarks for standardized, fair, and efficient model assessment. We evaluate the performance of Lingshu on three fundamental medical tasks, multimodal QA, text-based QA, and medical report generation. The results show that Lingshu consistently outperforms the existing open-source multimodal models on most tasks ...

* Technical Report, 53 pages, 25 tables, and 16 figures

Via

Access Paper or Ask Questions

Advantageous Parameter Expansion Training Makes Better Large Language Models

May 30, 2025

Naibin Gu, Yilong Chen, Zhenyu Zhang, Peng Fu, Zheng Lin, Shuohuan Wang, Yu Sun, Hua Wu, Weiping Wang, Haifeng Wang

Figure 1 for Advantageous Parameter Expansion Training Makes Better Large Language Models

Figure 2 for Advantageous Parameter Expansion Training Makes Better Large Language Models

Figure 3 for Advantageous Parameter Expansion Training Makes Better Large Language Models

Figure 4 for Advantageous Parameter Expansion Training Makes Better Large Language Models

Abstract:Although scaling up the number of trainable parameters in both pre-training and fine-tuning can effectively improve the performance of large language models, it also leads to increased computational overhead. When delving into the parameter difference, we find that a subset of parameters, termed advantageous parameters, plays a crucial role in determining model performance. Further analysis reveals that stronger models tend to possess more such parameters. In this paper, we propose Advantageous Parameter EXpansion Training (APEX), a method that progressively expands advantageous parameters into the space of disadvantageous ones, thereby increasing their proportion and enhancing training effectiveness. Further theoretical analysis from the perspective of matrix effective rank explains the performance gains of APEX. Extensive experiments on both instruction tuning and continued pre-training demonstrate that, in instruction tuning, APEX outperforms full-parameter tuning while using only 52% of the trainable parameters. In continued pre-training, APEX achieves the same perplexity level as conventional training with just 33% of the training data, and yields significant improvements on downstream tasks.

Via

Access Paper or Ask Questions

A Lightweight Hybrid Dual Channel Speech Enhancement System under Low-SNR Conditions

May 26, 2025

Zheng Wang, Xiaobin Rong, Yu Sun, Tianchi Sun, Zhibin Lin, Jing Lu

Figure 1 for A Lightweight Hybrid Dual Channel Speech Enhancement System under Low-SNR Conditions

Figure 2 for A Lightweight Hybrid Dual Channel Speech Enhancement System under Low-SNR Conditions

Figure 3 for A Lightweight Hybrid Dual Channel Speech Enhancement System under Low-SNR Conditions

Figure 4 for A Lightweight Hybrid Dual Channel Speech Enhancement System under Low-SNR Conditions

Abstract:Although deep learning based multi-channel speech enhancement has achieved significant advancements, its practical deployment is often limited by constrained computational resources, particularly in low signal-to-noise ratio (SNR) conditions. In this paper, we propose a lightweight hybrid dual-channel speech enhancement system that combines independent vector analysis (IVA) with a modified version of the dual-channel grouped temporal convolutional recurrent network (GTCRN). IVA functions as a coarse estimator, providing auxiliary information for both speech and noise, while the modified GTCRN further refines the speech quality. We investigate several modifications to ensure the comprehensive utilization of both original and auxiliary information. Experimental results demonstrate the effectiveness of the proposed system, achieving enhanced speech with minimal parameters and low computational complexity.

* Accepted by Interspeech 2025

Via

Access Paper or Ask Questions

Super-Resolution Optical Coherence Tomography Using Diffusion Model-Based Plug-and-Play Priors

May 20, 2025

Yaning Wang, Jinglun Yu, Wenhan Guo, Yu Sun, Jin U. Kang

Abstract:We propose an OCT super-resolution framework based on a plug-and-play diffusion model (PnP-DM) to reconstruct high-quality images from sparse measurements (OCT B-mode corneal images). Our method formulates reconstruction as an inverse problem, combining a diffusion prior with Markov chain Monte Carlo sampling for efficient posterior inference. We collect high-speed under-sampled B-mode corneal images and apply a deep learning-based up-sampling pipeline to build realistic training pairs. Evaluations on in vivo and ex vivo fish-eye corneal models show that PnP-DM outperforms conventional 2D-UNet baselines, producing sharper structures and better noise suppression. This approach advances high-fidelity OCT imaging in high-speed acquisition for clinical applications.

Via

Access Paper or Ask Questions

Neural Inverse Scattering with Score-based Regularization

May 20, 2025

Yuan Gao, Wenhan Guo, Yu Sun

Abstract:Inverse scattering is a fundamental challenge in many imaging applications, ranging from microscopy to remote sensing. Solving this problem often requires jointly estimating two unknowns -- the image and the scattering field inside the object -- necessitating effective image prior to regularize the inference. In this paper, we propose a regularized neural field (NF) approach which integrates the denoising score function used in score-based generative models. The neural field formulation offers convenient flexibility to performing joint estimation, while the denoising score function imposes the rich structural prior of images. Our results on three high-contrast simulated objects show that the proposed approach yields a better imaging quality compared to the state-of-the-art NF approach, where regularization is based on total variation.

Via

Access Paper or Ask Questions

Whitened Score Diffusion: A Structured Prior for Imaging Inverse Problems

May 15, 2025

Jeffrey Alido, Tongyu Li, Yu Sun, Lei Tian

Figure 1 for Whitened Score Diffusion: A Structured Prior for Imaging Inverse Problems

Figure 2 for Whitened Score Diffusion: A Structured Prior for Imaging Inverse Problems

Figure 3 for Whitened Score Diffusion: A Structured Prior for Imaging Inverse Problems

Figure 4 for Whitened Score Diffusion: A Structured Prior for Imaging Inverse Problems

Abstract:Conventional score-based diffusion models (DMs) may struggle with anisotropic Gaussian diffusion processes due to the required inversion of covariance matrices in the denoising score matching training objective \cite{vincent_connection_2011}. We propose Whitened Score (WS) diffusion models, a novel SDE-based framework that learns the Whitened Score function instead of the standard score. This approach circumvents covariance inversion, extending score-based DMs by enabling stable training of DMs on arbitrary Gaussian forward noising processes. WS DMs establish equivalence with FM for arbitrary Gaussian noise, allow for tailored spectral inductive biases, and provide strong Bayesian priors for imaging inverse problems with structured noise. We experiment with a variety of computational imaging tasks using the CIFAR and CelebA ($64\times64$) datasets and demonstrate that WS diffusion priors trained on anisotropic Gaussian noising processes consistently outperform conventional diffusion priors based on isotropic Gaussian noise.

Via

Access Paper or Ask Questions

Ultrasound Report Generation with Multimodal Large Language Models for Standardized Texts

May 13, 2025

Peixuan Ge, Tongkun Su, Faqin Lv, Baoliang Zhao, Peng Zhang, Chi Hong Wong, Liang Yao, Yu Sun, Zenan Wang, Pak Kin Wong(+1 more)

Abstract:Ultrasound (US) report generation is a challenging task due to the variability of US images, operator dependence, and the need for standardized text. Unlike X-ray and CT, US imaging lacks consistent datasets, making automation difficult. In this study, we propose a unified framework for multi-organ and multilingual US report generation, integrating fragment-based multilingual training and leveraging the standardized nature of US reports. By aligning modular text fragments with diverse imaging data and curating a bilingual English-Chinese dataset, the method achieves consistent and clinically accurate text generation across organ sites and languages. Fine-tuning with selective unfreezing of the vision transformer (ViT) further improves text-image alignment. Compared to the previous state-of-the-art KMVE method, our approach achieves relative gains of about 2\% in BLEU scores, approximately 3\% in ROUGE-L, and about 15\% in CIDEr, while significantly reducing errors such as missing or incorrect content. By unifying multi-organ and multi-language report generation into a single, scalable framework, this work demonstrates strong potential for real-world clinical workflows.

Via

Access Paper or Ask Questions

Developing Modular Grasping and Manipulation Pipeline Infrastructure to Streamline Performance Benchmarking

Apr 09, 2025

Brian Flynn, Kostas Bekris, Berk Calli, Aaron Dollar, Adam Norton, Yu Sun, Holly Yanco

Abstract:The robot manipulation ecosystem currently faces issues with integrating open-source components and reproducing results. This limits the ability of the community to benchmark and compare the performance of different solutions to one another in an effective manner, instead relying on largely holistic evaluations. As part of the COMPARE Ecosystem project, we are developing modular grasping and manipulation pipeline infrastructure in order to streamline performance benchmarking. The infrastructure will be used towards the establishment of standards and guidelines for modularity and improved open-source development and benchmarking. This paper provides a high-level overview of the architecture of the pipeline infrastructure, experiments conducted to exercise it during development, and future work to expand its modularity.

* IEEE International Conference on Robotics and Automation (ICRA) 2025, Workshop on Robot Software Architectures (RSA25), Atlanta, Georgia, USA, May 2025

Via

Access Paper or Ask Questions

PromptHMR: Promptable Human Mesh Recovery

Apr 08, 2025

Yufu Wang, Yu Sun, Priyanka Patel, Kostas Daniilidis, Michael J. Black, Muhammed Kocabas

Figure 1 for PromptHMR: Promptable Human Mesh Recovery

Figure 2 for PromptHMR: Promptable Human Mesh Recovery

Figure 3 for PromptHMR: Promptable Human Mesh Recovery

Figure 4 for PromptHMR: Promptable Human Mesh Recovery

Abstract:Human pose and shape (HPS) estimation presents challenges in diverse scenarios such as crowded scenes, person-person interactions, and single-view reconstruction. Existing approaches lack mechanisms to incorporate auxiliary "side information" that could enhance reconstruction accuracy in such challenging scenarios. Furthermore, the most accurate methods rely on cropped person detections and cannot exploit scene context while methods that process the whole image often fail to detect people and are less accurate than methods that use crops. While recent language-based methods explore HPS reasoning through large language or vision-language models, their metric accuracy is well below the state of the art. In contrast, we present PromptHMR, a transformer-based promptable method that reformulates HPS estimation through spatial and semantic prompts. Our method processes full images to maintain scene context and accepts multiple input modalities: spatial prompts like bounding boxes and masks, and semantic prompts like language descriptions or interaction labels. PromptHMR demonstrates robust performance across challenging scenarios: estimating people from bounding boxes as small as faces in crowded scenes, improving body shape estimation through language descriptions, modeling person-person interactions, and producing temporally coherent motions in videos. Experiments on benchmarks show that PromptHMR achieves state-of-the-art performance while offering flexible prompt-based control over the HPS estimation process.

Via

Access Paper or Ask Questions