Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Edward Choi

CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays

May 23, 2025

Hyungyung Lee, Geon Choi, Jung-Oh Lee, Hangyul Yoon, Hyuk Gi Hong, Edward Choi

Abstract:Recent progress in Large Vision-Language Models (LVLMs) has enabled promising applications in medical tasks, such as report generation and visual question answering. However, existing benchmarks focus mainly on the final diagnostic answer, offering limited insight into whether models engage in clinically meaningful reasoning. To address this, we present CheXStruct and CXReasonBench, a structured pipeline and benchmark built on the publicly available MIMIC-CXR-JPG dataset. CheXStruct automatically derives a sequence of intermediate reasoning steps directly from chest X-rays, such as segmenting anatomical regions, deriving anatomical landmarks and diagnostic measurements, computing diagnostic indices, and applying clinical thresholds. CXReasonBench leverages this pipeline to evaluate whether models can perform clinically valid reasoning steps and to what extent they can learn from structured guidance, enabling fine-grained and transparent assessment of diagnostic reasoning. The benchmark comprises 18,988 QA pairs across 12 diagnostic tasks and 1,200 cases, each paired with up to 4 visual inputs, and supports multi-path, multi-stage evaluation including visual grounding via anatomical region selection and diagnostic measurements. Even the strongest of 10 evaluated LVLMs struggle with structured reasoning and generalization, often failing to link abstract knowledge with anatomically grounded visual interpretation. The code is available at https://github.com/ttumyche/CXReasonBench

Via

Access Paper or Ask Questions

PatientSim: A Persona-Driven Simulator for Realistic Doctor-Patient Interactions

May 23, 2025

Daeun Kyung, Hyunseung Chung, Seongsu Bae, Jiho Kim, Jae Ho Sohn, Taerim Kim, Soo Kyung Kim, Edward Choi

Figure 1 for PatientSim: A Persona-Driven Simulator for Realistic Doctor-Patient Interactions

Figure 2 for PatientSim: A Persona-Driven Simulator for Realistic Doctor-Patient Interactions

Figure 3 for PatientSim: A Persona-Driven Simulator for Realistic Doctor-Patient Interactions

Figure 4 for PatientSim: A Persona-Driven Simulator for Realistic Doctor-Patient Interactions

Abstract:Doctor-patient consultations require multi-turn, context-aware communication tailored to diverse patient personas. Training or evaluating doctor LLMs in such settings requires realistic patient interaction systems. However, existing simulators often fail to reflect the full range of personas seen in clinical practice. To address this, we introduce PatientSim, a patient simulator that generates realistic and diverse patient personas for clinical scenarios, grounded in medical expertise. PatientSim operates using: 1) clinical profiles, including symptoms and medical history, derived from real-world data in the MIMIC-ED and MIMIC-IV datasets, and 2) personas defined by four axes: personality, language proficiency, medical history recall level, and cognitive confusion level, resulting in 37 unique combinations. We evaluated eight LLMs for factual accuracy and persona consistency. The top-performing open-source model, Llama 3.3, was validated by four clinicians to confirm the robustness of our framework. As an open-source, customizable platform, PatientSim provides a reproducible and scalable solution that can be customized for specific training needs. Offering a privacy-compliant environment, it serves as a robust testbed for evaluating medical dialogue systems across diverse patient presentations and shows promise as an educational tool for healthcare.

* 9 pages for main text, 4 pages for references, 27 pages for supplementary materials

Via

Access Paper or Ask Questions

Enhancing LLMs' Clinical Reasoning with Real-World Data from a Nationwide Sepsis Registry

May 05, 2025

Junu Kim, Chaeeun Shim, Sungjin Park, Su Yeon Lee, Gee Young Suh, Chae-Man Lim, Seong Jin Choi, Song Mi Moon, Kyoung-Ho Song, Eu Suk Kim(+9 more)

Abstract:Although large language models (LLMs) have demonstrated impressive reasoning capabilities across general domains, their effectiveness in real-world clinical practice remains limited. This is likely due to their insufficient exposure to real-world clinical data during training, as such data is typically not included due to privacy concerns. To address this, we propose enhancing the clinical reasoning capabilities of LLMs by leveraging real-world clinical data. We constructed reasoning-intensive questions from a nationwide sepsis registry and fine-tuned Phi-4 on these questions using reinforcement learning, resulting in C-Reason. C-Reason exhibited strong clinical reasoning capabilities on the in-domain test set, as evidenced by both quantitative metrics and expert evaluations. Furthermore, its enhanced reasoning capabilities generalized to a sepsis dataset involving different tasks and patient cohorts, an open-ended consultations on antibiotics use task, and other diseases. Future research should focus on training LLMs with large-scale, multi-disease clinical datasets to develop more powerful, general-purpose clinical reasoning models.

Via

Access Paper or Ask Questions

LabTOP: A Unified Model for Lab Test Outcome Prediction on Electronic Health Records

Feb 20, 2025

Sujeong Im, Jungwoo Oh, Edward Choi

Figure 1 for LabTOP: A Unified Model for Lab Test Outcome Prediction on Electronic Health Records

Figure 2 for LabTOP: A Unified Model for Lab Test Outcome Prediction on Electronic Health Records

Figure 3 for LabTOP: A Unified Model for Lab Test Outcome Prediction on Electronic Health Records

Figure 4 for LabTOP: A Unified Model for Lab Test Outcome Prediction on Electronic Health Records

Abstract:Lab tests are fundamental for diagnosing diseases and monitoring patient conditions. However, frequent testing can be burdensome for patients, and test results may not always be immediately available. To address these challenges, we propose LabTOP, a unified model that predicts lab test outcomes by leveraging a language modeling approach on EHR data. Unlike conventional methods that estimate only a subset of lab tests or classify discrete value ranges, LabTOP performs continuous numerical predictions for a diverse range of lab items. We evaluate LabTOP on three publicly available EHR datasets and demonstrate that it outperforms existing methods, including traditional machine learning models and state-of-the-art large language models. We also conduct extensive ablation studies to confirm the effectiveness of our design choices. We believe that LabTOP will serve as an accurate and generalizable framework for lab test outcome prediction, with potential applications in clinical decision support and early detection of critical conditions.

* 11 pages for main text, 4 pages for appendix

Via

Access Paper or Ask Questions

R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs

Feb 18, 2025

Sumin Jo, Junseong Choi, Jiho Kim, Edward Choi

Figure 1 for R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs

Figure 2 for R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs

Figure 3 for R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs

Figure 4 for R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs

Abstract:Recent studies have combined Large Language Models (LLMs) with Knowledge Graphs (KGs) to enhance reasoning, improving inference accuracy without additional training while mitigating hallucination. However, existing frameworks are often rigid, struggling to adapt to KG or task changes. They also rely heavily on powerful LLMs for reliable (i.e., trustworthy) reasoning. To address this, We introduce R2-KG, a plug-and-play, dual-agent framework that separates reasoning into two roles: an Operator (a low-capacity LLM) that gathers evidence and a Supervisor (a high-capacity LLM) that makes final judgments. This design is cost-efficient for LLM inference while still maintaining strong reasoning accuracy. Additionally, R2-KG employs an Abstention mechanism, generating answers only when sufficient evidence is collected from KG, which significantly enhances reliability. Experiments across multiple KG-based reasoning tasks show that R2-KG consistently outperforms baselines in both accuracy and reliability, regardless of the inherent capability of LLMs used as the Operator. Further experiments reveal that the single-agent version of R2-KG, equipped with a strict self-consistency strategy, achieves significantly higher-than-baseline reliability while reducing inference cost. However, it also leads to a higher abstention rate in complex KGs. Our findings establish R2-KG as a flexible and cost-effective solution for KG-based reasoning. It reduces reliance on high-capacity LLMs while ensuring trustworthy inference.

Via

Access Paper or Ask Questions

Ensembling Large Language Models with Process Reward-Guided Tree Search for Better Complex Reasoning

Dec 20, 2024

Sungjin Park, Xiao Liu, Yeyun Gong, Edward Choi

Figure 1 for Ensembling Large Language Models with Process Reward-Guided Tree Search for Better Complex Reasoning

Figure 2 for Ensembling Large Language Models with Process Reward-Guided Tree Search for Better Complex Reasoning

Figure 3 for Ensembling Large Language Models with Process Reward-Guided Tree Search for Better Complex Reasoning

Figure 4 for Ensembling Large Language Models with Process Reward-Guided Tree Search for Better Complex Reasoning

Abstract:Despite recent advances in large language models, open-source models often struggle to consistently perform well on complex reasoning tasks. Existing ensemble methods, whether applied at the token or output levels, fail to address these challenges. In response, we present Language model Ensemble with Monte Carlo Tree Search (LE-MCTS), a novel framework for process-level ensembling of language models. LE-MCTS formulates step-by-step reasoning with an ensemble of language models as a Markov decision process. In this framework, states represent intermediate reasoning paths, while actions consist of generating the next reasoning step using one of the language models selected from a predefined pool. Guided by a process-based reward model, LE-MCTS performs a tree search over the reasoning steps generated by different language models, identifying the most accurate reasoning chain. Experimental results on five mathematical reasoning benchmarks demonstrate that our approach outperforms both single language model decoding algorithms and language model ensemble methods. Notably, LE-MCTS improves performance by 3.6% and 4.3% on the MATH and MQA datasets, respectively, highlighting its effectiveness in solving complex reasoning problems.

Via

Access Paper or Ask Questions

Forecasting Future International Events: A Reliable Dataset for Text-Based Event Modeling

Nov 21, 2024

Daehoon Gwak, Junwoo Park, Minho Park, Chaehun Park, Hyunchan Lee, Edward Choi, Jaegul Choo

Figure 1 for Forecasting Future International Events: A Reliable Dataset for Text-Based Event Modeling

Figure 2 for Forecasting Future International Events: A Reliable Dataset for Text-Based Event Modeling

Figure 3 for Forecasting Future International Events: A Reliable Dataset for Text-Based Event Modeling

Figure 4 for Forecasting Future International Events: A Reliable Dataset for Text-Based Event Modeling

Abstract:Predicting future international events from textual information, such as news articles, has tremendous potential for applications in global policy, strategic decision-making, and geopolitics. However, existing datasets available for this task are often limited in quality, hindering the progress of related research. In this paper, we introduce WORLDREP (WORLD Relationship and Event Prediction), a novel dataset designed to address these limitations by leveraging the advanced reasoning capabilities of large-language models (LLMs). Our dataset features high-quality scoring labels generated through advanced prompt modeling and rigorously validated by domain experts in political science. We showcase the quality and utility of WORLDREP for real-world event prediction tasks, demonstrating its effectiveness through extensive experiments and analysis. Furthermore, we publicly release our dataset along with the full automation source code for data collection, labeling, and benchmarking, aiming to support and advance research in text-based event prediction.

* EMNLP 2024 Findings

Via

Access Paper or Ask Questions

Single Ground Truth Is Not Enough: Add Linguistic Variability to Aspect-based Sentiment Analysis Evaluation

Oct 13, 2024

Soyoung Yang, Hojun Cho, Jiyoung Lee, Sohee Yoon, Edward Choi, Jaegul Choo, Won Ik Cho

Figure 1 for Single Ground Truth Is Not Enough: Add Linguistic Variability to Aspect-based Sentiment Analysis Evaluation

Figure 2 for Single Ground Truth Is Not Enough: Add Linguistic Variability to Aspect-based Sentiment Analysis Evaluation

Figure 3 for Single Ground Truth Is Not Enough: Add Linguistic Variability to Aspect-based Sentiment Analysis Evaluation

Figure 4 for Single Ground Truth Is Not Enough: Add Linguistic Variability to Aspect-based Sentiment Analysis Evaluation

Abstract:Aspect-based sentiment analysis (ABSA) is the challenging task of extracting sentiment along with its corresponding aspects and opinions from human language. Due to the inherent variability of natural language, aspect and opinion terms can be expressed in various surface forms, making their accurate identification complex. Current evaluation methods for this task often restrict answers to a single ground truth, penalizing semantically equivalent predictions that differ in surface form. To address this limitation, we propose a novel, fully automated pipeline that augments existing test sets with alternative valid responses for aspect and opinion terms. This approach enables a fairer assessment of language models by accommodating linguistic diversity, resulting in higher human agreement than single-answer test sets (up to 10%p improvement in Kendall's Tau score). Our experimental results demonstrate that Large Language Models (LLMs) show substantial performance improvements over T5 models when evaluated using our augmented test set, suggesting that LLMs' capabilities in ABSA tasks may have been underestimated. This work contributes to a more comprehensive evaluation framework for ABSA, potentially leading to more accurate assessments of model performance in information extraction tasks, particularly those involving span extraction.

* Preprint

Via

Access Paper or Ask Questions

Towards Predicting Temporal Changes in a Patient's Chest X-ray Images based on Electronic Health Records

Sep 11, 2024

Daeun Kyung, Junu Kim, Tackeun Kim, Edward Choi

Figure 1 for Towards Predicting Temporal Changes in a Patient's Chest X-ray Images based on Electronic Health Records

Figure 2 for Towards Predicting Temporal Changes in a Patient's Chest X-ray Images based on Electronic Health Records

Figure 3 for Towards Predicting Temporal Changes in a Patient's Chest X-ray Images based on Electronic Health Records

Figure 4 for Towards Predicting Temporal Changes in a Patient's Chest X-ray Images based on Electronic Health Records

Abstract:Chest X-ray imaging (CXR) is an important diagnostic tool used in hospitals to assess patient conditions and monitor changes over time. Generative models, specifically diffusion-based models, have shown promise in generating realistic synthetic X-rays. However, these models mainly focus on conditional generation using single-time-point data, i.e., typically CXRs taken at a specific time with their corresponding reports, limiting their clinical utility, particularly for capturing temporal changes. To address this limitation, we propose a novel framework, EHRXDiff, which predicts future CXR images by integrating previous CXRs with subsequent medical events, e.g., prescriptions, lab measures, etc. Our framework dynamically tracks and predicts disease progression based on a latent diffusion model, conditioned on the previous CXR image and a history of medical events. We comprehensively evaluate the performance of our framework across three key aspects, including clinical consistency, demographic consistency, and visual realism. We demonstrate that our framework generates high-quality, realistic future images that capture potential temporal changes, suggesting its potential for further development as a clinical simulation tool. This could offer valuable insights for patient monitoring and treatment planning in the medical field.

Via

Access Paper or Ask Questions

Time is Not Enough: Time-Frequency based Explanation for Time-Series Black-Box Models

Aug 07, 2024

Hyunseung Chung, Sumin Jo, Yeonsu Kwon, Edward Choi

Figure 1 for Time is Not Enough: Time-Frequency based Explanation for Time-Series Black-Box Models

Figure 2 for Time is Not Enough: Time-Frequency based Explanation for Time-Series Black-Box Models

Figure 3 for Time is Not Enough: Time-Frequency based Explanation for Time-Series Black-Box Models

Figure 4 for Time is Not Enough: Time-Frequency based Explanation for Time-Series Black-Box Models

Abstract:Despite the massive attention given to time-series explanations due to their extensive applications, a notable limitation in existing approaches is their primary reliance on the time-domain. This overlooks the inherent characteristic of time-series data containing both time and frequency features. In this work, we present Spectral eXplanation (SpectralX), an XAI framework that provides time-frequency explanations for time-series black-box classifiers. This easily adaptable framework enables users to "plug-in" various perturbation-based XAI methods for any pre-trained time-series classification models to assess their impact on the explanation quality without having to modify the framework architecture. Additionally, we introduce Feature Importance Approximations (FIA), a new perturbation-based XAI method. These methods consist of feature insertion, deletion, and combination techniques to enhance computational efficiency and class-specific explanations in time-series classification tasks. We conduct extensive experiments in the generated synthetic dataset and various UCR Time-Series datasets to first compare the explanation performance of FIA and other existing perturbation-based XAI methods in both time-domain and time-frequency domain, and then show the superiority of our FIA in the time-frequency domain with the SpectralX framework. Finally, we conduct a user study to confirm the practicality of our FIA in SpectralX framework for class-specific time-frequency based time-series explanations. The source code is available in https://github.com/gustmd0121/Time_is_not_Enough

* Accepted to CIKM 2024 (10 pages, 4 figures, 6 tables)

Via

Access Paper or Ask Questions