Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yi Hong

Sky-Ear: An Unmanned Aerial Vehicle-Enabled Victim Sound Detection and Localization System

Apr 14, 2026

Yi Hong, Mingyang Wang, Yalin Liu, Yaru Fu, Kevin Hung

Abstract:Unmanned Aerial Vehicles (UAVs) are increasingly deployed in search-and-rescue (SAR) missions, yet continuous and reliable victim detection and localization remain challenging due to on-board hardware constraints. This paper designs an UAV-Enabled Victim Sound Detection and Localization System (called ``Sky-Ear'' for brevity) to achieve energy-efficient acoustic sensing and sound detection for SAR. Based on a circular-shaped microphone array, two-stage (Sentinel and Responder) audio processing is developed for energy-consuming and highly reliable sound detection. A Masking autoencoder (MAE)-based sound detection method is designed in the Sentinel stage to analyze frequency-time acoustic features. For improved precision, a continuous localization method is designed by optimizing detected directions from multiple observations. Extensive simulation experiments are conducted to validate the system's performance in terms of victim detection accuracy and localization error.

Via

Access Paper or Ask Questions

AD-Reasoning: Multimodal Guideline-Guided Reasoning for Alzheimer's Disease Diagnosis

Mar 25, 2026

Qiuhui Chen, Yushan Deng, Xuancheng Yao, Yi Hong

Abstract:Alzheimer's disease (AD) diagnosis requires integrating neuroimaging with heterogeneous clinical evidence and reasoning under established criteria, yet most multimodal models remain opaque and weakly guideline-aligned. We present AD-Reasoning, a multimodal framework that couples structural MRI with six clinical modalities and a rule-based verifier to generate structured, NIA-AA-consistent diagnoses. AD-Reasoning combines modality-specific encoders, bidirectional cross-attention fusion, and reinforcement fine-tuning with verifiable rewards that enforce output format, guideline evidence coverage, and reasoning--decision consistency. We also release AD-MultiSense, a 10,378-visit multimodal QA dataset with guideline-validated rationales built from ADNI/AIBL. On AD-MultiSense, AD-Reasoning achieves state-of-the-art diagnostic accuracy and produces structured rationales that improve transparency over recent baselines, while providing transparent rationales.

* ICME 2026

Via

Access Paper or Ask Questions

EMAD: Evidence-Centric Grounded Multimodal Diagnosis for Alzheimer's Disease

Feb 22, 2026

Qiuhui Chen, Xuancheng Yao, Zhenglei Zhou, Xinyue Hu, Yi Hong

Abstract:Deep learning models for medical image analysis often act as black boxes, seldom aligning with clinical guidelines or explicitly linking decisions to supporting evidence. This is especially critical in Alzheimer's disease (AD), where predictions should be grounded in both anatomical and clinical findings. We present EMAD, a vision-language framework that generates structured AD diagnostic reports in which each claim is explicitly grounded in multimodal evidence. EMAD uses a hierarchical Sentence-Evidence-Anatomy (SEA) grounding mechanism: (i) sentence-to-evidence grounding links generated sentences to clinical evidence phrases, and (ii) evidence-to-anatomy grounding localizes corresponding structures on 3D brain MRI. To reduce dense annotation requirements, we propose GTX-Distill, which transfers grounding behavior from a teacher trained with limited supervision to a student operating on model-generated reports. We further introduce Executable-Rule GRPO, a reinforcement fine-tuning scheme with verifiable rewards that enforces clinical consistency, protocol adherence, and reasoning-diagnosis coherence. On the AD-MultiSense dataset, EMAD achieves state-of-the-art diagnostic accuracy and produces more transparent, anatomically faithful reports than existing methods. We will release code and grounding annotations to support future research in trustworthy medical vision-language models.

* Accepted by CVPR2026

Via

Access Paper or Ask Questions

Harmonizing Generalization and Specialization: Uncertainty-Informed Collaborative Learning for Semi-supervised Medical Image Segmentation

Dec 15, 2025

Wenjing Lu, Yi Hong, Yang Yang

Abstract:Vision foundation models have demonstrated strong generalization in medical image segmentation by leveraging large-scale, heterogeneous pretraining. However, they often struggle to generalize to specialized clinical tasks under limited annotations or rare pathological variations, due to a mismatch between general priors and task-specific requirements. To address this, we propose Uncertainty-informed Collaborative Learning (UnCoL), a dual-teacher framework that harmonizes generalization and specialization in semi-supervised medical image segmentation. Specifically, UnCoL distills both visual and semantic representations from a frozen foundation model to transfer general knowledge, while concurrently maintaining a progressively adapting teacher to capture fine-grained and task-specific representations. To balance guidance from both teachers, pseudo-label learning in UnCoL is adaptively regulated by predictive uncertainty, which selectively suppresses unreliable supervision and stabilizes learning in ambiguous regions. Experiments on diverse 2D and 3D segmentation benchmarks show that UnCoL consistently outperforms state-of-the-art semi-supervised methods and foundation model baselines. Moreover, our model delivers near fully supervised performance with markedly reduced annotation requirements.

* This work has been submitted to the IEEE TMI for possible publication

Via

Access Paper or Ask Questions

Enhancing 3D Medical Image Understanding with Pretraining Aided by 2D Multimodal Large Language Models

Sep 11, 2025

Qiuhui Chen, Xuancheng Yao, Huping Ye, Yi Hong

Abstract:Understanding 3D medical image volumes is critical in the medical field, yet existing 3D medical convolution and transformer-based self-supervised learning (SSL) methods often lack deep semantic comprehension. Recent advancements in multimodal large language models (MLLMs) provide a promising approach to enhance image understanding through text descriptions. To leverage these 2D MLLMs for improved 3D medical image understanding, we propose Med3DInsight, a novel pretraining framework that integrates 3D image encoders with 2D MLLMs via a specially designed plane-slice-aware transformer module. Additionally, our model employs a partial optimal transport based alignment, demonstrating greater tolerance to noise introduced by potential noises in LLM-generated content. Med3DInsight introduces a new paradigm for scalable multimodal 3D medical representation learning without requiring human annotations. Extensive experiments demonstrate our state-of-the-art performance on two downstream tasks, i.e., segmentation and classification, across various public datasets with CT and MRI modalities, outperforming current SSL methods. Med3DInsight can be seamlessly integrated into existing 3D medical image understanding networks, potentially enhancing their performance. Our source code, generated datasets, and pre-trained models will be available at https://github.com/Qybc/Med3DInsight.

* Accepted by IEEE Journal of Biomedical and Health Informatics (JBHI)

Via

Access Paper or Ask Questions

HoloDx: Knowledge- and Data-Driven Multimodal Diagnosis of Alzheimer's Disease

Apr 27, 2025

Qiuhui Chen, Jintao Wang, Gang Wang, Yi Hong

Figure 1 for HoloDx: Knowledge- and Data-Driven Multimodal Diagnosis of Alzheimer's Disease

Figure 2 for HoloDx: Knowledge- and Data-Driven Multimodal Diagnosis of Alzheimer's Disease

Figure 3 for HoloDx: Knowledge- and Data-Driven Multimodal Diagnosis of Alzheimer's Disease

Figure 4 for HoloDx: Knowledge- and Data-Driven Multimodal Diagnosis of Alzheimer's Disease

Abstract:Accurate diagnosis of Alzheimer's disease (AD) requires effectively integrating multimodal data and clinical expertise. However, existing methods often struggle to fully utilize multimodal information and lack structured mechanisms to incorporate dynamic domain knowledge. To address these limitations, we propose HoloDx, a knowledge- and data-driven framework that enhances AD diagnosis by aligning domain knowledge with multimodal clinical data. HoloDx incorporates a knowledge injection module with a knowledge-aware gated cross-attention, allowing the model to dynamically integrate domain-specific insights from both large language models (LLMs) and clinical expertise. Also, a memory injection module with a designed prototypical memory attention enables the model to retain and retrieve subject-specific information, ensuring consistency in decision-making. By jointly leveraging these mechanisms, HoloDx enhances interpretability, improves robustness, and effectively aligns prior knowledge with current subject data. Evaluations on five AD datasets demonstrate that HoloDx outperforms state-of-the-art methods, achieving superior diagnostic accuracy and strong generalization across diverse cohorts. The source code will be released upon publication acceptance.

Via

Access Paper or Ask Questions

An Event-centric Framework for Predicting Crime Hotspots with Flexible Time Intervals

Nov 02, 2024

Jiahui Jin, Yi Hong, Guandong Xu, Jinghui Zhang, Jun Tang, Hancheng Wang

Figure 1 for An Event-centric Framework for Predicting Crime Hotspots with Flexible Time Intervals

Figure 2 for An Event-centric Framework for Predicting Crime Hotspots with Flexible Time Intervals

Figure 3 for An Event-centric Framework for Predicting Crime Hotspots with Flexible Time Intervals

Figure 4 for An Event-centric Framework for Predicting Crime Hotspots with Flexible Time Intervals

Abstract:Predicting crime hotspots in a city is a complex and critical task with significant societal implications. Numerous spatiotemporal correlations and irregularities pose substantial challenges to this endeavor. Existing methods commonly employ fixed-time granularities and sequence prediction models. However, determining appropriate time granularities is difficult, leading to inaccurate predictions for specific time windows. For example, users might ask: What are the crime hotspots during 12:00-20:00? To address this issue, we introduce FlexiCrime, a novel event-centric framework for predicting crime hotspots with flexible time intervals. FlexiCrime incorporates a continuous-time attention network to capture correlations between crime events, which learns crime context features, representing general crime patterns across time points and locations. Furthermore, we introduce a type-aware spatiotemporal point process that learns crime-evolving features, measuring the risk of specific crime types at a given time and location by considering the frequency of past crime events. The crime context and evolving features together allow us to predict whether an urban area is a crime hotspot given a future time interval. To evaluate FlexiCrime's effectiveness, we conducted experiments using real-world datasets from two cities, covering twelve crime types. The results show that our model outperforms baseline techniques in predicting crime hotspots over flexible time intervals.

* 21 pages, 12 figures

Via

Access Paper or Ask Questions

NODER: Image Sequence Regression Based on Neural Ordinary Differential Equations

Jul 18, 2024

Hao Bai, Yi Hong

Figure 1 for NODER: Image Sequence Regression Based on Neural Ordinary Differential Equations

Figure 2 for NODER: Image Sequence Regression Based on Neural Ordinary Differential Equations

Figure 3 for NODER: Image Sequence Regression Based on Neural Ordinary Differential Equations

Figure 4 for NODER: Image Sequence Regression Based on Neural Ordinary Differential Equations

Abstract:Regression on medical image sequences can capture temporal image pattern changes and predict images at missing or future time points. However, existing geodesic regression methods limit their regression performance by a strong underlying assumption of linear dynamics, while diffusion-based methods have high computational costs and lack constraints to preserve image topology. In this paper, we propose an optimization-based new framework called NODER, which leverages neural ordinary differential equations to capture complex underlying dynamics and reduces its high computational cost of handling high-dimensional image volumes by introducing the latent space. We compare our NODER with two recent regression methods, and the experimental results on ADNI and ACDC datasets demonstrate that our method achieves the state-of-the-art performance in 3D image regression. Our model needs only a couple of images in a sequence for prediction, which is practical, especially for clinical situations where extremely limited image time series are available for analysis. Our source code is available at https://github.com/ZedKing12138/NODER-pytorch.

* MICCAI2024

Via

Access Paper or Ask Questions

OTFS Channel Estimation and Detection for Channels with Very Large Delay Spread

Apr 12, 2024

Preety Priya, Yi Hong, Emanuele Viterbo

Abstract:In low latency applications and in general, for overspread channels, channel delay spread is a large percentage of the transmission frame duration. In this paper, we consider OTFS in an overspread channel exhibiting a delay spread that exceeds the block duration in a frame, where traditional channel estimation (CE) fails. We propose a two-stage CE method based on a delay-Doppler (DD) training frame, consisting of a dual chirp converted from time domain and a higher power pilot. The first stage employs a DD domain embedded pilot CE to estimate the aliased delays (due to modulo operation) and Doppler shifts, followed by identifying all the underspread paths not coinciding with any overspread path. The second stage utilizes time domain dual chirp correlation to estimate the actual delays and Doppler shifts of the remaining paths. This stage also resolves ambiguity in estimating delays and Doppler shifts for paths sharing same aliased delay. Furthermore, we present a modified low-complexity maximum ratio combining (MRC) detection algorithm for OTFS in overspread channels. Finally, we evaluate performance of OTFS using the proposed CE and the modified MRC detection in terms of normalized mean square error (NMSE) and bit error rate (BER).

Via

Access Paper or Ask Questions

Med3DInsight: Enhancing 3D Medical Image Understanding with 2D Multi-Modal Large Language Models

Mar 08, 2024

Qiuhui Chen, Huping Ye, Yi Hong

Figure 1 for Med3DInsight: Enhancing 3D Medical Image Understanding with 2D Multi-Modal Large Language Models

Figure 2 for Med3DInsight: Enhancing 3D Medical Image Understanding with 2D Multi-Modal Large Language Models

Figure 3 for Med3DInsight: Enhancing 3D Medical Image Understanding with 2D Multi-Modal Large Language Models

Figure 4 for Med3DInsight: Enhancing 3D Medical Image Understanding with 2D Multi-Modal Large Language Models

Abstract:Understanding 3D medical image volumes is a critical task in the medical domain. However, existing 3D convolution and transformer-based methods have limited semantic understanding of an image volume and also need a large set of volumes for training. Recent advances in multi-modal large language models (MLLMs) provide a new and promising way to understand images with the help of text descriptions. However, most current MLLMs are designed for 2D natural images. To enhance the 3D medical image understanding with 2D MLLMs, we propose a novel pre-training framework called Med3DInsight, which marries existing 3D image encoders with 2D MLLMs and bridges them via a designed Plane-Slice-Aware Transformer (PSAT) module. Extensive experiments demonstrate our SOTA performance on two downstream segmentation and classification tasks, including three public datasets with CT and MRI modalities and comparison to more than ten baselines. Med3DInsight can be easily integrated into any current 3D medical image understanding network and improves its performance by a good margin.

Via

Access Paper or Ask Questions