Ting Chen

YARD: Y-Architecture Register Decoding for Efficient Hallucination Mitigation in Large Vision-Language Models

May 29, 2026

Ting Chen, Geng Li, Guohao Chen, Yu Hu, Guan Huang, Mai Chen, Langsheng Lei, Jun Du

Abstract: Contrastive decoding (CD) seeks to mitigate hallucinations in Large Vision-Language Models (LVLMs) by contrasting the output distributions of a standard model and a visually degraded model. However, existing training-free CD methods suffer from sub-optimal degraded branches: completely dropping visual tokens is too extreme and induces language hallucinations, while corrupting input images offers coarse control over visual evidence and suffers from high inference latency due to requiring two full forward passes. To address these dilemmas, we propose YARD, a training-free Y-Architecture Register Decoding framework. Motivated by the observation that reliable text-to-vision grounding predominantly emerges in the middle decoder layers, YARD constructs the degraded branch internally by sharing shallow-layer computations and branching exactly at this critical stage. For the degraded branch, YARD replaces patch-level visual tokens with register tokens, which preserve global image semantics but lack fine-grained local evidence. This image-aware yet locally under-grounded design provides a faithful contrastive signal without extreme modality mismatch, while the Y-architecture strictly avoids a costly second forward pass. Extensive experiments on generative and discriminative hallucination benchmarks demonstrate that YARD consistently achieves state-of-the-art hallucination mitigation across multiple LVLMs, alongside a significant reduction in inference latency.

* 21 pages, 11 figures
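
As a rough illustration of the contrastive-decoding idea that the YARD abstract builds on, the sketch below scores each candidate token by contrasting log-probabilities from a full branch and a degraded branch. It uses a generic formulation with a plausibility cutoff, not YARD's Y-architecture or register tokens; the function names, `alpha`, `beta`, and the toy logits are assumptions made for illustration only.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def contrastive_decode_step(full_logits, degraded_logits, alpha=1.0, beta=0.1):
    """Pick one token index with a generic contrastive-decoding rule.

    full_logits:     next-token logits from the standard branch
    degraded_logits: next-token logits from the visually degraded branch
    alpha:           contrast strength (hypothetical default)
    beta:            plausibility cutoff relative to the best full-branch token
    """
    lp_full = log_softmax(full_logits)
    lp_deg = log_softmax(degraded_logits)

    # Only tokens the full branch already finds plausible may be selected.
    cutoff = max(lp_full) + math.log(beta)

    best_index, best_score = None, -math.inf
    for i, (lf, ld) in enumerate(zip(lp_full, lp_deg)):
        if lf < cutoff:
            continue
        # Reward tokens the full branch prefers more than the degraded branch.
        score = (1 + alpha) * lf - alpha * ld
        if score > best_score:
            best_index, best_score = i, score
    return best_index


# Toy example: token 0 is likely under both branches (a language prior),
# token 1 is likely only when fine-grained visual evidence is available.
full = [2.0, 1.9, -5.0]
degraded = [2.0, 0.0, 3.0]
chosen = contrastive_decode_step(full, degraded)
```

With `alpha=0` the rule reduces to greedy decoding on the full branch; a positive `alpha` shifts the choice toward tokens whose probability drops when visual detail is removed.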

OccamToken: Efficient VLM Inference with Training-Free and Budget-Adaptive Token Pruning

May 28, 2026

Geng Li, Guohao Chen, Ting Chen, Shilin Shan, Kuangji Zuo, Bofan Lyu, Tuo An, Gen Li, Jianfei Yang

Abstract: Vision-language models (VLMs) rely on long visual token sequences for visual understanding, making the prefill stage expensive in both computation and memory. Most existing pruning methods follow an absolute-ranking paradigm, assigning importance scores to visual tokens and retaining a fixed top-K subset. In this work, we argue that this paradigm is fundamentally brittle: attention sinks distort token importance rankings, while image redundancy and query-dependent visual evidence make fixed token budgets unreliable across inputs. We propose OccamToken, a training-free framework that replaces absolute token ranking with register-anchored relative evidence testing. Instead of asking which tokens are globally important, OccamToken evaluates whether a visual token provides information beyond a register-based reference. Our key insight is that register tokens naturally absorb low-information attention patterns, making them a stable reference for identifying genuinely informative visual evidence. Based on this principle, OccamToken performs both image-adaptive redundancy pruning and query-adaptive relevance pruning through dynamic thresholds derived from register attention. Across LLaVA-NeXT, LLaVA-v1.5, and Qwen3-VL, OccamToken consistently improves the accuracy-efficiency trade-off without additional training. Notably, on LLaVA-NeXT, it reduces 2,880 visual tokens to approximately 40 while preserving over 93% of the original accuracy, enabling stable visual token compression even in the extreme 1.4% retention regime.

* 26 pages, 8 figures
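
To make the contrast between fixed top-K pruning and a register-derived threshold concrete, here is a minimal sketch. It is not OccamToken's actual procedure: the thresholding rule (keep a patch token only if its score exceeds the strongest register score), the function names, the `margin` parameter, and the toy scores are all assumptions. The last helper simply checks the retention arithmetic quoted in the abstract.

```python
def prune_by_register_reference(patch_scores, register_scores, margin=1.0):
    """Return indices of patch tokens whose score beats a register-based reference.

    patch_scores:    per-patch attention scores (hypothetical inputs)
    register_scores: attention scores received by register tokens
    margin:          multiplier on the reference threshold (hypothetical)
    """
    if not register_scores:
        raise ValueError("at least one register score is required")
    # The threshold adapts to each input instead of using a fixed budget K.
    threshold = margin * max(register_scores)
    return [i for i, s in enumerate(patch_scores) if s > threshold]


def retention_percent(kept, total):
    """Percentage of visual tokens retained after pruning."""
    return 100.0 * kept / total


# An informative image keeps two tokens; a redundant one keeps none.
kept_informative = prune_by_register_reference([0.01, 0.20, 0.02, 0.15], [0.05, 0.03])
kept_redundant = prune_by_register_reference([0.01, 0.02], [0.05, 0.03])

# Retention figure from the abstract: about 40 of 2,880 tokens.
ratio = round(retention_percent(40, 2880), 1)
```

Because the number of surviving tokens depends on the input, two images of the same size can end up with very different budgets, which is the behavior a fixed top-K rule cannot provide.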

ECGFlowCMR: Pretraining with ECG-Generated Cine CMR Improves Cardiac Disease Classification and Phenotype Prediction

Jan 28, 2026

Xiaocheng Fang, Zhengyao Ding, Jieyi Cai, Yujie Xiao, Bo Liu, Jiarui Jin, Haoyu Wang, Guangkun Nie, Shun Huang, Ting Chen (+2 more)

Abstract: Cardiac Magnetic Resonance (CMR) imaging provides a comprehensive assessment of cardiac structure and function but remains constrained by high acquisition costs and reliance on expert annotations, limiting the availability of large-scale labeled datasets. In contrast, electrocardiograms (ECGs) are inexpensive, widely accessible, and offer a promising modality for conditioning the generative synthesis of cine CMR. To this end, we propose ECGFlowCMR, a novel ECG-to-CMR generative framework that integrates a Phase-Aware Masked Autoencoder (PA-MAE) and an Anatomy-Motion Disentangled Flow (AMDF) to address two fundamental challenges: (1) the cross-modal temporal mismatch between multi-beat ECG recordings and single-cycle CMR sequences, and (2) the anatomical observability gap due to the limited structural information inherent in ECGs. Extensive experiments on the UK Biobank and a proprietary clinical dataset demonstrate that ECGFlowCMR can generate realistic cine CMR sequences from ECG inputs, enabling scalable pretraining and improving performance on downstream cardiac disease classification and phenotype prediction tasks.

Data Augmentation in Time Series Forecasting through Inverted Framework

Jul 16, 2025

Hongming Tan, Ting Chen, Ruochong Jin, Wai Kin Chan

Abstract: Currently, iTransformer is one of the most popular and effective models for multivariate time series (MTS) forecasting. Thanks to its inverted framework, iTransformer effectively captures multivariate correlation. However, the inverted framework still has some limitations: it diminishes temporal interdependency information and introduces noise when variable correlation is not significant. To address these limitations, we introduce a novel data augmentation method for the inverted framework, called DAIF. Unlike previous data augmentation methods, DAIF is the first real-time augmentation specifically designed for the inverted framework in MTS forecasting. We first define the structure of the inverted sequence-to-sequence framework, then propose two DAIF strategies, Frequency Filtering and Cross-variation Patching, to address the existing challenges of the inverted framework. Experiments across multiple datasets and inverted models demonstrate the effectiveness of DAIF.

* The paper is under consideration at Pattern Recognition Letters

RareAgents: Autonomous Multi-disciplinary Team for Rare Disease Diagnosis and Treatment

Dec 17, 2024

Xuanzhong Chen, Ye Jin, Xiaohao Mao, Lun Wang, Shuyang Zhang, Ting Chen

Abstract: Rare diseases, despite their low individual incidence, collectively affect around 300 million people worldwide because there are so many of them. The complexity of symptoms and the shortage of specialized doctors with relevant experience make diagnosing and treating rare diseases more challenging than common diseases. Recently, agents powered by large language models (LLMs) have demonstrated notable improvements across various domains. In the medical field, some agent methods have outperformed direct prompts in question-answering tasks from medical exams. However, current agent frameworks lack adaptation for real-world clinical scenarios, especially those involving the intricate demands of rare diseases. To address these challenges, we present RareAgents, the first multi-disciplinary team of LLM-based agents tailored to the complex clinical context of rare diseases. RareAgents integrates advanced planning capabilities, memory mechanisms, and medical tool utilization, leveraging Llama-3.1-8B/70B as the base model. Experimental results show that RareAgents surpasses state-of-the-art domain-specific models, GPT-4o, and existing agent frameworks in both differential diagnosis and medication recommendation for rare diseases. Furthermore, we contribute a novel dataset, MIMIC-IV-Ext-Rare, derived from MIMIC-IV, to support further advancements in this field.

Large-scale cross-modality pretrained model enhances cardiovascular state estimation and cardiomyopathy detection from electrocardiograms: An AI system development and multi-center validation study

Nov 19, 2024

Zhengyao Ding, Yujian Hu, Youyao Xu, Chengchen Zhao, Ziyu Li, Yiheng Mao, Haitao Li, Qian Li, Jing Wang, Yue Chen (+9 more)

Abstract: Cardiovascular diseases (CVDs) present significant challenges for early and accurate diagnosis. While cardiac magnetic resonance imaging (CMR) is the gold standard for assessing cardiac function and diagnosing CVDs, its high cost and technical complexity limit accessibility. In contrast, electrocardiography (ECG) offers promise for large-scale early screening. This study introduces CardiacNets, an innovative model that enhances ECG analysis by leveraging the diagnostic strengths of CMR through cross-modal contrastive learning and generative pretraining. CardiacNets serves two primary functions: (1) it evaluates detailed cardiac function indicators and screens for potential CVDs, including coronary artery disease, cardiomyopathy, pericarditis, heart failure and pulmonary hypertension, using ECG input; and (2) it enhances interpretability by generating high-quality CMR images from ECG data. We train and validate the proposed CardiacNets on two large-scale public datasets (the UK Biobank with 41,519 individuals and the MIMIC-IV-ECG comprising 501,172 samples) as well as three private datasets (FAHZU with 410 individuals, SAHZU with 464 individuals, and QPH with 338 individuals), and the findings demonstrate that CardiacNets consistently outperforms traditional ECG-only models, substantially improving screening accuracy. Furthermore, the generated CMR images provide valuable diagnostic support for physicians of all experience levels. This proof-of-concept study highlights how ECG can facilitate cross-modal insights into cardiac function assessment, paving the way for enhanced CVD screening and diagnosis at a population level.

* 23 pages, 8 figures

All-in-one Weather-degraded Image Restoration via Adaptive Degradation-aware Self-prompting Model

Nov 12, 2024

Yuanbo Wen, Tao Gao, Ziqi Li, Jing Zhang, Kaihao Zhang, Ting Chen

Abstract: Existing approaches for all-in-one weather-degraded image restoration suffer from inefficiencies in leveraging degradation-aware priors, resulting in sub-optimal performance in adapting to different weather conditions. To this end, we develop an adaptive degradation-aware self-prompting model (ADSM) for all-in-one weather-degraded image restoration. Specifically, our model employs the contrastive language-image pre-training model (CLIP) to facilitate the training of our proposed latent prompt generators (LPGs), which represent three types of latent prompts to characterize the degradation type, degradation property and image caption. Moreover, we integrate the acquired degradation-aware prompts into the time embedding of the diffusion model to improve degradation perception. Meanwhile, we employ the latent caption prompt to guide the reverse sampling process using the cross-attention mechanism, thereby supporting accurate image reconstruction. Furthermore, to accelerate the reverse sampling procedure of the diffusion model and address the limitations of frequency perception, we introduce a wavelet-oriented noise estimating network (WNE-Net). Extensive experiments conducted on eight publicly available datasets demonstrate the effectiveness of our proposed approach in both task-specific and all-in-one applications.

VividMed: Vision Language Model with Versatile Visual Grounding for Medicine

Oct 16, 2024

Lingxiao Luo, Bingda Tang, Xuanzhong Chen, Rong Han, Ting Chen

Abstract: Recent advancements in Vision Language Models (VLMs) have demonstrated remarkable promise in generating visually grounded responses. However, their application in the medical domain is hindered by unique challenges. For instance, most VLMs rely on a single method of visual grounding, whereas complex medical tasks demand more versatile approaches. Additionally, while most VLMs process only 2D images, a large portion of medical images are 3D. The lack of medical data further compounds these obstacles. To address these challenges, we present VividMed, a vision language model with versatile visual grounding for medicine. Our model supports generating both semantic segmentation masks and instance-level bounding boxes, and accommodates various imaging modalities, including both 2D and 3D data. We design a three-stage training procedure and an automatic data synthesis pipeline based on open datasets and models. Besides visual grounding tasks, VividMed also excels in other common downstream tasks, including Visual Question Answering (VQA) and report generation. Ablation studies empirically show that the integration of visual grounding ability leads to improved performance on these tasks. Our code is publicly available at https://github.com/function2-llx/MMMM.

Unpaired Photo-realistic Image Deraining with Energy-informed Diffusion Model

Jul 24, 2024

Yuanbo Wen, Tao Gao, Ting Chen

Abstract: Existing unpaired image deraining approaches struggle to accurately capture the distinguishing characteristics between the rainy and clean domains, resulting in residual degradation and color distortion within the reconstructed images. To this end, we propose an energy-informed diffusion model for unpaired photo-realistic image deraining (UPID-EDM). Initially, we delve into the intricate visual-language priors embedded within the contrastive language-image pre-training model (CLIP), and demonstrate that the CLIP priors aid in the discrimination of rainy and clean images. Furthermore, we introduce a dual-consistent energy function (DEF) that retains the rain-irrelevant characteristics while eliminating the rain-relevant features. This energy function is trained on non-corresponding rainy and clean images. In addition, we employ the rain-relevance discarding energy function (RDEF) and the rain-irrelevance preserving energy function (RPEF) to direct the reverse sampling procedure of a pre-trained diffusion model, effectively removing the rain streaks while preserving the image contents. Extensive experiments demonstrate that our energy-informed model surpasses the existing unpaired learning approaches in terms of both supervised and no-reference metrics.

Layered Diffusion Model for One-Shot High Resolution Text-to-Image Synthesis

Jul 08, 2024

Emaad Khwaja, Abdullah Rashwan, Ting Chen, Oliver Wang, Suraj Kothawade, Yeqing Li

Abstract: We present a one-shot text-to-image diffusion model that can generate high-resolution images from natural language descriptions. Our model employs a layered U-Net architecture that simultaneously synthesizes images at multiple resolution scales. We show that this method outperforms the baseline of synthesizing images only at the target resolution, while reducing the computational cost per step. We demonstrate that higher resolution synthesis can be achieved by layering convolutions at additional resolution scales, in contrast to other methods which require additional models for super-resolution synthesis.
