Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Xinhan Di

Attentional Triple-Encoder Network in Spatiospectral Domains for Medical Image Segmentation

Mar 20, 2025

Kristin Qi, Xinhan Di

Abstract:Retinal Optical Coherence Tomography (OCT) segmentation is essential for diagnosing pathology. Traditional methods focus on either spatial or spectral domains, overlooking their combined dependencies. We propose a triple-encoder network that integrates CNNs for spatial features, Fast Fourier Convolution (FFC) for spectral features, and attention mechanisms to capture global relationships across both domains. Attention fusion modules integrate convolution and cross-attention to further enhance features. Our method achieves an average Dice score improvement from 0.855 to 0.864, outperforming prior work.

* IEEE Conference on Artificial Intelligence (IEEE CAI)

Via

Access Paper or Ask Questions

Enhancing Reasoning through Process Supervision with Monte Carlo Tree Search

Jan 02, 2025

Shuangtao Li, Shuaihao Dong, Kexin Luan, Xinhan Di, Chaofan Ding

Figure 1 for Enhancing Reasoning through Process Supervision with Monte Carlo Tree Search

Figure 2 for Enhancing Reasoning through Process Supervision with Monte Carlo Tree Search

Figure 3 for Enhancing Reasoning through Process Supervision with Monte Carlo Tree Search

Abstract:Large language models (LLMs) have demonstrated their remarkable capacity across a variety of tasks. However, reasoning remains a challenge for LLMs. To improve LLMs' reasoning ability, process supervision has proven to be better than outcome supervision. In this work, we study using Monte Carlo Tree Search (MCTS) to generate process supervision data with LLMs themselves for training them. We sample reasoning steps with an LLM and assign each step a score that captures its "relative correctness," and the LLM is then trained by minimizing weighted log-likelihood of generating the reasoning steps. This generate-then-train process is repeated iteratively until convergence.Our experimental results demonstrate that the proposed methods considerably improve the performance of LLMs on two mathematical reasoning datasets. Furthermore, models trained on one dataset also exhibit improved performance on the other, showing the transferability of the enhanced reasoning ability.

* 5 pages, 1 figure, 2 tables accepted by aaai 2025 NeurMAD workshop

Via

Access Paper or Ask Questions

Towards Intrinsic Self-Correction Enhancement in Monte Carlo Tree Search Boosted Reasoning via Iterative Preference Learning

Dec 23, 2024

Huchen Jiang, Yangyang Ma, Chaofan Ding, Kexin Luan, Xinhan Di

Figure 1 for Towards Intrinsic Self-Correction Enhancement in Monte Carlo Tree Search Boosted Reasoning via Iterative Preference Learning

Figure 2 for Towards Intrinsic Self-Correction Enhancement in Monte Carlo Tree Search Boosted Reasoning via Iterative Preference Learning

Figure 3 for Towards Intrinsic Self-Correction Enhancement in Monte Carlo Tree Search Boosted Reasoning via Iterative Preference Learning

Figure 4 for Towards Intrinsic Self-Correction Enhancement in Monte Carlo Tree Search Boosted Reasoning via Iterative Preference Learning

Abstract:With current state-of-the-art approaches aimed at enhancing the reasoning capabilities of Large Language Models(LLMs) through iterative preference learning inspired by AlphaZero, we propose to further enhance the step-wise reasoning capabilities through intrinsic self-correction to some extent. Our work leverages step-wise preference learning to enhance self-verification via reinforcement learning. We initially conduct our work through a two-stage training procedure. At the first stage, the self-correction reasoning ability of an LLM is enhanced through its own predictions, relying entirely on self-generated data within the intrinsic self-correction to some extent. At the second stage, the baseline step-wise preference learning is leveraged via the application of the enhanced self-correct policy achieved at the first stage. In the evaluation of arithmetic reasoning tasks, our approach outperforms OpenMath2-Llama3.1-8B, dart-math-mistral-7b-uniform on MATH with increases in accuracy to 71.34%(+4.18%) and 48.06%(+4.94%) and LLama-3.1-8B-Instruct, Mistral-7B-Instruct-v0.1 on GSM8K with increases in accuracy to 86.76%(+2.00%) and 38.06%(+2.28%).

* 6 Pages,3 figures, accepted by AAAI 2025 Workshop NeurMAD

Via

Access Paper or Ask Questions

Multiple Consistency-guided Test-Time Adaptation for Contrastive Audio-Language Models with Unlabeled Audio

Dec 23, 2024

Gongyu Chen, Haomin Zhang, Chaofan Ding, Zihao Chen, Xinhan Di

Figure 1 for Multiple Consistency-guided Test-Time Adaptation for Contrastive Audio-Language Models with Unlabeled Audio

Figure 2 for Multiple Consistency-guided Test-Time Adaptation for Contrastive Audio-Language Models with Unlabeled Audio

Figure 3 for Multiple Consistency-guided Test-Time Adaptation for Contrastive Audio-Language Models with Unlabeled Audio

Figure 4 for Multiple Consistency-guided Test-Time Adaptation for Contrastive Audio-Language Models with Unlabeled Audio

Abstract:One fascinating aspect of pre-trained Audio-Language Models (ALMs) learning is their impressive zero-shot generalization capability and test-time adaptation (TTA) methods aiming to improve domain performance without annotations. However, previous test time adaptation (TTA) methods for ALMs in zero-shot classification tend to be stuck in incorrect model predictions. In order to further boost the performance, we propose multiple guidance on prompt learning without annotated labels. First, guidance of consistency on both context tokens and domain tokens of ALMs is set. Second, guidance of both consistency across multiple augmented views of each single test sample and contrastive learning across different test samples is set. Third, we propose a corresponding end-end learning framework for the proposed test-time adaptation method without annotated labels. We extensively evaluate our approach on 12 downstream tasks across domains, our proposed adaptation method leads to 4.41% (max 7.50%) average zero-shot performance improvement in comparison with the state-of-the-art models.

* 6 pages, 1 figure, accepted by ICASSP 2025

Via

Access Paper or Ask Questions

Low-Rank Adaptation with Task-Relevant Feature Enhancement for Fine-tuning Language Models

Dec 13, 2024

Changqun Li, Chaofan Ding, Kexin Luan, Xinhan Di

Figure 1 for Low-Rank Adaptation with Task-Relevant Feature Enhancement for Fine-tuning Language Models

Figure 2 for Low-Rank Adaptation with Task-Relevant Feature Enhancement for Fine-tuning Language Models

Figure 3 for Low-Rank Adaptation with Task-Relevant Feature Enhancement for Fine-tuning Language Models

Figure 4 for Low-Rank Adaptation with Task-Relevant Feature Enhancement for Fine-tuning Language Models

Abstract:Fine-tuning pre-trained large language models in a parameter-efficient manner is widely studied for its effectiveness and efficiency. LoRA is one of the most widely used methods, which assumes that the optimization process is essentially low dimensional. Although LoRA has demonstrated commendable performance, there remains a significant performance gap between LoRA and full fine-tuning when learning new tasks. In this work, we propose Low-Rank Adaptation with Task-Relevant Feature Enhancement(LoRATRF) for enhancing task-relevant features from the perspective of editing neural network representations. To prioritize task-relevant features, a task-aware filter that selectively extracts valuable knowledge from hidden representations for the target or current task is designed. As the experiments on a vareity of datasets including NLU, commonsense reasoning and mathematical reasoning tasks demonstrates, our method reduces 33.71% parameters and achieves better performance on a variety of datasets in comparison with SOTA low-rank methods.

* 6 Pages, 3 figures accepted by AAAI 2025 CoLoRAI - Connecting Low-Rank Representations in AI Workshop

Via

Access Paper or Ask Questions

YingSound: Video-Guided Sound Effects Generation with Multi-modal Chain-of-Thought Controls

Dec 12, 2024

Zihao Chen, Haomin Zhang, Xinhan Di, Haoyu Wang, Sizhe Shan, Junjie Zheng, Yunming Liang, Yihan Fan, Xinfa Zhu, Wenjie Tian(+3 more)

Figure 1 for YingSound: Video-Guided Sound Effects Generation with Multi-modal Chain-of-Thought Controls

Figure 2 for YingSound: Video-Guided Sound Effects Generation with Multi-modal Chain-of-Thought Controls

Figure 3 for YingSound: Video-Guided Sound Effects Generation with Multi-modal Chain-of-Thought Controls

Figure 4 for YingSound: Video-Guided Sound Effects Generation with Multi-modal Chain-of-Thought Controls

Abstract:Generating sound effects for product-level videos, where only a small amount of labeled data is available for diverse scenes, requires the production of high-quality sounds in few-shot settings. To tackle the challenge of limited labeled data in real-world scenes, we introduce YingSound, a foundation model designed for video-guided sound generation that supports high-quality audio generation in few-shot settings. Specifically, YingSound consists of two major modules. The first module uses a conditional flow matching transformer to achieve effective semantic alignment in sound generation across audio and visual modalities. This module aims to build a learnable audio-visual aggregator (AVA) that integrates high-resolution visual features with corresponding audio features at multiple stages. The second module is developed with a proposed multi-modal visual-audio chain-of-thought (CoT) approach to generate finer sound effects in few-shot settings. Finally, an industry-standard video-to-audio (V2A) dataset that encompasses various real-world scenarios is presented. We show that YingSound effectively generates high-quality synchronized sounds across diverse conditional inputs through automated evaluations and human studies. Project Page: \url{https://giantailab.github.io/yingsound/}

* 16 pages, 4 figures

Via

Access Paper or Ask Questions

Multi-Stage Graph Learning for fMRI Analysis to Diagnose Neuro-Developmental Disorders

Oct 07, 2024

Wenjing Gao, Yuanyuan Yang, Jianrui Wei, Xuntao Yin, Xinhan Di

Figure 1 for Multi-Stage Graph Learning for fMRI Analysis to Diagnose Neuro-Developmental Disorders

Figure 2 for Multi-Stage Graph Learning for fMRI Analysis to Diagnose Neuro-Developmental Disorders

Figure 3 for Multi-Stage Graph Learning for fMRI Analysis to Diagnose Neuro-Developmental Disorders

Figure 4 for Multi-Stage Graph Learning for fMRI Analysis to Diagnose Neuro-Developmental Disorders

Abstract:The insufficient supervision limit the performance of the deep supervised models for brain disease diagnosis. It is important to develop a learning framework that can capture more information in limited data and insufficient supervision. To address these issues at some extend, we propose a multi-stage graph learning framework which incorporates 1) pretrain stage : self-supervised graph learning on insufficient supervision of the fmri data 2) fine-tune stage : supervised graph learning for brain disorder diagnosis. Experiment results on three datasets, Autism Brain Imaging Data Exchange ABIDE I, ABIDE II and ADHD with AAL1,demonstrating the superiority and generalizability of the proposed framework compared to the state of art of models.(ranging from 0.7330 to 0.9321,0.7209 to 0.9021,0.6338 to 0.6699)

* Accepted by CVPR 2024 CV4Science Workshop (8 pages, 4 figures, 2 tables)

Via

Access Paper or Ask Questions

OCC-MLLM:Empowering Multimodal Large Language Model For the Understanding of Occluded Objects

Oct 02, 2024

Wenmo Qiu, Xinhan Di

Abstract:There is a gap in the understanding of occluded objects in existing large-scale visual language multi-modal models. Current state-of-the-art multimodal models fail to provide satisfactory results in describing occluded objects for visual-language multimodal models through universal visual encoders. Another challenge is the limited number of datasets containing image-text pairs with a large number of occluded objects. Therefore, we introduce a novel multimodal model that applies a newly designed visual encoder to understand occluded objects in RGB images. We also introduce a large-scale visual-language pair dataset for training large-scale visual-language multimodal models and understanding occluded objects. We start our experiments comparing with the state-of-the-art models.

* Accepted by CVPR 2024 T4V Workshop (5 pages, 3 figures, 2 tables)

Via

Access Paper or Ask Questions

OCC-MLLM-Alpha:Empowering Multi-modal Large Language Model for the Understanding of Occluded Objects with Self-Supervised Test-Time Learning

Oct 02, 2024

Shuxin Yang, Xinhan Di

Figure 1 for OCC-MLLM-Alpha:Empowering Multi-modal Large Language Model for the Understanding of Occluded Objects with Self-Supervised Test-Time Learning

Figure 2 for OCC-MLLM-Alpha:Empowering Multi-modal Large Language Model for the Understanding of Occluded Objects with Self-Supervised Test-Time Learning

Figure 3 for OCC-MLLM-Alpha:Empowering Multi-modal Large Language Model for the Understanding of Occluded Objects with Self-Supervised Test-Time Learning

Figure 4 for OCC-MLLM-Alpha:Empowering Multi-modal Large Language Model for the Understanding of Occluded Objects with Self-Supervised Test-Time Learning

Abstract:There is a gap in the understanding of occluded objects in existing large-scale visual language multi-modal models. Current state-of-the-art multi-modal models fail to provide satisfactory results in describing occluded objects through universal visual encoders and supervised learning strategies. Therefore, we introduce a multi-modal large language framework and corresponding self-supervised learning strategy with support of 3D generation. We start our experiments comparing with the state-of-the-art models in the evaluation of a large-scale dataset SOMVideo [18]. The initial results demonstrate the improvement of 16.92% in comparison with the state-of-the-art VLM models.

* Accepted by ECCV 2024 Observing and Understanding Hands in Action Workshop (5 pages, 3 figures, 2 tables). arXiv admin note: substantial text overlap with arXiv:2410.01261

Via

Access Paper or Ask Questions

Towards Full-parameter and Parameter-efficient Self-learning For Endoscopic Camera Depth Estimation

Oct 01, 2024

Shuting Zhao, Chenkang Du, Kristin Qi, Xinrong Chen, Xinhan Di

Figure 1 for Towards Full-parameter and Parameter-efficient Self-learning For Endoscopic Camera Depth Estimation

Figure 2 for Towards Full-parameter and Parameter-efficient Self-learning For Endoscopic Camera Depth Estimation

Figure 3 for Towards Full-parameter and Parameter-efficient Self-learning For Endoscopic Camera Depth Estimation

Figure 4 for Towards Full-parameter and Parameter-efficient Self-learning For Endoscopic Camera Depth Estimation

Abstract:Adaptation methods are developed to adapt depth foundation models to endoscopic depth estimation recently. However, such approaches typically under-perform training since they limit the parameter search to a low-rank subspace and alter the training dynamics. Therefore, we propose a full-parameter and parameter-efficient learning framework for endoscopic depth estimation. At the first stage, the subspace of attention, convolution and multi-layer perception are adapted simultaneously within different sub-spaces. At the second stage, a memory-efficient optimization is proposed for subspace composition and the performance is further improved in the united sub-space. Initial experiments on the SCARED dataset demonstrate that results at the first stage improves the performance from 10.2% to 4.1% for Sq Rel, Abs Rel, RMSE and RMSE log in the comparison with the state-of-the-art models.

* WiCV @ ECCV 2024

Via

Access Paper or Ask Questions