Shan Yang

Contrast-Free Myocardial Scar Segmentation in Cine MRI using Motion and Texture Fusion

Jan 09, 2025

Guang Yang, Jingkun Chen, Xicheng Sheng, Shan Yang, Xiahai Zhuang, Betty Raman, Lei Li, Vicente Grau

Abstract: Late gadolinium enhancement MRI (LGE MRI) is the gold standard for detecting myocardial scars after myocardial infarction (MI). LGE MRI requires the injection of a contrast agent, which carries potential side effects and increases scanning time and patient discomfort. To address these issues, we propose a novel framework that combines cardiac motion observed in cine MRI with image texture information to segment the myocardium and scar tissue in the left ventricle. Cardiac motion tracking can be formulated as a full cardiac image cycle registration problem, which can be solved via deep neural networks. Experimental results show that the proposed method can achieve scar segmentation based on non-contrast cine images with accuracy comparable to LGE MRI. This demonstrates its potential as an alternative to contrast-enhanced techniques for scar detection.

* 5 pages, 2 figures, 2 tables

DrawSpeech: Expressive Speech Synthesis Using Prosodic Sketches as Control Conditions

Jan 08, 2025

Weidong Chen, Shan Yang, Guangzhi Li, Xixin Wu

Abstract: Controlling text-to-speech (TTS) systems to synthesize speech with the prosodic characteristics expected by users has attracted much attention. To achieve controllability, current studies focus on two main directions: (1) using reference speech as a prosody prompt to guide speech synthesis, and (2) using natural language descriptions to control the generation process. However, finding reference speech that exactly contains the prosody that users want to synthesize takes a lot of effort. Description-based guidance in TTS systems can only determine the overall prosody, which makes fine-grained prosody control over the synthesized speech difficult. In this paper, we propose DrawSpeech, a sketch-conditioned diffusion model capable of generating speech based on any prosody sketches drawn by users. Specifically, the prosody sketches are fed to DrawSpeech to provide a rough indication of the expected prosody trends. DrawSpeech then recovers the detailed pitch and energy contours based on the coarse sketches and synthesizes the desired speech. Experimental results show that DrawSpeech can generate speech with a wide variety of prosody and can precisely control the fine-grained prosody in a user-friendly manner. Our implementation and audio samples are publicly available.

* Accepted by ICASSP 2025

FleSpeech: Flexibly Controllable Speech Generation with Various Prompts

Jan 08, 2025

Hanzhao Li, Yuke Li, Xinsheng Wang, Jingbin Hu, Qicong Xie, Shan Yang, Lei Xie

Abstract: Controllable speech generation methods typically rely on single or fixed prompts, hindering creativity and flexibility. These limitations make it difficult to meet specific user needs in certain scenarios, such as adjusting the style while preserving a selected speaker's timbre, or choosing a style and generating a voice that matches a character's visual appearance. To overcome these challenges, we propose FleSpeech, a novel multi-stage speech generation framework that allows for more flexible manipulation of speech attributes by integrating various forms of control. FleSpeech employs a multimodal prompt encoder that processes and unifies different text, audio, and visual prompts into a cohesive representation. This approach enhances the adaptability of speech synthesis and supports creative and precise control over the generated speech. Additionally, we develop a data collection pipeline for multimodal datasets to facilitate further research and applications in this field. Comprehensive subjective and objective experiments demonstrate the effectiveness of FleSpeech. Audio samples are available at https://kkksuper.github.io/FleSpeech/

* 14 pages, 3 figures

MFTF: Mask-free Training-free Object Level Layout Control Diffusion Model

Dec 02, 2024

Shan Yang

Abstract: Text-to-image generation models have become transformative tools. However, diffusion-based vision language models still lack the ability to precisely control the shape, appearance, and positional placement of objects in generated images using text guidance alone. Global image editing models typically achieve global layout control by relying on additional masks or images as guidance, which often require model training. Although local object-editing models enable modification of object shapes, they do not provide control over the positional placement of these objects. To address these limitations, we propose the MFTF model, which enables precise control over object positioning without requiring additional masks or images. The MFTF model supports both single-object and multi-object positional control (such as translation, rotation, etc.) and allows for concurrent layout control and object semantic editing. This is achieved by controlling the denoising process of the diffusion model through parallel denoising. Attention masks are dynamically generated from the cross-attention layers of the source diffusion model and applied to queries from the self-attention layers to isolate objects. These queries are then modified according to layout control parameters and injected back into the self-attention layers of the target diffusion model to enable precise positional control.

* 9 pages, 12 figures

Unleashing the Power of Large Language Models in Zero-shot Relation Extraction via Self-Prompting

Oct 02, 2024

Siyi Liu, Yang Li, Jiang Li, Shan Yang, Yunshi Lan

Abstract: Recent research in zero-shot Relation Extraction (RE) has focused on using Large Language Models (LLMs) due to their impressive zero-shot capabilities. However, current methods often perform suboptimally, mainly due to a lack of detailed, context-specific prompts needed for understanding various sentences and relations. To address this, we introduce the Self-Prompting framework, a novel method designed to fully harness the embedded RE knowledge within LLMs. Specifically, our framework employs a three-stage diversity approach to prompt LLMs, generating multiple synthetic samples that encapsulate specific relations from scratch. These generated samples act as in-context learning samples, offering explicit and context-specific guidance to efficiently prompt LLMs for RE. Experimental evaluations on benchmark datasets show our approach outperforms existing LLM-based zero-shot RE methods. Additionally, our experiments confirm the effectiveness of our generation pipeline in producing high-quality synthetic data that enhances performance.

* EMNLP 2024 Short

Multi-centric AI Model for Unruptured Intracranial Aneurysm Detection and Volumetric Segmentation in 3D TOF-MRI

Aug 30, 2024

Ashraya K. Indrakanti, Jakob Wasserthal, Martin Segeroth, Shan Yang, Victor Schulze-Zachau, Joshy Cyriac, Michael Bach, Marios Psychogios, Matthias A. Mutke

Abstract: Purpose: To develop an open-source nnU-Net-based AI model for combined detection and segmentation of unruptured intracranial aneurysms (UICA) in 3D TOF-MRI, and compare models trained on datasets with aneurysm-like differential diagnoses. Methods: This retrospective study (2020-2023) included 385 anonymized 3D TOF-MRI images from 364 patients (mean age 59 years, 60% female) at multiple centers plus 113 subjects from the ADAM challenge. Images featured untreated or possible UICAs and differential diagnoses. Four distinct training datasets were created, and the nnU-Net framework was used for model development. Performance was assessed on a separate test set using sensitivity and false positive (FP)/case rate for detection, and Dice score and Normalized Surface Distance (NSD) with a 0.5 mm threshold for segmentation. Statistical analysis included chi-square, Mann-Whitney-U, and Kruskal-Wallis tests, with significance set at p < 0.05. Results: Models achieved overall sensitivity between 82% and 85% and an FP/case rate of 0.20 to 0.31, with no significant differences (p = 0.90 and p = 0.16). The primary model showed 85% sensitivity and 0.23 FP/case rate, outperforming the ADAM-challenge winner (61%) and an nnU-Net trained on ADAM data (51%) in sensitivity (p < 0.05). It achieved a mean Dice score of 0.73 and an NSD of 0.84 for correctly detected UICA. Conclusions: Our open-source, nnU-Net-based AI model (available at 10.5281/zenodo.13386859) demonstrates high sensitivity, low false positive rates, and consistent segmentation accuracy for UICA detection and segmentation in 3D TOF-MRI, suggesting its potential to improve clinical diagnosis and monitoring of UICA.

* 14 pages, 5 figures, 3 tables, 2 supplementary tables
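
The two detection metrics named in the abstract above, sensitivity and FP/case rate, can be illustrated with a minimal sketch. The abstract does not say how predictions are matched to annotated aneurysms, so this example assumes lesion-level counts are already available per scan; the `CaseResult` structure, the `detection_metrics` function, and the toy numbers are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """Detection outcome for one scan (hypothetical structure, not from the paper)."""
    n_true_lesions: int      # annotated aneurysms in the scan
    n_detected_lesions: int  # annotated aneurysms matched by a prediction
    n_false_positives: int   # predictions that match no annotation


def detection_metrics(cases):
    """Return (lesion-level sensitivity, false positives per case)."""
    total_true = sum(c.n_true_lesions for c in cases)
    total_hit = sum(c.n_detected_lesions for c in cases)
    total_fp = sum(c.n_false_positives for c in cases)
    # Sensitivity is undefined when there are no annotated lesions at all.
    sensitivity = total_hit / total_true if total_true else float("nan")
    # FP/case averages false positives over every scan, including lesion-free ones.
    fp_per_case = total_fp / len(cases) if cases else float("nan")
    return sensitivity, fp_per_case


# Toy data: 4 scans, 4 annotated lesions, 3 detected, 1 false positive.
cases = [
    CaseResult(1, 1, 0),
    CaseResult(2, 1, 1),
    CaseResult(0, 0, 0),
    CaseResult(1, 1, 0),
]
sensitivity, fp_per_case = detection_metrics(cases)
print(f"sensitivity={sensitivity:.2f}, FP/case={fp_per_case:.2f}")
```

Because FP/case divides by the number of scans rather than the number of lesions, scans without any aneurysm still count toward the denominator.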

TokSing: Singing Voice Synthesis based on Discrete Tokens

Jun 12, 2024

Yuning Wu, Chunlei Zhang, Jiatong Shi, Yuxun Tang, Shan Yang, Qin Jin

Abstract: Recent advancements in speech synthesis have benefited significantly from discrete tokens extracted from self-supervised learning (SSL) models. Discrete tokens offer higher storage efficiency and greater operability in intermediate representations compared to traditional continuous Mel spectrograms. However, when it comes to singing voice synthesis (SVS), achieving higher levels of melody expression poses a great challenge for utilizing discrete tokens. In this paper, we introduce TokSing, a discrete-based SVS system equipped with a token formulator that offers flexible token blendings. We observe a melody degradation during discretization, prompting us to integrate a melody signal with the discrete token and incorporate a specially designed melody enhancement strategy in the musical encoder. Extensive experiments demonstrate that TokSing performs better than the Mel spectrogram baselines while offering advantages in intermediate representation space cost and convergence speed.

* Accepted by Interspeech 2024

TotalSegmentator MRI: Sequence-Independent Segmentation of 59 Anatomical Structures in MR images

May 29, 2024

Tugba Akinci D'Antonoli, Lucas K. Berger, Ashraya K. Indrakanti, Nathan Vishwanathan, Jakob Weiß, Matthias Jung, Zeynep Berkarda, Alexander Rau, Marco Reisert, Thomas Küstner (+6 more)

Abstract: Purpose: To develop an open-source and easy-to-use segmentation model that can automatically and robustly segment most major anatomical structures in MR images independently of the MR sequence. Materials and Methods: In this study we extended the capabilities of TotalSegmentator to MR images. 298 MR scans and 227 CT scans were used to segment 59 anatomical structures (20 organs, 18 bones, 11 muscles, 7 vessels, 3 tissue types) relevant for use cases such as organ volumetry, disease characterization, and surgical planning. The MR and CT images were randomly sampled from routine clinical studies and thus represent a real-world dataset (different ages, pathologies, scanners, body parts, sequences, contrasts, echo times, repetition times, field strengths, slice thicknesses and sites). We trained an nnU-Net segmentation algorithm on this dataset and calculated Dice similarity coefficients (Dice) to evaluate the model's performance. Results: The model showed a Dice score of 0.824 (CI: 0.801, 0.842) on the test set, which included a wide range of clinical data with major pathologies. The model significantly outperformed two other publicly available segmentation models (Dice score, 0.824 versus 0.762; p<0.001 and 0.762 versus 0.542; p<0.001). On the CT image test set of the original TotalSegmentator paper it almost matches the performance of the original TotalSegmentator (Dice score, 0.960 versus 0.970; p<0.001). Conclusion: Our proposed model extends the capabilities of TotalSegmentator to MR images. The annotated dataset (https://zenodo.org/doi/10.5281/zenodo.11367004) and open-source toolkit (https://www.github.com/wasserth/TotalSegmentator) are publicly available.
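
For readers unfamiliar with the Dice similarity coefficient used to evaluate the model above, here is a minimal sketch of the standard definition, 2|A ∩ B| / (|A| + |B|), on flattened binary masks. It is not the paper's evaluation code; the function name `dice_coefficient`, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration.

```python
def dice_coefficient(pred, target):
    """Dice similarity between two flattened binary masks of equal length."""
    pred = [bool(v) for v in pred]
    target = [bool(v) for v in target]
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labelled foreground in both masks.
    intersection = sum(p and t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    if denominator == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2.0 * intersection / denominator


# Toy masks: 3 foreground voxels each, 2 of them shared.
pred_mask = [1, 1, 0, 0, 1, 0]
true_mask = [1, 0, 0, 0, 1, 1]
score = dice_coefficient(pred_mask, true_mask)
print(f"Dice = {score:.3f}")
```

A score of 1.0 means the masks overlap exactly and 0.0 means they share no foreground voxel. A multi-structure score such as the one reported above would additionally require averaging over structures and scans, a step the abstract does not describe.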

EffLoc: Lightweight Vision Transformer for Efficient 6-DOF Camera Relocalization

Feb 21, 2024

Zhendong Xiao, Changhao Chen, Shan Yang, Wu Wei

Abstract: Camera relocalization is pivotal in computer vision, with applications in AR, drones, robotics, and autonomous driving. It estimates 3D camera position and orientation (6-DoF) from images. Unlike traditional methods like SLAM, recent strides use deep learning for direct end-to-end pose estimation. We propose EffLoc, a novel efficient Vision Transformer for single-image camera relocalization. EffLoc's hierarchical layout, memory-bound self-attention, and feed-forward layers boost memory efficiency and inter-channel communication. Our introduced sequential group attention (SGA) module enhances computational efficiency by diversifying input features, reducing redundancy, and expanding model capacity. EffLoc excels in efficiency and accuracy, outperforming prior methods, such as AtLoc and MapNet. It thrives on large-scale outdoor car-driving scenarios, ensuring simplicity, end-to-end trainability, and eliminating handcrafted loss functions.

* 8 pages, 6 figures, ICRA 2024 accepted

MLLMReID: Multimodal Large Language Model-based Person Re-identification

Jan 24, 2024

Shan Yang, Yongfei Zhang

Abstract: Multimodal large language models (MLLM) have achieved satisfactory results in many tasks. However, their performance in the task of person re-identification (ReID) has not been explored to date. This paper investigates how to adapt them for the task of ReID. An intuitive idea is to fine-tune MLLM with ReID image-text datasets, and then use their visual encoder as a backbone for ReID. However, there still exist two apparent issues: (1) When designing instructions for ReID, MLLMs may overfit to specific instructions, and designing a variety of instructions leads to higher costs. (2) Latent image feature vectors from LLMs are not involved in loss computation. Instructional learning, aligning image-text features, results in indirect optimization and a learning objective that inadequately utilizes features, limiting effectiveness in person feature learning. To address these problems, this paper proposes MLLMReID: Multimodal Large Language Model-based ReID. Firstly, we propose Common Instruction, a simple approach that leverages the inherent ability of LLMs to continue writing, avoiding complex and diverse instruction design. Secondly, we propose DirectReID, which effectively employs the latent image feature vectors of images outputted by LLMs in ReID tasks. The experimental results demonstrate the superiority of our method. We will open-source the code on GitHub.
