Yan Lu

SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training

Dec 20, 2024

Wenxi Chen, Ziyang Ma, Ruiqi Yan, Yuzhe Liang, Xiquan Li, Ruiyang Xu, Zhikang Niu, Yanqiao Zhu, Yifan Yang, Zhanxun Liu (+6 more)

Abstract:Recent advancements highlight the potential of end-to-end real-time spoken dialogue systems, showcasing their low latency and high quality. In this paper, we introduce SLAM-Omni, a timbre-controllable, end-to-end voice interaction system with single-stage training. SLAM-Omni achieves zero-shot timbre control by modeling spoken language with semantic tokens and decoupling speaker information to a vocoder. By predicting grouped speech semantic tokens at each step, our method significantly reduces the sequence length of audio tokens, accelerating both training and inference. Additionally, we propose historical text prompting to compress dialogue history, facilitating efficient multi-round interactions. Comprehensive evaluations reveal that SLAM-Omni outperforms prior models of similar scale, requiring only 15 hours of training on 4 GPUs with limited data. Notably, it is the first spoken dialogue system to achieve competitive performance with a single-stage training approach, eliminating the need for pre-training on TTS or ASR tasks. Further experiments validate its multilingual and multi-turn dialogue capabilities on larger datasets.
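
The sequence-length reduction from predicting grouped semantic tokens can be illustrated with a minimal sketch. The group size, the padding id, and the helper name `group_tokens` below are hypothetical and chosen only for illustration; the abstract does not state the values or the implementation SLAM-Omni uses.

```python
from typing import List


def group_tokens(tokens: List[int], group_size: int, pad_id: int = 0) -> List[List[int]]:
    """Pack a flat token sequence into fixed-size groups, padding the last group.

    With one group predicted per decoding step, the number of steps drops
    from len(tokens) to ceil(len(tokens) / group_size).
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    remainder = len(tokens) % group_size
    # Pad only when the sequence does not divide evenly into groups.
    padded = tokens + [pad_id] * ((group_size - remainder) % group_size)
    return [padded[i:i + group_size] for i in range(0, len(padded), group_size)]


tokens = list(range(1, 11))  # 10 hypothetical semantic token ids
groups = group_tokens(tokens, group_size=3)  # group_size=3 is an assumed value
print(len(tokens), "->", len(groups))
```

In this toy setting, ten tokens need four decoding steps instead of ten, at the cost of padding the final group.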

MEATRD: Multimodal Anomalous Tissue Region Detection Enhanced with Spatial Transcriptomics

Dec 14, 2024

Kaichen Xu, Qilong Wu, Yan Lu, Yinan Zheng, Wenlin Li, Xingjie Tang, Jun Wang, Xiaobo Sun

Abstract:The detection of anomalous tissue regions (ATRs) within affected tissues is crucial in clinical diagnosis and pathological studies. Conventional automated ATR detection methods, primarily based on histology images alone, falter in cases where ATRs and normal tissues have subtle visual differences. The recent spatial transcriptomics (ST) technology profiles gene expressions across tissue regions, offering a molecular perspective for detecting ATRs. However, there is a dearth of ATR detection methods that effectively harness complementary information from both histology images and ST. To address this gap, we propose MEATRD, a novel ATR detection method that integrates histology image and ST data. MEATRD is trained to reconstruct image patches and gene expression profiles of normal tissue spots (inliers) from their multimodal embeddings, followed by learning a one-class classification AD model based on latent multimodal reconstruction errors. This strategy harmonizes the strengths of reconstruction-based and one-class classification approaches. At the heart of MEATRD is an innovative masked graph dual-attention transformer (MGDAT) network, which not only facilitates cross-modality and cross-node information sharing but also addresses the model over-generalization issue commonly seen in reconstruction-based AD methods. Additionally, we demonstrate that modality-specific, task-relevant information is collated and condensed in multimodal bottleneck encoding generated in MGDAT, marking the first theoretical analysis of the informational properties of multimodal bottleneck encoding. Extensive evaluations across eight real ST datasets reveal MEATRD's superior performance in ATR detection, surpassing various state-of-the-art AD methods. Remarkably, MEATRD also proves adept at discerning ATRs that only show slight visual deviations from normal tissues.
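
The idea of scoring spots with a one-class model built on reconstruction errors can be sketched roughly as follows. This is not the MEATRD implementation: it swaps in a simple distance-to-center score (in the spirit of Deep SVDD) over hypothetical error vectors, and the names `fit_center` and `anomaly_score` are assumptions made for illustration.

```python
from typing import List, Sequence


def fit_center(inlier_errors: Sequence[Sequence[float]]) -> List[float]:
    """Mean reconstruction-error vector over normal (inlier) spots."""
    if not inlier_errors:
        raise ValueError("need at least one inlier error vector")
    n = len(inlier_errors)
    dim = len(inlier_errors[0])
    return [sum(vec[d] for vec in inlier_errors) / n for d in range(dim)]


def anomaly_score(error: Sequence[float], center: Sequence[float]) -> float:
    """Squared Euclidean distance from the inlier center; larger means more anomalous."""
    return sum((e - c) ** 2 for e, c in zip(error, center))


# Hypothetical two-dimensional error vectors (e.g. image error, expression error).
center = fit_center([[0.0, 0.2], [0.2, 0.0]])
print(anomaly_score([0.1, 0.1], center))  # spot that reconstructs like an inlier
print(anomaly_score([1.1, 0.1], center))  # spot with a large image-side error
```

Spots whose errors resemble those of normal tissue score near zero, while spots that reconstruct poorly in either modality move away from the center.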

* AAAI 2025. Code: https://github.com/wqlzuel/MEATRD

UniGraspTransformer: Simplified Policy Distillation for Scalable Dexterous Robotic Grasping

Dec 03, 2024

Wenbo Wang, Fangyun Wei, Lei Zhou, Xi Chen, Lin Luo, Xiaohan Yi, Yizhong Zhang, Yaobo Liang, Chang Xu, Yan Lu (+2 more)

Abstract:We introduce UniGraspTransformer, a universal Transformer-based network for dexterous robotic grasping that simplifies training while enhancing scalability and performance. Unlike prior methods such as UniDexGrasp++, which require complex, multi-step training pipelines, UniGraspTransformer follows a streamlined process: first, dedicated policy networks are trained for individual objects using reinforcement learning to generate successful grasp trajectories; then, these trajectories are distilled into a single, universal network. Our approach enables UniGraspTransformer to scale effectively, incorporating up to 12 self-attention blocks for handling thousands of objects with diverse poses. Additionally, it generalizes well to both idealized and real-world inputs, evaluated in state-based and vision-based settings. Notably, UniGraspTransformer generates a broader range of grasping poses for objects of various shapes and orientations, resulting in more diverse grasp strategies. Experimental results demonstrate significant improvements over the state-of-the-art UniDexGrasp++ across various object categories, achieving success rate gains of 3.5%, 7.7%, and 10.1% on seen objects, unseen objects within seen categories, and completely unseen objects, respectively, in the vision-based setting. Project page: https://dexhand.github.io/UniGraspTransformer.

* Project page: https://dexhand.github.io/UniGraspTransformer

UniGEM: A Unified Approach to Generation and Property Prediction for Molecules

Oct 14, 2024

Shikun Feng, Yuyan Ni, Yan Lu, Zhi-Ming Ma, Wei-Ying Ma, Yanyan Lan

Abstract:Molecular generation and molecular property prediction are both crucial for drug discovery, but they are often developed independently. Inspired by recent studies demonstrating that diffusion models, a prominent generative approach, can learn meaningful data representations that enhance predictive tasks, we explore the potential for developing a unified generative model in the molecular domain that effectively addresses both molecular generation and property prediction tasks. However, the integration of these tasks is challenging due to inherent inconsistencies, making simple multi-task learning ineffective. To address this, we propose UniGEM, the first unified model to successfully integrate molecular generation and property prediction, delivering superior performance in both tasks. Our key innovation lies in a novel two-phase generative process, where predictive tasks are activated in the later stages, after the molecular scaffold is formed. We further enhance task balance through innovative training strategies. Rigorous theoretical analysis and comprehensive experiments demonstrate our significant improvements in both tasks. The principles behind UniGEM hold promise for broader applications, including natural language processing and computer vision.

* 11 pages, 5 figures

Adaptive high-precision sound source localization at low frequencies based on convolutional neural network

Sep 30, 2024

Wenbo Ma, Yan Lu, Yijun Liu

Abstract:Sound source localization (SSL) technology plays a crucial role in various application areas such as fault diagnosis, speech separation, and vibration noise reduction. Although beamforming algorithms are widely used in SSL, their resolution at low frequencies is limited. In recent years, deep learning-based SSL methods have significantly improved accuracy by employing large microphone arrays and training case-specific neural networks; however, this can narrow their applicability. To address these issues, this paper proposes a convolutional neural network-based method for high-precision SSL, which is adaptive in the lower frequency range below 1 kHz with varying numbers of sound sources and microphone array-to-scanning grid distances. It takes the pressure distribution on a relatively small microphone array as input to the neural network, and employs customized training labels and a customized loss function to train the model. The prediction accuracy, adaptability, and robustness of the trained model under certain signal-to-noise ratios (SNRs) are evaluated using randomly generated test datasets and compared with the classical beamforming algorithms CLEAN-SC and DAMAS. Results for both planar and spatial sound source distributions show that the proposed neural network model significantly improves low-frequency localization accuracy, demonstrating its effectiveness and potential in SSL.

UWF-RI2FA: Generating Multi-frame Ultrawide-field Fluorescein Angiography from Ultrawide-field Retinal Imaging Improves Diabetic Retinopathy Stratification

Aug 27, 2024

Ruoyu Chen, Kezheng Xu, Kangyan Zheng, Weiyi Zhang, Yan Lu, Danli Shi, Mingguang He

Abstract:Ultrawide-field fluorescein angiography (UWF-FA) facilitates diabetic retinopathy (DR) detection by providing a clear visualization of peripheral retinal lesions. However, the intravenous dye injection, with its potential risks, hampers its application. We aim to acquire dye-free UWF-FA images from noninvasive UWF retinal imaging (UWF-RI) using generative artificial intelligence (GenAI) and evaluate its effectiveness in DR screening. A total of 18,321 UWF-FA images of different phases were registered with corresponding UWF-RI images and fed into a generative adversarial network (GAN)-based model for training. The quality of generated UWF-FA images was evaluated through quantitative metrics and human evaluation. The DeepDRiD dataset was used to externally assess the contribution of generated UWF-FA images to DR classification, using area under the receiver operating characteristic curve (AUROC) as the outcome metric. The generated early, mid, and late phase UWF-FA images achieved high authenticity, with multi-scale similarity scores ranging from 0.70 to 0.91 and qualitative visual scores ranging from 1.64 to 1.98 (1=real UWF-FA quality). In fifty randomly selected images, 56% to 76% of the generated images were difficult to distinguish from real images in the Turing test. Moreover, adding these generated UWF-FA images for DR classification significantly increased the AUROC from 0.869 to 0.904 compared to the baseline model using UWF-RI images (P < .001). The model successfully generates realistic multi-frame UWF-FA images for enhancing DR stratification without intravenous dye injection.
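
For readers unfamiliar with the outcome metric, AUROC can be computed as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below is a plain pairwise implementation on made-up labels and scores; it is not the evaluation code from the paper, and the function name `auroc` is an assumption.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Made-up example: two negatives, two positives.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The pairwise loop is quadratic in the number of cases, which is fine for illustration; rank-based formulations give the same value more efficiently on large datasets.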

* 22 pages, 2 figures

Convert and Speak: Zero-shot Accent Conversion with Minimum Supervision

Aug 22, 2024

Zhijun Jia, Huaying Xue, Xiulian Peng, Yan Lu

Abstract:The scarcity of parallel data is the key challenge in accent conversion (AC), in which both the pronunciation units and the prosody pattern need to be converted. We propose a two-stage generative framework, "convert-and-speak", in which conversion operates only at the semantic token level and speech is synthesized, conditioned on the converted semantic tokens, by a speech generative model in the target accent domain. This decoupled design enables the "speaking" module to use massive amounts of target-accent speech and reduces the parallel data required by the "conversion" module. Using semantic tokens as a bridge also removes the requirement for data with text transcriptions and unlocks the use of language pre-training technology to further reduce the need for parallel accent speech data. To reduce the complexity and latency of "speaking", a single-stage AR generative model is designed to achieve good quality at lower computation cost. Experiments on Indian-English to general American-English conversion show that the proposed framework achieves state-of-the-art performance in accent similarity, speech quality, and speaker maintenance with only 15 minutes of weakly parallel data, which is not constrained to the same speaker. Extensive experimentation with diverse accent types suggests that this framework possesses a high degree of adaptability, making it readily scalable to other accents with low-resource data. Audio samples are available at https://www.microsoft.com/en-us/research/project/convert-and-speak-zero-shot-accent-conversion-with-minimumsupervision/.

* 9 pages, ACM MM 2024 (accepted)

Generating Multi-frame Ultrawide-field Fluorescein Angiography from Ultrawide-field Color Imaging Improves Diabetic Retinopathy Stratification

Aug 20, 2024

Ruoyu Chen, Kezheng Xu, Kangyan Zheng, Weiyi Zhang, Yan Lu, Danli Shi, Mingguang He

Abstract:Ultrawide-field fluorescein angiography (UWF-FA) facilitates diabetic retinopathy (DR) detection by providing a clear visualization of peripheral retinal lesions. However, the intravenous dye injection, with its potential risks, hampers its application. We aim to acquire dye-free UWF-FA images from noninvasive UWF color fundus (UWF-CF) images using generative artificial intelligence (GenAI) and evaluate its effectiveness in DR screening. A total of 18,321 UWF-FA images of different phases were registered with corresponding UWF-CF images and fed into a generative adversarial network (GAN)-based model for training. The quality of generated UWF-FA images was evaluated through quantitative metrics and human evaluation. The DeepDRiD dataset was used to externally assess the contribution of generated UWF-FA images to DR classification, using area under the receiver operating characteristic curve (AUROC) as the outcome metric. The generated early, mid, and late phase UWF-FA images achieved high authenticity, with multi-scale similarity scores ranging from 0.70 to 0.91 and qualitative visual scores ranging from 1.64 to 1.98 (1=real UWF-FA quality). In fifty randomly selected images, 56% to 76% of the generated images were difficult to distinguish from real images in the Turing test. Moreover, adding these generated UWF-FA images for DR classification significantly increased the AUROC from 0.869 to 0.904 compared to the baseline model using UWF-CF images (P < .001). The model successfully generates realistic multi-frame UWF-FA images without intravenous dye injection. The generated UWF-FA images enhanced DR stratification.

* 27 pages, 2 figures

A General Theory for Compositional Generalization

May 20, 2024

Jingwen Fu, Zhizheng Zhang, Yan Lu, Nanning Zheng

Abstract:Compositional Generalization (CG) embodies the ability to comprehend novel combinations of familiar concepts, representing a significant cognitive leap in human intellectual advancement. Despite its critical importance, the deep neural network (DNN) faces challenges in addressing the compositional generalization problem, prompting considerable research interest. However, existing theories often rely on task-specific assumptions, constraining the comprehensive understanding of CG. This study aims to explore compositional generalization from a task-agnostic perspective, offering a complementary viewpoint to task-specific analyses. The primary challenge is to define CG without overly restricting its scope, a feat achieved by identifying its fundamental characteristics and basing the definition on them. Using this definition, we seek to answer the question "what does the ultimate solution to CG look like?" through the following theoretical findings: 1) the first No Free Lunch theorem in CG, indicating the absence of general solutions; 2) a novel generalization bound applicable to any CG problem, specifying the conditions for an effective CG solution; and 3) the introduction of the generative effect to enhance understanding of CG problems and their solutions. This paper's significance lies in providing a general theory for CG problems, which, when combined with prior theorems under task-specific scenarios, can lead to a comprehensive understanding of CG.

Text Grouping Adapter: Adapting Pre-trained Text Detector for Layout Analysis

May 13, 2024

Tianci Bi, Xiaoyi Zhang, Zhizheng Zhang, Wenxuan Xie, Cuiling Lan, Yan Lu, Nanning Zheng

Abstract:Significant progress has been made in scene text detection models since the rise of deep learning, but scene text layout analysis, which aims to group detected text instances as paragraphs, has not kept pace. Previous works either treated text detection and grouping using separate models, or trained a unified model from scratch. None of them has made full use of already well-trained text detectors and easily obtainable detection datasets. In this paper, we present Text Grouping Adapter (TGA), a module that enables various pre-trained text detectors to learn layout analysis, allowing us to adopt a well-trained text detector right off the shelf or just fine-tune it efficiently. Designed to be compatible with various text detector architectures, TGA takes detected text regions and image features as universal inputs to assemble text instance features. To capture broader contextual information for layout analysis, we propose to predict text group masks from text instance features by one-to-many assignment. Our comprehensive experiments demonstrate that, even with frozen pre-trained models, incorporating our TGA into various pre-trained text detectors and text spotters can achieve superior layout analysis performance, simultaneously inheriting generalized text detection ability from pre-training. In the case of full parameter fine-tuning, we can further improve layout analysis performance.

* Accepted to CVPR 2024
