Haizhou Li

SAV-SE: Scene-aware Audio-Visual Speech Enhancement with Selective State Space Model

Nov 12, 2024

Xinyuan Qian, Jiaran Gao, Yaodan Zhang, Qiquan Zhang, Hexin Liu, Leibny Paola Garcia, Haizhou Li

Abstract: Speech enhancement plays an essential role in various applications, and the integration of visual information has been demonstrated to bring substantial advantages. However, the majority of current research concentrates on the examination of facial and lip movements, which can be compromised or entirely inaccessible in scenarios where occlusions occur or when the camera view is distant. Meanwhile, contextual visual cues from the surrounding environment have been overlooked: for example, when we see a dog bark, our brain has the innate ability to discern and filter out the barking noise. To this end, in this paper, we introduce a novel task, i.e., SAV-SE. To the best of our knowledge, this is the first proposal to use rich contextual information from synchronized video as auxiliary cues to indicate the type of noise, which eventually improves the speech enhancement performance. Specifically, we propose the VC-S$^2$E method, which incorporates the Conformer and Mamba modules for their complementary strengths. Extensive experiments are conducted on the public MUSIC, AVSpeech, and AudioSet datasets, where the results demonstrate the superiority of VC-S$^2$E over other competitive methods. We will make the source code publicly available. Project demo page: https://AVSEPage.github.io/

Speech Separation with Pretrained Frontend to Minimize Domain Mismatch

Nov 05, 2024

Wupeng Wang, Zexu Pan, Xinke Li, Shuai Wang, Haizhou Li

Abstract: Speech separation seeks to separate individual speech signals from a speech mixture. Most separation models are trained on synthetic data due to the unavailability of target references in real-world cocktail party scenarios. As a result, there exists a domain gap between real and synthetic data when deploying speech separation models in real-world applications. In this paper, we propose a self-supervised domain-invariant pretrained (DIP) frontend that is exposed to mixture data without the need for target reference speech. The DIP frontend utilizes a Siamese network with two innovative pretext tasks, mixture predictive coding (MPC) and mixture invariant coding (MIC), to capture shared contextual cues between real and synthetic unlabeled mixtures. Subsequently, we freeze the DIP frontend as a feature extractor when training the downstream speech separation models on synthetic data. By pretraining the DIP frontend with the contextual cues, we expect that the speech separation skills learned from synthetic data can be effectively transferred to real data. To benefit from the DIP frontend, we introduce a novel separation pipeline to align the feature resolution of the separation models. We evaluate the speech separation quality on standard benchmarks and real-world datasets. The results confirm the superiority of our DIP frontend over existing speech separation models. This study underscores the potential of large-scale pretraining to enhance the quality and intelligibility of speech separation in real-world applications.

* IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32 (2024), 4184-4198

VLMimic: Vision Language Models are Visual Imitation Learner for Fine-grained Actions

Oct 29, 2024

Guanyan Chen, Meiling Wang, Te Cui, Yao Mu, Haoyang Lu, Tianxing Zhou, Zicai Peng, Mengxiao Hu, Haizhou Li, Yuan Li (+2 more)

Abstract: Visual imitation learning (VIL) provides an efficient and intuitive strategy for robotic systems to acquire novel skills. Recent advancements in Vision Language Models (VLMs) have demonstrated remarkable performance in vision and language reasoning capabilities for VIL tasks. Despite the progress, current VIL methods naively employ VLMs to learn high-level plans from human videos, relying on pre-defined motion primitives for executing physical interactions, which remains a major bottleneck. In this work, we present VLMimic, a novel paradigm that harnesses VLMs to directly learn even fine-grained action levels, given only a limited number of human videos. Specifically, VLMimic first grounds object-centric movements from human videos, and learns skills using hierarchical constraint representations, facilitating the derivation of skills with fine-grained action levels from limited human videos. These skills are refined and updated through an iterative comparison strategy, enabling efficient adaptation to unseen environments. Extensive experiments show that VLMimic, using only 5 human videos, yields significant improvements of over 27% and 21% in RLBench and real-world manipulation tasks, respectively, and surpasses baselines by over 37% in long-horizon tasks.

VoiceBench: Benchmarking LLM-Based Voice Assistants

Oct 22, 2024

Yiming Chen, Xianghu Yue, Chen Zhang, Xiaoxue Gao, Robby T. Tan, Haizhou Li

Abstract: Building on the success of large language models (LLMs), recent advancements such as GPT-4o have enabled real-time speech interactions through LLM-based voice assistants, offering a significantly improved user experience compared to traditional text-based interactions. However, the absence of benchmarks designed to evaluate these speech interaction capabilities has hindered the development of LLM-based voice assistants. Current evaluations focus primarily on automatic speech recognition (ASR) or general knowledge evaluation with clean speech, neglecting the more intricate, real-world scenarios that involve diverse speaker characteristics, environmental factors, and content factors. To address this, we introduce VoiceBench, the first benchmark designed to provide a multi-faceted evaluation of LLM-based voice assistants. VoiceBench also includes both real and synthetic spoken instructions that incorporate the above three key real-world variations. Extensive experiments reveal the limitations of current LLM-based voice assistant models and offer valuable insights for future research and development in this field.

* Work in progress. Data is available at https://github.com/MatthewCYM/VoiceBench

Multi-Level Speaker Representation for Target Speaker Extraction

Oct 21, 2024

Ke Zhang, Junjie Li, Shuai Wang, Yangjie Wei, Yi Wang, Yannan Wang, Haizhou Li

Abstract: Target speaker extraction (TSE) relies on a reference cue of the target to extract the target speech from a speech mixture. While a speaker embedding is commonly used as the reference cue, such an embedding, pre-trained with a large number of speakers, may suffer from confusion of speaker identity. In this work, we propose a multi-level speaker representation approach, from raw features to neural embeddings, to serve as the speaker reference cue. We generate a spectral-level representation from the enrollment magnitude spectrogram as a raw, low-level feature, which significantly improves the model's generalization capability. Additionally, we propose a contextual embedding feature based on cross-attention mechanisms that integrate frame-level embeddings from a pre-trained speaker encoder. By incorporating speaker features across multiple levels, we significantly enhance the performance of the TSE model. Our approach achieves a 2.74 dB improvement and a 4.94% increase in extraction accuracy on the Libri2mix test set over the baseline.

* 5 pages. Submitted to ICASSP 2025. Implementation will be released at https://github.com/wenet-e2e/wesep

Multi-Source Spatial Knowledge Understanding for Immersive Visual Text-to-Speech

Oct 18, 2024

Shuwei He, Rui Liu, Haizhou Li

Abstract: Visual Text-to-Speech (VTTS) aims to take the spatial environmental image as the prompt to synthesize the reverberation speech for the spoken content. Previous research focused on the RGB modality for global environmental modeling, overlooking the potential of multi-source spatial knowledge like depth, speaker position, and environmental semantics. To address this issue, we propose a novel multi-source spatial knowledge understanding scheme for immersive VTTS, termed MS$^2$KU-VTTS. Specifically, we first prioritize the RGB image as the dominant source and consider the depth image, speaker position knowledge from object detection, and semantic captions from an image understanding LLM as supplementary sources. Afterwards, we propose a serial interaction mechanism to deeply engage with both dominant and supplementary sources. The resulting multi-source knowledge is dynamically integrated based on their contributions. This enriched interaction and integration of multi-source spatial knowledge guides the speech generation model, enhancing the immersive spatial speech experience. Experimental results demonstrate that MS$^2$KU-VTTS surpasses existing baselines in generating immersive speech. Demos and code are available at: https://github.com/MS2KU-VTTS/MS2KU-VTTS.

* 5 pages, 1 figure

Beyond Binary: Towards Fine-Grained LLM-Generated Text Detection via Role Recognition and Involvement Measurement

Oct 18, 2024

Zihao Cheng, Li Zhou, Feng Jiang, Benyou Wang, Haizhou Li

Abstract: The rapid development of large language models (LLMs), like ChatGPT, has resulted in the widespread presence of LLM-generated content on social media platforms, raising concerns about misinformation, data biases, and privacy violations, which can undermine trust in online discourse. While detecting LLM-generated content is crucial for mitigating these risks, current methods often focus on binary classification, failing to address the complexities of real-world scenarios like human-AI collaboration. To move beyond binary classification and address these challenges, we propose a new paradigm for detecting LLM-generated content. This approach introduces two novel tasks: LLM Role Recognition (LLM-RR), a multi-class classification task that identifies specific roles of LLM in content generation, and LLM Influence Measurement (LLM-IM), a regression task that quantifies the extent of LLM involvement in content creation. To support these tasks, we propose LLMDetect, a benchmark designed to evaluate detectors' performance on these new tasks. LLMDetect includes the Hybrid News Detection Corpus (HNDC) for training detectors, as well as DetectEval, a comprehensive evaluation suite that considers five distinct cross-context variations and multi-intensity variations within the same LLM role. This allows for a thorough assessment of detectors' generalization and robustness across diverse contexts. Our empirical validation of 10 baseline detection methods demonstrates that fine-tuned PLM-based models consistently outperform others on both tasks, while advanced LLMs face challenges in accurately detecting their own generated content. Our experimental results and analysis offer insights for developing more effective detection models for LLM-generated content. This research enhances the understanding of LLM-generated content and establishes a foundation for more nuanced detection methodologies.

* Social Media, Large Language Models, LLM-generated Text Detection, AI-assisted News Detection

Roadmap towards Superhuman Speech Understanding using Large Language Models

Oct 17, 2024

Fan Bu, Yuhao Zhang, Xidong Wang, Benyou Wang, Qun Liu, Haizhou Li

Abstract: The success of large language models (LLMs) has prompted efforts to integrate speech and audio data, aiming to create general foundation models capable of processing both textual and non-textual inputs. Recent advances, such as GPT-4o, highlight the potential for end-to-end speech LLMs, which preserve non-semantic information and world knowledge for deeper speech understanding. To guide the development of speech LLMs, we propose a five-level roadmap, ranging from basic automatic speech recognition (ASR) to advanced superhuman models capable of integrating non-semantic information with abstract acoustic knowledge for complex tasks. Moreover, we design a benchmark, SAGI Benchmark, that standardizes critical aspects across various tasks in these five levels, uncovering challenges in using abstract acoustic knowledge and completeness of capability. Our findings reveal gaps in handling paralinguistic cues and abstract acoustic knowledge, and we offer future directions. This paper outlines a roadmap for advancing speech LLMs, introduces a benchmark for evaluation, and provides key insights into their current limitations and potential.

Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Modeling

Oct 12, 2024

Rui Liu, Zhenqi Jia, Jie Yang, Yifan Hu, Haizhou Li

Abstract: Conversational Text-to-Speech (CTTS) aims to accurately express an utterance with the appropriate style within a conversational setting, and has attracted increasing attention. While recognizing the significance of the CTTS task, prior studies have not thoroughly investigated speech emphasis expression, which is essential for conveying the underlying intention and attitude in human-machine interaction scenarios, due to the scarcity of conversational emphasis datasets and the difficulty in context understanding. In this paper, we propose a novel Emphasis Rendering scheme for the CTTS model, termed ER-CTTS, that includes two main components: 1) we simultaneously take into account textual and acoustic contexts, with both global and local semantic modeling to understand the conversation context comprehensively; 2) we deeply integrate multi-modal and multi-scale context to learn the influence of context on the emphasis expression of the current utterance. Finally, the inferred emphasis feature is fed into the neural speech synthesizer to generate conversational speech. To address data scarcity, we create emphasis intensity annotations on the existing conversational dataset (DailyTalk). Both objective and subjective evaluations suggest that our model outperforms the baseline models in emphasis rendering within a conversational setting. The code and audio samples are available at https://github.com/CodeStoreTTS/ER-CTTS.

* submitted to IEEE Transaction

Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language Models

Sep 27, 2024

Yiming Chen, Xianghu Yue, Xiaoxue Gao, Chen Zhang, Luis Fernando D'Haro, Robby T. Tan, Haizhou Li

Abstract: Various audio-LLMs (ALLMs) have been explored recently for tackling different audio tasks simultaneously using a single, unified model. While existing evaluations of ALLMs primarily focus on single-audio tasks, real-world applications often involve processing multiple audio streams simultaneously. To bridge this gap, we propose the first multi-audio evaluation (MAE) benchmark that consists of 20 datasets from 11 multi-audio tasks encompassing both speech and sound scenarios. Comprehensive experiments on MAE demonstrate that the existing ALLMs, while being powerful in comprehending primary audio elements in individual audio inputs, struggle to handle multi-audio scenarios. To this end, we propose a novel multi-audio-LLM (MALLM) to capture audio context among multiple similar audios using discriminative learning on our proposed synthetic data. The results demonstrate that the proposed MALLM outperforms all baselines and achieves high data efficiency using synthetic data without requiring human annotations. The proposed MALLM opens the door for ALLMs towards the multi-audio processing era and brings us closer to replicating human auditory capabilities in machines.

* EMNLP24 Findings
