Shinji Watanabe

Carnegie Mellon University

VISinger2+: End-to-End Singing Voice Synthesis Augmented by Self-Supervised Learning Representation

Jun 13, 2024

Yifeng Yu, Jiatong Shi, Yuning Wu, Shinji Watanabe

Abstract: Singing Voice Synthesis (SVS) has witnessed significant advancements with the advent of deep learning techniques. However, a significant challenge in SVS is the scarcity of labeled singing voice data, which limits the effectiveness of supervised learning methods. In response to this challenge, this paper introduces a novel approach to enhance the quality of SVS by leveraging unlabeled data from pre-trained self-supervised learning models. Building upon the existing VISinger2 framework, this study integrates additional spectral feature information into the system to enhance its performance. The integration aims to harness the rich acoustic features from the pre-trained models, thereby enriching the synthesis and yielding a more natural and expressive singing voice. Experimental results in various corpora demonstrate the efficacy of this approach in improving the overall quality of synthesized singing voices in both objective and subjective metrics.

* 4 pages, 2 figures

ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets

Jun 12, 2024

Jiatong Shi, Shih-Heng Wang, William Chen, Martijn Bartelds, Vanya Bannihatti Kumar, Jinchuan Tian, Xuankai Chang, Dan Jurafsky, Karen Livescu, Hung-yi Lee(+1 more)

Abstract: ML-SUPERB evaluates self-supervised learning (SSL) models on the tasks of language identification and automatic speech recognition (ASR). This benchmark treats the models as feature extractors and uses a single shallow downstream model, which can be fine-tuned for a downstream task. However, real-world use cases may require different configurations. This paper presents ML-SUPERB 2.0, which is a new benchmark for evaluating pre-trained SSL and supervised speech models across downstream models, fine-tuning setups, and efficient model adaptation approaches. We find performance improvements over the setup of ML-SUPERB. However, performance depends on the downstream model design. Also, we find large performance differences between languages and datasets, suggesting the need for more targeted approaches to improve multilingual ASR performance.

* Accepted by Interspeech 2024

Self-Supervised Speech Representations are More Phonetic than Semantic

Jun 12, 2024

Kwanghee Choi, Ankita Pasad, Tomohiko Nakamura, Satoru Fukayama, Karen Livescu, Shinji Watanabe

Abstract: Self-supervised speech models (S3Ms) have become an effective backbone for speech applications. Various analyses suggest that S3Ms encode linguistic properties. In this work, we seek a more fine-grained analysis of the word-level linguistic properties encoded in S3Ms. Specifically, we curate a novel dataset of near homophone (phonetically similar) and synonym (semantically similar) word pairs and measure the similarities between S3M word representation pairs. Our study reveals that S3M representations consistently and significantly exhibit more phonetic than semantic similarity. Further, we question whether widely used intent classification datasets such as Fluent Speech Commands and Snips Smartlights are adequate for measuring semantic abilities. Our simple baseline, using only the word identity, surpasses S3M-based models. This corroborates our findings and suggests that high scores on these datasets do not necessarily guarantee the presence of semantic content.

* Accepted to Interspeech 2024. Source code at https://github.com/juice500ml/phonetic_semantic_probing
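
The pair-similarity comparison described in the abstract above can be pictured with a minimal sketch. This is not the authors' code: cosine similarity is an assumed choice of measure, the function names are hypothetical, and the toy vectors are hand-picked placeholders that carry no empirical meaning. Real word representations would come from a speech model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pair_similarity(pairs, embeddings):
    """Average cosine similarity over a list of (word_a, word_b) pairs."""
    sims = [cosine_similarity(embeddings[a], embeddings[b]) for a, b in pairs]
    return sum(sims) / len(sims)

# Hypothetical stand-ins for word-level representations (assumption).
toy_embeddings = {
    "word_a": [1.0, 0.0, 0.2],
    "word_a_soundalike": [0.9, 0.1, 0.2],
    "word_a_synonym": [0.1, 1.0, 0.3],
}
near_homophone_pairs = [("word_a", "word_a_soundalike")]
synonym_pairs = [("word_a", "word_a_synonym")]

phonetic_similarity = mean_pair_similarity(near_homophone_pairs, toy_embeddings)
semantic_similarity = mean_pair_similarity(synonym_pairs, toy_embeddings)
```

Comparing `phonetic_similarity` with `semantic_similarity` over many pairs is the kind of contrast the abstract describes. The repository linked above is the place to look for the actual procedure.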

Neural Blind Source Separation and Diarization for Distant Speech Recognition

Jun 12, 2024

Yoshiaki Bando, Tomohiko Nakamura, Shinji Watanabe

Abstract: This paper presents a neural method for distant speech recognition (DSR) that jointly separates and diarizes speech mixtures without supervision by isolated signals. A standard separation method for multi-talker DSR is a statistical multichannel method called guided source separation (GSS). While GSS does not require signal-level supervision, it relies on speaker diarization results to handle unknown numbers of active speakers. To overcome this limitation, we introduce and train a neural inference model in a weakly-supervised manner, employing the objective function of a statistical separation method. This training requires only multichannel mixtures and their temporal annotations of speaker activities. In contrast to GSS, the trained model can jointly separate and diarize speech mixtures without any auxiliary information. The experiments with the AMI corpus show that our method outperforms GSS with oracle diarization results regarding word error rates. The code is available online.

* 5 pages, 3 figures, accepted to INTERSPEECH 2024

EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation

Jun 11, 2024

Julius Richter, Yi-Chiao Wu, Steven Krenn, Simon Welker, Bunlong Lay, Shinji Watanabe, Alexander Richard, Timo Gerkmann

Abstract: We release the EARS (Expressive Anechoic Recordings of Speech) dataset, a high-quality speech dataset comprising 107 speakers from diverse backgrounds, totaling 100 hours of clean, anechoic speech data. The dataset covers a large range of different speaking styles, including emotional speech, different reading styles, non-verbal sounds, and conversational freeform speech. We benchmark various methods for speech enhancement and dereverberation on the dataset and evaluate their performance through a set of instrumental metrics. In addition, we conduct a listening test with 20 participants for the speech enhancement task, where a generative method is preferred. We introduce a blind test set that allows for automatic online evaluation of uploaded data. Dataset download links and automatic evaluation server can be found online.

* Accepted at Interspeech 2024

The Interspeech 2024 Challenge on Speech Processing Using Discrete Units

Jun 11, 2024

Xuankai Chang, Jiatong Shi, Jinchuan Tian, Yuning Wu, Yuxun Tang, Yihan Wu, Shinji Watanabe, Yossi Adi, Xie Chen, Qin Jin

Abstract: Representing speech and audio signals in discrete units has become a compelling alternative to traditional high-dimensional feature vectors. Numerous studies have highlighted the efficacy of discrete units in various applications such as speech compression and restoration, speech recognition, and speech generation. To foster exploration in this domain, we introduce the Interspeech 2024 Challenge, which focuses on new speech processing benchmarks using discrete units. It encompasses three pivotal tasks, namely multilingual automatic speech recognition, text-to-speech, and singing voice synthesis, and aims to assess the potential applicability of discrete units in these tasks. This paper outlines the challenge designs and baseline descriptions. We also collate baseline and selected submission systems, along with preliminary findings, offering valuable contributions to future research in this evolving field.

* This manuscript has been accepted by Interspeech 2024

To what extent can ASV systems naturally defend against spoofing attacks?

Jun 08, 2024

Jee-weon Jung, Xin Wang, Nicholas Evans, Shinji Watanabe, Hye-jin Shim, Hemlata Tak, Sidhhant Arora, Junichi Yamagishi, Joon Son Chung

Abstract: The current automatic speaker verification (ASV) task involves making binary decisions on two types of trials: target and non-target. However, emerging advancements in speech generation technology pose significant threats to the reliability of ASV systems. This study investigates whether ASV effortlessly acquires robustness against spoofing attacks (i.e., zero-shot capability) by systematically exploring diverse ASV systems and spoofing attacks, ranging from traditional to cutting-edge techniques. Through extensive analyses conducted on eight distinct ASV systems and 29 spoofing attack systems, we demonstrate that the evolution of ASV inherently incorporates defense mechanisms against spoofing attacks. Nevertheless, our findings also underscore that the advancement of spoofing attacks far outpaces that of ASV systems, hence necessitating further research on spoofing-robust ASV methodologies.

* 5 pages, 3 figures, 3 tables, Interspeech 2024

URGENT Challenge: Universality, Robustness, and Generalizability For Speech Enhancement

Jun 07, 2024

Wangyou Zhang, Robin Scheibler, Kohei Saijo, Samuele Cornell, Chenda Li, Zhaoheng Ni, Anurag Kumar, Jan Pirklbauer, Marvin Sach, Shinji Watanabe(+2 more)

Abstract: The last decade has witnessed significant advancements in deep learning-based speech enhancement (SE). However, most existing SE research has limitations on the coverage of SE sub-tasks, data diversity and amount, and evaluation metrics. To fill this gap and promote research toward universal SE, we establish a new SE challenge, named URGENT, to focus on the universality, robustness, and generalizability of SE. We aim to extend the SE definition to cover different sub-tasks to explore the limits of SE models, starting from denoising, dereverberation, bandwidth extension, and declipping. A novel framework is proposed to unify all these sub-tasks in a single model, allowing the use of all existing SE approaches. We collected public speech and noise data from different domains to construct diverse evaluation data. Finally, we discuss the insights gained from our preliminary baseline experiments based on both generative and discriminative SE methods with 12 curated metrics.

* 6 pages, 3 figures, 3 tables. Accepted by Interspeech 2024. An extended version of the accepted manuscript with appendix

Beyond Performance Plateaus: A Comprehensive Study on Scalability in Speech Enhancement

Jun 06, 2024

Wangyou Zhang, Kohei Saijo, Jee-weon Jung, Chenda Li, Shinji Watanabe, Yanmin Qian

Abstract: Deep learning-based speech enhancement (SE) models have achieved impressive performance in the past decade. Numerous advanced architectures have been designed to deliver state-of-the-art performance; however, their scalability potential remains unrevealed. Meanwhile, the majority of research focuses on small-sized datasets with restricted diversity, leading to a plateau in performance improvement. In this paper, we aim to provide new insights for addressing the above issues by exploring the scalability of SE models in terms of architectures, model sizes, compute budgets, and dataset sizes. Our investigation involves several popular SE architectures and speech data from different domains. Experiments reveal both similarities and distinctions between the scaling effects in SE and other tasks such as speech recognition. These findings further provide insights into the under-explored SE directions, e.g., larger-scale multi-domain corpora and efficiently scalable architectures.

* 5 pages, 3 figures, 4 tables, Accepted by Interspeech 2024

4D ASR: Joint Beam Search Integrating CTC, Attention, Transducer, and Mask Predict Decoders

Jun 05, 2024

Yui Sudo, Muhammad Shakeel, Yosuke Fukumoto, Brian Yan, Jiatong Shi, Yifan Peng, Shinji Watanabe

Abstract: End-to-end automatic speech recognition (E2E-ASR) can be classified into several network architectures, such as connectionist temporal classification (CTC), recurrent neural network transducer (RNN-T), attention-based encoder-decoder, and mask-predict models. Each network architecture has advantages and disadvantages, leading practitioners to switch between these different models depending on application requirements. Instead of building separate models, we propose a joint modeling scheme where four decoders (CTC, RNN-T, attention, and mask-predict) share the same encoder -- we refer to this as 4D modeling. The 4D model is trained using multitask learning, which will bring model regularization and maximize the model robustness thanks to their complementary properties. To efficiently train the 4D model, we introduce a two-stage training strategy that stabilizes multitask learning. In addition, we propose three novel one-pass beam search algorithms by combining three decoders (CTC, RNN-T, and attention) to further improve performance. These three beam search algorithms differ in which decoder is used as the primary decoder. We carefully evaluate the performance and computational tradeoffs associated with each algorithm. Experimental results demonstrate that the jointly trained 4D model outperforms the E2E-ASR models trained with only one individual decoder. Furthermore, we demonstrate that the proposed one-pass beam search algorithm outperforms the previously proposed CTC/attention decoding.

* submitted to IEEE/ACM Transactions on Audio Speech and Language Processing
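
As a rough illustration of what combining several decoders in one search can mean, the sketch below ranks candidate hypotheses by a weighted sum of per-decoder log-probabilities. This is an assumption made for illustration only: the abstract does not state the combination rule, the weights and scores are invented, the function names are hypothetical, and the code reranks complete hypotheses rather than implementing the paper's one-pass beam search.

```python
def joint_score(log_probs, weights):
    """Weighted sum of per-decoder log-probabilities for one hypothesis."""
    return sum(weights[name] * log_probs[name] for name in weights)

def rerank(hypotheses, weights):
    """Sort hypotheses from best to worst joint score."""
    return sorted(
        hypotheses,
        key=lambda hyp: joint_score(hyp["log_probs"], weights),
        reverse=True,
    )

# Made-up weights and scores (assumptions, not values from the paper).
weights = {"ctc": 0.3, "attention": 0.4, "transducer": 0.3}
hypotheses = [
    {"text": "hypothesis B",
     "log_probs": {"ctc": -0.5, "attention": -3.0, "transducer": -2.0}},
    {"text": "hypothesis A",
     "log_probs": {"ctc": -1.0, "attention": -2.0, "transducer": -1.5}},
]

ranked = rerank(hypotheses, weights)
best = ranked[0]["text"]
```

With these toy numbers, hypothesis B has the best CTC score, yet hypothesis A ranks first once all three decoders are weighed together. That gives a sense of why decoders with complementary properties might be worth combining.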
