Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Tomoki Toda

CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection

Jun 04, 2024

Yongyi Zang, Jiatong Shi, You Zhang, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Shengyuan Xu, Wenxiao Zhao, Jing Guo, Tomoki Toda(+1 more)

Figure 1 for CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection

Figure 2 for CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection

Figure 3 for CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection

Figure 4 for CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection

Abstract:Recent singing voice synthesis and conversion advancements necessitate robust singing voice deepfake detection (SVDD) models. Current SVDD datasets face challenges due to limited controllability, diversity in deepfake methods, and licensing restrictions. Addressing these gaps, we introduce CtrSVDD, a large-scale, diverse collection of bonafide and deepfake singing vocals. These vocals are synthesized using state-of-the-art methods from publicly accessible singing voice datasets. CtrSVDD includes 47.64 hours of bonafide and 260.34 hours of deepfake singing vocals, spanning 14 deepfake methods and involving 164 singer identities. We also present a baseline system with flexible front-end features, evaluated against a structured train/dev/eval split. The experiments show the importance of feature selection and highlight a need for generalization towards deepfake methods that deviate further from training distribution. The CtrSVDD dataset and baselines are publicly accessible.

* Accepted by Interspeech 2024

Via

Access Paper or Ask Questions

Multi-speaker Text-to-speech Training with Speaker Anonymized Data

May 20, 2024

Wen-Chin Huang, Yi-Chiao Wu, Tomoki Toda

Abstract:The trend of scaling up speech generation models poses a threat of biometric information leakage of the identities of the voices in the training data, raising privacy and security concerns. In this paper, we investigate training multi-speaker text-to-speech (TTS) models using data that underwent speaker anonymization (SA), a process that tends to hide the speaker identity of the input speech while maintaining other attributes. Two signal processing-based and three deep neural network-based SA methods were used to anonymize VCTK, a multi-speaker TTS dataset, which is further used to train an end-to-end TTS model, VITS, to perform unseen speaker TTS during the testing phase. We conducted extensive objective and subjective experiments to evaluate the anonymized training data, as well as the performance of the downstream TTS model trained using those data. Importantly, we found that UTMOS, a data-driven subjective rating predictor model, and GVD, a metric that measures the gain of voice distinctiveness, are good indicators of the downstream TTS performance. We summarize insights in the hope of helping future researchers determine the goodness of the SA system for multi-speaker TTS training.

* 5 pages. Submitted to Signal Processing Letters. Audio sample page: https://unilight.github.io/Publication-Demos/publications/sa-tts-spl/index.html

Via

Access Paper or Ask Questions

SVDD Challenge 2024: A Singing Voice Deepfake Detection Challenge Evaluation Plan

May 08, 2024

You Zhang, Yongyi Zang, Jiatong Shi, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Tomoki Toda, Zhiyao Duan

Figure 1 for SVDD Challenge 2024: A Singing Voice Deepfake Detection Challenge Evaluation Plan

Figure 2 for SVDD Challenge 2024: A Singing Voice Deepfake Detection Challenge Evaluation Plan

Figure 3 for SVDD Challenge 2024: A Singing Voice Deepfake Detection Challenge Evaluation Plan

Figure 4 for SVDD Challenge 2024: A Singing Voice Deepfake Detection Challenge Evaluation Plan

Abstract:The rapid advancement of AI-generated singing voices, which now closely mimic natural human singing and align seamlessly with musical scores, has led to heightened concerns for artists and the music industry. Unlike spoken voice, singing voice presents unique challenges due to its musical nature and the presence of strong background music, making singing voice deepfake detection (SVDD) a specialized field requiring focused attention. To promote SVDD research, we recently proposed the "SVDD Challenge," the very first research challenge focusing on SVDD for lab-controlled and in-the-wild bonafide and deepfake singing voice recordings. The challenge will be held in conjunction with the 2024 IEEE Spoken Language Technology Workshop (SLT 2024).

* Evaluation plan of the SVDD Challenge @ SLT 2024

Via

Access Paper or Ask Questions

Learning Multidimensional Disentangled Representations of Instrumental Sounds for Musical Similarity Assessment

Apr 10, 2024

Yuka Hashizume, Li Li, Atsushi Miyashita, Tomoki Toda

Figure 1 for Learning Multidimensional Disentangled Representations of Instrumental Sounds for Musical Similarity Assessment

Figure 2 for Learning Multidimensional Disentangled Representations of Instrumental Sounds for Musical Similarity Assessment

Figure 3 for Learning Multidimensional Disentangled Representations of Instrumental Sounds for Musical Similarity Assessment

Figure 4 for Learning Multidimensional Disentangled Representations of Instrumental Sounds for Musical Similarity Assessment

Abstract:To achieve a flexible recommendation and retrieval system, it is desirable to calculate music similarity by focusing on multiple partial elements of musical pieces and allowing the users to select the element they want to focus on. A previous study proposed using multiple individual networks for calculating music similarity based on each instrumental sound, but it is impractical to use each signal as a query in search systems. Using separated instrumental sounds alternatively resulted in less accuracy due to artifacts. In this paper, we propose a method to compute similarities focusing on each instrumental sound with a single network that takes mixed sounds as input instead of individual instrumental sounds. Specifically, we design a single similarity embedding space with disentangled dimensions for each instrument, extracted by Conditional Similarity Networks, which is trained by the triplet loss using masks. Experimental results have shown that (1) the proposed method can obtain more accurate feature representation than using individual networks using separated sounds as input, (2) each sub-embedding space can hold the characteristics of the corresponding instrument, and (3) the selection of similar musical pieces focusing on each instrumental sound by the proposed method can obtain human consent, especially in drums and guitar.

Via

Access Paper or Ask Questions

Discriminative Neighborhood Smoothing for Generative Anomalous Sound Detection

Mar 18, 2024

Takuya Fujimura, Keisuke Imoto, Tomoki Toda

Figure 1 for Discriminative Neighborhood Smoothing for Generative Anomalous Sound Detection

Figure 2 for Discriminative Neighborhood Smoothing for Generative Anomalous Sound Detection

Figure 3 for Discriminative Neighborhood Smoothing for Generative Anomalous Sound Detection

Figure 4 for Discriminative Neighborhood Smoothing for Generative Anomalous Sound Detection

Abstract:We propose discriminative neighborhood smoothing of generative anomaly scores for anomalous sound detection. While the discriminative approach is known to achieve better performance than generative approaches often, we have found that it sometimes causes significant performance degradation due to the discrepancy between the training and test data, making it less robust than the generative approach. Our proposed method aims to compensate for the disadvantages of generative and discriminative approaches by combining them. Generative anomaly scores are smoothed using multiple samples with similar discriminative features to improve the performance of the generative approach in an ensemble manner while keeping its robustness. Experimental results show that our proposed method greatly improves the original generative method, including absolute improvement of 22% in AUC and robustly works, while a discriminative method suffers from the discrepancy.

* Submitted to EUSIPCO 2024

Via

Access Paper or Ask Questions

Automatic design optimization of preference-based subjective evaluation with online learning in crowdsourcing environment

Mar 10, 2024

Yusuke Yasuda, Tomoki Toda

Figure 1 for Automatic design optimization of preference-based subjective evaluation with online learning in crowdsourcing environment

Figure 2 for Automatic design optimization of preference-based subjective evaluation with online learning in crowdsourcing environment

Figure 3 for Automatic design optimization of preference-based subjective evaluation with online learning in crowdsourcing environment

Figure 4 for Automatic design optimization of preference-based subjective evaluation with online learning in crowdsourcing environment

Abstract:A preference-based subjective evaluation is a key method for evaluating generative media reliably. However, its huge combinations of pairs prohibit it from being applied to large-scale evaluation using crowdsourcing. To address this issue, we propose an automatic optimization method for preference-based subjective evaluation in terms of pair combination selections and allocation of evaluation volumes with online learning in a crowdsourcing environment. We use a preference-based online learning method based on a sorting algorithm to identify the total order of evaluation targets with minimum sample volumes. Our online learning algorithm supports parallel and asynchronous execution under fixed-budget conditions required for crowdsourcing. Our experiment on preference-based subjective evaluation of synthetic speech shows that our method successfully optimizes the test by reducing pair combinations from 351 to 83 and allocating optimal evaluation volumes for each pair ranging from 30 to 663 without compromising evaluation accuracies and wasting budget allocations.

Via

Access Paper or Ask Questions

MF-AED-AEC: Speech Emotion Recognition by Leveraging Multimodal Fusion, ASR Error Detection, and ASR Error Correction

Jan 24, 2024

Jiajun He, Xiaohan Shi, Xingfeng Li, Tomoki Toda

Abstract:The prevalent approach in speech emotion recognition (SER) involves integrating both audio and textual information to comprehensively identify the speaker's emotion, with the text generally obtained through automatic speech recognition (ASR). An essential issue of this approach is that ASR errors from the text modality can worsen the performance of SER. Previous studies have proposed using an auxiliary ASR error detection task to adaptively assign weights of each word in ASR hypotheses. However, this approach has limited improvement potential because it does not address the coherence of semantic information in the text. Additionally, the inherent heterogeneity of different modalities leads to distribution gaps between their representations, making their fusion challenging. Therefore, in this paper, we incorporate two auxiliary tasks, ASR error detection (AED) and ASR error correction (AEC), to enhance the semantic coherence of ASR text, and further introduce a novel multi-modal fusion (MF) method to learn shared representations across modalities. We refer to our method as MF-AED-AEC. Experimental results indicate that MF-AED-AEC significantly outperforms the baseline model by a margin of 4.1\%.

* Accepted by ICASSP 2024

Via

Access Paper or Ask Questions

On the Effectiveness of ASR Representations in Real-world Noisy Speech Emotion Recognition

Nov 14, 2023

Xiaohan Shi, Jiajun He, Xingfeng Li, Tomoki Toda

Abstract:This paper proposes an efficient attempt to noisy speech emotion recognition (NSER). Conventional NSER approaches have proven effective in mitigating the impact of artificial noise sources, such as white Gaussian noise, but are limited to non-stationary noises in real-world environments due to their complexity and uncertainty. To overcome this limitation, we introduce a new method for NSER by adopting the automatic speech recognition (ASR) model as a noise-robust feature extractor to eliminate non-vocal information in noisy speech. We first obtain intermediate layer information from the ASR model as a feature representation for emotional speech and then apply this representation for the downstream NSER task. Our experimental results show that 1) the proposed method achieves better NSER performance compared with the conventional noise reduction method, 2) outperforms self-supervised learning approaches, and 3) even outperforms text-based approaches using ASR transcription or the ground truth transcription of noisy speech.

* Submitted to ICASSP 2024

Via

Access Paper or Ask Questions

ed-cec: improving rare word recognition using asr postprocessing based on error detection and context-aware error correction

Oct 08, 2023

Jiajun He, Zekun Yang, Tomoki Toda

Figure 1 for ed-cec: improving rare word recognition using asr postprocessing based on error detection and context-aware error correction

Figure 2 for ed-cec: improving rare word recognition using asr postprocessing based on error detection and context-aware error correction

Figure 3 for ed-cec: improving rare word recognition using asr postprocessing based on error detection and context-aware error correction

Figure 4 for ed-cec: improving rare word recognition using asr postprocessing based on error detection and context-aware error correction

Abstract:Automatic speech recognition (ASR) systems often encounter difficulties in accurately recognizing rare words, leading to errors that can have a negative impact on downstream tasks such as keyword spotting, intent detection, and text summarization. To address this challenge, we present a novel ASR postprocessing method that focuses on improving the recognition of rare words through error detection and context-aware error correction. Our method optimizes the decoding process by targeting only the predicted error positions, minimizing unnecessary computations. Moreover, we leverage a rare word list to provide additional contextual knowledge, enabling the model to better correct rare words. Experimental results across five datasets demonstrate that our proposed method achieves significantly lower word error rates (WERs) than previous approaches while maintaining a reasonable inference speed. Furthermore, our approach exhibits promising robustness across different ASR systems.

* 6 pages, 5 figures, conference

Via

Access Paper or Ask Questions

A Comparative Study of Voice Conversion Models with Large-Scale Speech and Singing Data: The T13 Systems for the Singing Voice Conversion Challenge 2023

Oct 08, 2023

Ryuichi Yamamoto, Reo Yoneyama, Lester Phillip Violeta, Wen-Chin Huang, Tomoki Toda

Figure 1 for A Comparative Study of Voice Conversion Models with Large-Scale Speech and Singing Data: The T13 Systems for the Singing Voice Conversion Challenge 2023

Figure 2 for A Comparative Study of Voice Conversion Models with Large-Scale Speech and Singing Data: The T13 Systems for the Singing Voice Conversion Challenge 2023

Figure 3 for A Comparative Study of Voice Conversion Models with Large-Scale Speech and Singing Data: The T13 Systems for the Singing Voice Conversion Challenge 2023

Figure 4 for A Comparative Study of Voice Conversion Models with Large-Scale Speech and Singing Data: The T13 Systems for the Singing Voice Conversion Challenge 2023

Abstract:This paper presents our systems (denoted as T13) for the singing voice conversion challenge (SVCC) 2023. For both in-domain and cross-domain English singing voice conversion (SVC) tasks (Task 1 and Task 2), we adopt a recognition-synthesis approach with self-supervised learning-based representation. To achieve data-efficient SVC with a limited amount of target singer/speaker's data (150 to 160 utterances for SVCC 2023), we first train a diffusion-based any-to-any voice conversion model using publicly available large-scale 750 hours of speech and singing data. Then, we finetune the model for each target singer/speaker of Task 1 and Task 2. Large-scale listening tests conducted by SVCC 2023 show that our T13 system achieves competitive naturalness and speaker similarity for the harder cross-domain SVC (Task 2), which implies the generalization ability of our proposed method. Our objective evaluation results show that using large datasets is particularly beneficial for cross-domain SVC.

* Accepted to ASRU 2023

Via

Access Paper or Ask Questions