Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Zhiyao Duan

GTR-Voice: Articulatory Phonetics Informed Controllable Expressive Speech Synthesis

Jun 15, 2024

Zehua Kcriss Li, Meiying Melissa Chen, Yi Zhong, Pinxin Liu, Zhiyao Duan

Figure 1 for GTR-Voice: Articulatory Phonetics Informed Controllable Expressive Speech Synthesis

Figure 2 for GTR-Voice: Articulatory Phonetics Informed Controllable Expressive Speech Synthesis

Abstract:Expressive speech synthesis aims to generate speech that captures a wide range of para-linguistic features, including emotion and articulation, though current research primarily emphasizes emotional aspects over the nuanced articulatory features mastered by professional voice actors. Inspired by this, we explore expressive speech synthesis through the lens of articulatory phonetics. Specifically, we define a framework with three dimensions: Glottalization, Tenseness, and Resonance (GTR), to guide the synthesis at the voice production level. With this framework, we record a high-quality speech dataset named GTR-Voice, featuring 20 Chinese sentences articulated by a professional voice actor across 125 distinct GTR combinations. We verify the framework and GTR annotations through automatic classification and listening tests, and demonstrate precise controllability along the GTR dimensions on two fine-tuned expressive TTS models. We open-source the dataset and TTS models.

Via

Access Paper or Ask Questions

Articulatory Phonetics Informed Controllable Expressive Speech Synthesis

Jun 15, 2024

Zehua Kcriss Li, Meiying Melissa Chen, Yi Zhong, Pinxin Liu, Zhiyao Duan

Figure 1 for Articulatory Phonetics Informed Controllable Expressive Speech Synthesis

Figure 2 for Articulatory Phonetics Informed Controllable Expressive Speech Synthesis

Via

Access Paper or Ask Questions

CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection

Jun 04, 2024

Yongyi Zang, Jiatong Shi, You Zhang, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Shengyuan Xu, Wenxiao Zhao, Jing Guo, Tomoki Toda(+1 more)

Figure 1 for CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection

Figure 2 for CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection

Figure 3 for CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection

Figure 4 for CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection

Abstract:Recent singing voice synthesis and conversion advancements necessitate robust singing voice deepfake detection (SVDD) models. Current SVDD datasets face challenges due to limited controllability, diversity in deepfake methods, and licensing restrictions. Addressing these gaps, we introduce CtrSVDD, a large-scale, diverse collection of bonafide and deepfake singing vocals. These vocals are synthesized using state-of-the-art methods from publicly accessible singing voice datasets. CtrSVDD includes 47.64 hours of bonafide and 260.34 hours of deepfake singing vocals, spanning 14 deepfake methods and involving 164 singer identities. We also present a baseline system with flexible front-end features, evaluated against a structured train/dev/eval split. The experiments show the importance of feature selection and highlight a need for generalization towards deepfake methods that deviate further from training distribution. The CtrSVDD dataset and baselines are publicly accessible.

* Accepted by Interspeech 2024

Via

Access Paper or Ask Questions

SVDD Challenge 2024: A Singing Voice Deepfake Detection Challenge Evaluation Plan

May 08, 2024

You Zhang, Yongyi Zang, Jiatong Shi, Ryuichi Yamamoto, Jionghao Han, Yuxun Tang, Tomoki Toda, Zhiyao Duan

Figure 1 for SVDD Challenge 2024: A Singing Voice Deepfake Detection Challenge Evaluation Plan

Figure 2 for SVDD Challenge 2024: A Singing Voice Deepfake Detection Challenge Evaluation Plan

Figure 3 for SVDD Challenge 2024: A Singing Voice Deepfake Detection Challenge Evaluation Plan

Figure 4 for SVDD Challenge 2024: A Singing Voice Deepfake Detection Challenge Evaluation Plan

Abstract:The rapid advancement of AI-generated singing voices, which now closely mimic natural human singing and align seamlessly with musical scores, has led to heightened concerns for artists and the music industry. Unlike spoken voice, singing voice presents unique challenges due to its musical nature and the presence of strong background music, making singing voice deepfake detection (SVDD) a specialized field requiring focused attention. To promote SVDD research, we recently proposed the "SVDD Challenge," the very first research challenge focusing on SVDD for lab-controlled and in-the-wild bonafide and deepfake singing voice recordings. The challenge will be held in conjunction with the 2024 IEEE Spoken Language Technology Workshop (SLT 2024).

* Evaluation plan of the SVDD Challenge @ SLT 2024

Via

Access Paper or Ask Questions

Scoring Intervals using Non-Hierarchical Transformer For Automatic Piano Transcription

Apr 17, 2024

Yujia Yan, Zhiyao Duan

Figure 1 for Scoring Intervals using Non-Hierarchical Transformer For Automatic Piano Transcription

Figure 2 for Scoring Intervals using Non-Hierarchical Transformer For Automatic Piano Transcription

Figure 3 for Scoring Intervals using Non-Hierarchical Transformer For Automatic Piano Transcription

Figure 4 for Scoring Intervals using Non-Hierarchical Transformer For Automatic Piano Transcription

Abstract:The neural semi-Markov Conditional Random Field (semi-CRF) framework has demonstrated promise for event-based piano transcription. In this framework, all events (notes or pedals) are represented as closed intervals tied to specific event types. The neural semi-CRF approach requires an interval scoring matrix that assigns a score for every candidate interval. However, designing an efficient and expressive architecture for scoring intervals is not trivial. In this paper, we introduce a simple method for scoring intervals using scaled inner product operations that resemble how attention scoring is done in transformers. We show theoretically that, due to the special structure from encoding the non-overlapping intervals, under a mild condition, the inner product operations are expressive enough to represent an ideal scoring matrix that can yield the correct transcription result. We then demonstrate that an encoder-only non-hierarchical transformer backbone, operating only on a low-time-resolution feature map, is capable of transcribing piano notes and pedals with high accuracy and time precision. The experiment shows that our approach achieves the new state-of-the-art performance across all subtasks in terms of the F1 measure on the Maestro dataset.

* Fixed Typos

Via

Access Paper or Ask Questions

MusicHiFi: Fast High-Fidelity Stereo Vocoding

Mar 20, 2024

Ge Zhu, Juan-Pablo Caceres, Zhiyao Duan, Nicholas J. Bryan

Figure 1 for MusicHiFi: Fast High-Fidelity Stereo Vocoding

Figure 2 for MusicHiFi: Fast High-Fidelity Stereo Vocoding

Figure 3 for MusicHiFi: Fast High-Fidelity Stereo Vocoding

Figure 4 for MusicHiFi: Fast High-Fidelity Stereo Vocoding

Abstract:Diffusion-based audio and music generation models commonly generate music by constructing an image representation of audio (e.g., a mel-spectrogram) and then converting it to audio using a phase reconstruction model or vocoder. Typical vocoders, however, produce monophonic audio at lower resolutions (e.g., 16-24 kHz), which limits their effectiveness. We propose MusicHiFi -- an efficient high-fidelity stereophonic vocoder. Our method employs a cascade of three generative adversarial networks (GANs) that convert low-resolution mel-spectrograms to audio, upsamples to high-resolution audio via bandwidth expansion, and upmixes to stereophonic audio. Compared to previous work, we propose 1) a unified GAN-based generator and discriminator architecture and training procedure for each stage of our cascade, 2) a new fast, near downsampling-compatible bandwidth extension module, and 3) a new fast downmix-compatible mono-to-stereo upmixer that ensures the preservation of monophonic content in the output. We evaluate our approach using both objective and subjective listening tests and find our approach yields comparable or better audio quality, better spatialization control, and significantly faster inference speed compared to past work. Sound examples are at https://MusicHiFi.github.io/web/.

Via

Access Paper or Ask Questions

Toward Fully Self-Supervised Multi-Pitch Estimation

Feb 23, 2024

Frank Cwitkowitz, Zhiyao Duan

Figure 1 for Toward Fully Self-Supervised Multi-Pitch Estimation

Figure 2 for Toward Fully Self-Supervised Multi-Pitch Estimation

Figure 3 for Toward Fully Self-Supervised Multi-Pitch Estimation

Figure 4 for Toward Fully Self-Supervised Multi-Pitch Estimation

Abstract:Multi-pitch estimation is a decades-long research problem involving the detection of pitch activity associated with concurrent musical events within multi-instrument mixtures. Supervised learning techniques have demonstrated solid performance on more narrow characterizations of the task, but suffer from limitations concerning the shortage of large-scale and diverse polyphonic music datasets with multi-pitch annotations. We present a suite of self-supervised learning objectives for multi-pitch estimation, which encourage the concentration of support around harmonics, invariance to timbral transformations, and equivariance to geometric transformations. These objectives are sufficient to train an entirely convolutional autoencoder to produce multi-pitch salience-grams directly, without any fine-tuning. Despite training exclusively on a collection of synthetic single-note audio samples, our fully self-supervised framework generalizes to polyphonic music mixtures, and achieves performance comparable to supervised models trained on conventional multi-pitch datasets.

Via

Access Paper or Ask Questions

Cacophony: An Improved Contrastive Audio-Text Model

Feb 10, 2024

Ge Zhu, Zhiyao Duan

Figure 1 for Cacophony: An Improved Contrastive Audio-Text Model

Figure 2 for Cacophony: An Improved Contrastive Audio-Text Model

Figure 3 for Cacophony: An Improved Contrastive Audio-Text Model

Figure 4 for Cacophony: An Improved Contrastive Audio-Text Model

Abstract:Despite recent improvements in audio-text modeling, audio-text contrastive models still lag behind their image-text counterparts in scale and performance. We propose a method to improve both the scale and the training of audio-text contrastive models. Specifically, we craft a large-scale audio-text dataset consisting of over 13,000 hours of text-labeled audio, aided by large language model (LLM) processing and audio captioning. Further, we employ an masked autoencoder (MAE) pre-pretraining phase with random patch dropout, which allows us to both scale unlabeled audio datasets and train efficiently with variable length audio. After MAE pre-pretraining of our audio encoder, we train a contrastive model with an auxiliary captioning objective. Our final model, which we name Cacophony, achieves state-of-the-art performance on audio-text retrieval tasks, and exhibits competitive results on other downstream tasks such as zero-shot classification.

* Work in Progress

Via

Access Paper or Ask Questions

Learning Arousal-Valence Representation from Categorical Emotion Labels of Speech

Nov 24, 2023

Enting Zhou, You Zhang, Zhiyao Duan

Figure 1 for Learning Arousal-Valence Representation from Categorical Emotion Labels of Speech

Figure 2 for Learning Arousal-Valence Representation from Categorical Emotion Labels of Speech

Figure 3 for Learning Arousal-Valence Representation from Categorical Emotion Labels of Speech

Figure 4 for Learning Arousal-Valence Representation from Categorical Emotion Labels of Speech

Abstract:Dimensional representations of speech emotions such as the arousal-valence (AV) representation provide a continuous and fine-grained description and control than their categorical counterparts. They have wide applications in tasks such as dynamic emotion understanding and expressive text-to-speech synthesis. Existing methods that predict the dimensional emotion representation from speech cast it as a supervised regression task. These methods face data scarcity issues, as dimensional annotations are much harder to acquire than categorical labels. In this work, we propose to learn the AV representation from categorical emotion labels of speech. We start by learning a rich and emotion-relevant high-dimensional speech feature representation using self-supervised pre-training and emotion classification fine-tuning. This representation is then mapped to the 2D AV space according to psychological findings through anchored dimensionality reduction. Experiments show that our method achieves a Concordance Correlation Coefficient (CCC) performance comparable to state-of-the-art supervised regression methods on IEMOCAP without leveraging ground-truth AV annotations during training. This validates our proposed approach on AV prediction. Furthermore, visualization of AV predictions on MEAD and EmoDB datasets shows the interpretability of the learned AV representations.

Via

Access Paper or Ask Questions

EDMSound: Spectrogram Based Diffusion Models for Efficient and High-Quality Audio Synthesis

Nov 18, 2023

Ge Zhu, Yutong Wen, Marc-André Carbonneau, Zhiyao Duan

Figure 1 for EDMSound: Spectrogram Based Diffusion Models for Efficient and High-Quality Audio Synthesis

Figure 2 for EDMSound: Spectrogram Based Diffusion Models for Efficient and High-Quality Audio Synthesis

Figure 3 for EDMSound: Spectrogram Based Diffusion Models for Efficient and High-Quality Audio Synthesis

Figure 4 for EDMSound: Spectrogram Based Diffusion Models for Efficient and High-Quality Audio Synthesis

Abstract:Audio diffusion models can synthesize a wide variety of sounds. Existing models often operate on the latent domain with cascaded phase recovery modules to reconstruct waveform. This poses challenges when generating high-fidelity audio. In this paper, we propose EDMSound, a diffusion-based generative model in spectrogram domain under the framework of elucidated diffusion models (EDM). Combining with efficient deterministic sampler, we achieved similar Fr\'echet audio distance (FAD) score as top-ranked baseline with only 10 steps and reached state-of-the-art performance with 50 steps on the DCASE2023 foley sound generation benchmark. We also revealed a potential concern regarding diffusion based audio generation models that they tend to generate samples with high perceptual similarity to the data from training data. Project page: https://agentcooper2002.github.io/EDMSound/

* Accepted at NeurIPS Workshop: Machine Learning for Audio (Camera Ready)

Via

Access Paper or Ask Questions