Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Daria Diatlova

NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech

Jul 17, 2025

Maksim Borisov, Egor Spirin, Daria Diatlova

Abstract:Current expressive speech synthesis models are constrained by the limited availability of open-source datasets containing diverse nonverbal vocalizations (NVs). In this work, we introduce NonverbalTTS (NVTTS), a 17-hour open-access dataset annotated with 10 types of NVs (e.g., laughter, coughs) and 8 emotional categories. The dataset is derived from popular sources, VoxCeleb and Expresso, using automated detection followed by human validation. We propose a comprehensive pipeline that integrates automatic speech recognition (ASR), NV tagging, emotion classification, and a fusion algorithm to merge transcriptions from multiple annotators. Fine-tuning open-source text-to-speech (TTS) models on the NVTTS dataset achieves parity with closed-source systems such as CosyVoice2, as measured by both human evaluation and automatic metrics, including speaker similarity and NV fidelity. By releasing NVTTS and its accompanying annotation guidelines, we address a key bottleneck in expressive TTS research. The dataset is available at https://huggingface.co/datasets/deepvk/NonverbalTTS.

Via

Access Paper or Ask Questions

METR: Image Watermarking with Large Number of Unique Messages

Aug 15, 2024

Alexander Varlamov, Daria Diatlova, Egor Spirin

Figure 1 for METR: Image Watermarking with Large Number of Unique Messages

Figure 2 for METR: Image Watermarking with Large Number of Unique Messages

Figure 3 for METR: Image Watermarking with Large Number of Unique Messages

Figure 4 for METR: Image Watermarking with Large Number of Unique Messages

Abstract:Improvements in diffusion models have boosted the quality of image generation, which has led researchers, companies, and creators to focus on improving watermarking algorithms. This provision would make it possible to clearly identify the creators of generative art. The main challenges that modern watermarking algorithms face have to do with their ability to withstand attacks and encrypt many unique messages, such as user IDs. In this paper, we present METR: Message Enhanced Tree-Ring, which is an approach that aims to address these challenges. METR is built on the Tree-Ring watermarking algorithm, a technique that makes it possible to encode multiple distinct messages without compromising attack resilience or image quality. This ensures the suitability of this watermarking algorithm for any Diffusion Model. In order to surpass the limitations on the quantity of encoded messages, we propose METR++, an enhanced version of METR. This approach, while limited to the Latent Diffusion Model architecture, is designed to inject a virtually unlimited number of unique messages. We demonstrate its robustness to attacks and ability to encrypt many unique messages while preserving image quality, which makes METR and METR++ hold great potential for practical applications in real-world settings. Our code is available at https://github.com/deepvk/metr

* 14 pages, 9 figures, code is available at https://github.com/deepvk/metr

Via

Access Paper or Ask Questions

Adapting WavLM for Speech Emotion Recognition

May 07, 2024

Daria Diatlova, Anton Udalov, Vitalii Shutov, Egor Spirin

Abstract:Recently, the usage of speech self-supervised models (SSL) for downstream tasks has been drawing a lot of attention. While large pre-trained models commonly outperform smaller models trained from scratch, questions regarding the optimal fine-tuning strategies remain prevalent. In this paper, we explore the fine-tuning strategies of the WavLM Large model for the speech emotion recognition task on the MSP Podcast Corpus. More specifically, we perform a series of experiments focusing on using gender and semantic information from utterances. We then sum up our findings and describe the final model we used for submission to Speech Emotion Recognition Challenge 2024.

Via

Access Paper or Ask Questions

EmoSpeech: Guiding FastSpeech2 Towards Emotional Text to Speech

Jun 28, 2023

Daria Diatlova, Vitaly Shutov

Figure 1 for EmoSpeech: Guiding FastSpeech2 Towards Emotional Text to Speech

Figure 2 for EmoSpeech: Guiding FastSpeech2 Towards Emotional Text to Speech

Figure 3 for EmoSpeech: Guiding FastSpeech2 Towards Emotional Text to Speech

Figure 4 for EmoSpeech: Guiding FastSpeech2 Towards Emotional Text to Speech

Abstract:State-of-the-art speech synthesis models try to get as close as possible to the human voice. Hence, modelling emotions is an essential part of Text-To-Speech (TTS) research. In our work, we selected FastSpeech2 as the starting point and proposed a series of modifications for synthesizing emotional speech. According to automatic and human evaluation, our model, EmoSpeech, surpasses existing models regarding both MOS score and emotion recognition accuracy in generated speech. We provided a detailed ablation study for every extension to FastSpeech2 architecture that forms EmoSpeech. The uneven distribution of emotions in the text is crucial for better, synthesized speech and intonation perception. Our model includes a conditioning mechanism that effectively handles this issue by allowing emotions to contribute to each phone with varying intensity levels. The human assessment indicates that proposed modifications generate audio with higher MOS and emotional expressiveness.

Via

Access Paper or Ask Questions