Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Konstantinos Markopoulos

Pseudo-Cepstrum: Pitch Modification for Mel-Based Neural Vocoders

Dec 18, 2025

Nikolaos Ellinas, Alexandra Vioni, Panos Kakoulidis, Georgios Vamvoukakis, Myrsini Christidou, Konstantinos Markopoulos, Junkwang Oh, Gunu Jho, Inchul Hwang, Aimilios Chalamandaris(+1 more)

Figure 1 for Pseudo-Cepstrum: Pitch Modification for Mel-Based Neural Vocoders

Figure 2 for Pseudo-Cepstrum: Pitch Modification for Mel-Based Neural Vocoders

Figure 3 for Pseudo-Cepstrum: Pitch Modification for Mel-Based Neural Vocoders

Figure 4 for Pseudo-Cepstrum: Pitch Modification for Mel-Based Neural Vocoders

Abstract:This paper introduces a cepstrum-based pitch modification method that can be applied to any mel-spectrogram representation. As a result, this method is compatible with any mel-based vocoder without requiring any additional training or changes to the model. This is achieved by directly modifying the cepstrum feature space in order to shift the harmonic structure to the desired target. The spectrogram magnitude is computed via the pseudo-inverse mel transform, then converted to the cepstrum by applying DCT. In this domain, the cepstral peak is shifted without having to estimate its position and the modified mel is recomputed by applying IDCT and mel-filterbank. These pitch-shifted mel-spectrogram features can be converted to speech with any compatible vocoder. The proposed method is validated experimentally with objective and subjective metrics on various state-of-the-art neural vocoders as well as in comparison with traditional pitch modification methods.

Via

Access Paper or Ask Questions

Improved Text Emotion Prediction Using Combined Valence and Arousal Ordinal Classification

Apr 02, 2024

Michael Mitsios, Georgios Vamvoukakis, Georgia Maniati, Nikolaos Ellinas, Georgios Dimitriou, Konstantinos Markopoulos, Panos Kakoulidis, Alexandra Vioni, Myrsini Christidou, Junkwang Oh(+6 more)

Figure 1 for Improved Text Emotion Prediction Using Combined Valence and Arousal Ordinal Classification

Figure 2 for Improved Text Emotion Prediction Using Combined Valence and Arousal Ordinal Classification

Figure 3 for Improved Text Emotion Prediction Using Combined Valence and Arousal Ordinal Classification

Figure 4 for Improved Text Emotion Prediction Using Combined Valence and Arousal Ordinal Classification

Abstract:Emotion detection in textual data has received growing interest in recent years, as it is pivotal for developing empathetic human-computer interaction systems. This paper introduces a method for categorizing emotions from text, which acknowledges and differentiates between the diversified similarities and distinctions of various emotions. Initially, we establish a baseline by training a transformer-based model for standard emotion classification, achieving state-of-the-art performance. We argue that not all misclassifications are of the same importance, as there are perceptual similarities among emotional classes. We thus redefine the emotion labeling problem by shifting it from a traditional classification model to an ordinal classification one, where discrete emotions are arranged in a sequential order according to their valence levels. Finally, we propose a method that performs ordinal classification in the two-dimensional emotion space, considering both valence and arousal scales. The results show that our approach not only preserves high accuracy in emotion prediction but also significantly reduces the magnitude of errors in cases of misclassification.

Via

Access Paper or Ask Questions

Generating Gender-Ambiguous Text-to-Speech Voices

Nov 01, 2022

Konstantinos Markopoulos, Georgia Maniati, Georgios Vamvoukakis, Nikolaos Ellinas, Karolos Nikitaras, Konstantinos Klapsas, Georgios Vardaxoglou, Panos Kakoulidis, June Sig Sung, Inchul Hwang(+3 more)

Figure 1 for Generating Gender-Ambiguous Text-to-Speech Voices

Figure 2 for Generating Gender-Ambiguous Text-to-Speech Voices

Figure 3 for Generating Gender-Ambiguous Text-to-Speech Voices

Figure 4 for Generating Gender-Ambiguous Text-to-Speech Voices

Abstract:The gender of a voice assistant or any voice user interface is a central element of its perceived identity. While a female voice is a common choice, there is an increasing interest in alternative approaches where the gender is ambiguous rather than clearly identifying as female or male. This work addresses the task of generating gender-ambiguous text-to-speech (TTS) voices that do not correspond to any existing person. This is accomplished by sampling from a latent speaker embeddings' space that was formed while training a multilingual, multi-speaker TTS system on data from multiple male and female speakers. Various options are investigated regarding the sampling process. In our experiments, the effects of different sampling choices on the gender ambiguity and the naturalness of the resulting voices are evaluated. The proposed method is shown able to efficiently generate novel speakers that are superior to a baseline averaged speaker embedding. To our knowledge, this is the first systematic approach that can reliably generate a range of gender-ambiguous voices to meet diverse user requirements.

* Submitted to ICASSP 2023

Via

Access Paper or Ask Questions

Cross-lingual Text-To-Speech with Flow-based Voice Conversion for Improved Pronunciation

Oct 31, 2022

Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, Georgia Maniati, Panos Kakoulidis, June Sig Sung, Inchul Hwang, Spyros Raptis, Aimilios Chalamandaris, Pirros Tsiakoulis

Figure 1 for Cross-lingual Text-To-Speech with Flow-based Voice Conversion for Improved Pronunciation

Figure 2 for Cross-lingual Text-To-Speech with Flow-based Voice Conversion for Improved Pronunciation

Figure 3 for Cross-lingual Text-To-Speech with Flow-based Voice Conversion for Improved Pronunciation

Figure 4 for Cross-lingual Text-To-Speech with Flow-based Voice Conversion for Improved Pronunciation

Abstract:This paper presents a method for end-to-end cross-lingual text-to-speech (TTS) which aims to preserve the target language's pronunciation regardless of the original speaker's language. The model used is based on a non-attentive Tacotron architecture, where the decoder has been replaced with a normalizing flow network conditioned on the speaker identity, allowing both TTS and voice conversion (VC) to be performed by the same model due to the inherent linguistic content and speaker identity disentanglement. When used in a cross-lingual setting, acoustic features are initially produced with a native speaker of the target language and then voice conversion is applied by the same model in order to convert these features to the target speaker's voice. We verify through objective and subjective evaluations that our method can have benefits compared to baseline cross-lingual synthesis. By including speakers averaging 7.5 minutes of speech, we also present positive results on low-resource scenarios.

* Submitted to ICASSP 2023

Via

Access Paper or Ask Questions

Fine-grained Noise Control for Multispeaker Speech Synthesis

Apr 11, 2022

Karolos Nikitaras, Georgios Vamvoukakis, Nikolaos Ellinas, Konstantinos Klapsas, Konstantinos Markopoulos, Spyros Raptis, June Sig Sung, Gunu Jho, Aimilios Chalamandaris, Pirros Tsiakoulis

Figure 1 for Fine-grained Noise Control for Multispeaker Speech Synthesis

Figure 2 for Fine-grained Noise Control for Multispeaker Speech Synthesis

Figure 3 for Fine-grained Noise Control for Multispeaker Speech Synthesis

Figure 4 for Fine-grained Noise Control for Multispeaker Speech Synthesis

Abstract:A text-to-speech (TTS) model typically factorizes speech attributes such as content, speaker and prosody into disentangled representations.Recent works aim to additionally model the acoustic conditions explicitly, in order to disentangle the primary speech factors, i.e. linguistic content, prosody and timbre from any residual factors, such as recording conditions and background noise.This paper proposes unsupervised, interpretable and fine-grained noise and prosody modeling. We incorporate adversarial training, representation bottleneck and utterance-to-frame modeling in order to learn frame-level noise representations. To the same end, we perform fine-grained prosody modeling via a Fully Hierarchical Variational AutoEncoder (FVAE) which additionally results in more expressive speech synthesis.

* submitted to Interspeech 2022

Via

Access Paper or Ask Questions

Karaoker: Alignment-free singing voice synthesis with speech training data

Apr 08, 2022

Panos Kakoulidis, Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, June Sig Sung, Gunu Jho, Pirros Tsiakoulis, Aimilios Chalamandaris

Figure 1 for Karaoker: Alignment-free singing voice synthesis with speech training data

Figure 2 for Karaoker: Alignment-free singing voice synthesis with speech training data

Figure 3 for Karaoker: Alignment-free singing voice synthesis with speech training data

Abstract:Existing singing voice synthesis models (SVS) are usually trained on singing data and depend on either error-prone time-alignment and duration features or explicit music score information. In this paper, we propose Karaoker, a multispeaker Tacotron-based model conditioned on voice characteristic features that is trained exclusively on spoken data without requiring time-alignments. Karaoker synthesizes singing voice following a multi-dimensional template extracted from a source waveform of an unseen speaker/singer. The model is jointly conditioned with a single deep convolutional encoder on continuous data including pitch, intensity, harmonicity, formants, cepstral peak prominence and octaves. We extend the text-to-speech training objective with feature reconstruction, classification and speaker identification tasks that guide the model to an accurate result. Except for multi-tasking, we also employ a Wasserstein GAN training scheme as well as new losses on the acoustic model's output to further refine the quality of the model.

* Submitted to INTERSPEECH 2022

Via

Access Paper or Ask Questions

Self supervised learning for robust voice cloning

Apr 07, 2022

Konstantinos Klapsas, Nikolaos Ellinas, Karolos Nikitaras, Georgios Vamvoukakis, Panos Kakoulidis, Konstantinos Markopoulos, Spyros Raptis, June Sig Sung, Gunu Jho, Aimilios Chalamandaris(+1 more)

Figure 1 for Self supervised learning for robust voice cloning

Figure 2 for Self supervised learning for robust voice cloning

Figure 3 for Self supervised learning for robust voice cloning

Abstract:Voice cloning is a difficult task which requires robust and informative features incorporated in a high quality TTS system in order to effectively copy an unseen speaker's voice. In our work, we utilize features learned in a self-supervised framework via the Bootstrap Your Own Latent (BYOL) method, which is shown to produce high quality speech representations when specific audio augmentations are applied to the vanilla algorithm. We further extend the augmentations in the training procedure to aid the resulting features to capture the speaker identity and to make them robust to noise and acoustic conditions. The learned features are used as pre-trained utterance-level embeddings and as inputs to a Non-Attentive Tacotron based architecture, aiming to achieve multispeaker speech synthesis without utilizing additional speaker features. This method enables us to train our model in an unlabeled multispeaker dataset as well as use unseen speaker embeddings to copy a speaker's voice. Subjective and objective evaluations are used to validate the proposed model, as well as the robustness to the acoustic conditions of the target utterance.

* Submitted to INTERSPEECH 2022

Via

Access Paper or Ask Questions

Improved Prosodic Clustering for Multispeaker and Speaker-independent Phoneme-level Prosody Control

Nov 19, 2021

Myrsini Christidou, Alexandra Vioni, Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, Panos Kakoulidis, June Sig Sung, Hyoungmin Park, Aimilios Chalamandaris, Pirros Tsiakoulis

Figure 1 for Improved Prosodic Clustering for Multispeaker and Speaker-independent Phoneme-level Prosody Control

Figure 2 for Improved Prosodic Clustering for Multispeaker and Speaker-independent Phoneme-level Prosody Control

Figure 3 for Improved Prosodic Clustering for Multispeaker and Speaker-independent Phoneme-level Prosody Control

Figure 4 for Improved Prosodic Clustering for Multispeaker and Speaker-independent Phoneme-level Prosody Control

Abstract:This paper presents a method for phoneme-level prosody control of F0 and duration on a multispeaker text-to-speech setup, which is based on prosodic clustering. An autoregressive attention-based model is used, incorporating multispeaker architecture modules in parallel to a prosody encoder. Several improvements over the basic single-speaker method are proposed that increase the prosodic control range and coverage. More specifically we employ data augmentation, F0 normalization, balanced clustering for duration, and speaker-independent prosodic clustering. These modifications enable fine-grained phoneme-level prosody control for all speakers contained in the training set, while maintaining the speaker identity. The model is also fine-tuned to unseen speakers with limited amounts of data and it is shown to maintain its prosody control capabilities, verifying that the speaker-independent prosodic clustering is effective. Experimental results verify that the model maintains high output speech quality and that the proposed method allows efficient prosody control within each speaker's range despite the variability that a multispeaker setting introduces.

* Proceedings of SPECOM 2021

Via

Access Paper or Ask Questions

Rapping-Singing Voice Synthesis based on Phoneme-level Prosody Control

Nov 17, 2021

Konstantinos Markopoulos, Nikolaos Ellinas, Alexandra Vioni, Myrsini Christidou, Panos Kakoulidis, Georgios Vamvoukakis, Georgia Maniati, June Sig Sung, Hyoungmin Park, Pirros Tsiakoulis(+1 more)

Figure 1 for Rapping-Singing Voice Synthesis based on Phoneme-level Prosody Control

Figure 2 for Rapping-Singing Voice Synthesis based on Phoneme-level Prosody Control

Figure 3 for Rapping-Singing Voice Synthesis based on Phoneme-level Prosody Control

Figure 4 for Rapping-Singing Voice Synthesis based on Phoneme-level Prosody Control

Abstract:In this paper, a text-to-rapping/singing system is introduced, which can be adapted to any speaker's voice. It utilizes a Tacotron-based multispeaker acoustic model trained on read-only speech data and which provides prosody control at the phoneme level. Dataset augmentation and additional prosody manipulation based on traditional DSP algorithms are also investigated. The neural TTS model is fine-tuned to an unseen speaker's limited recordings, allowing rapping/singing synthesis with the target's speaker voice. The detailed pipeline of the system is described, which includes the extraction of the target pitch and duration values from an a capella song and their conversion into target speaker's valid range of notes before synthesis. An additional stage of prosodic manipulation of the output via WSOLA is also investigated for better matching the target duration values. The synthesized utterances can be mixed with an instrumental accompaniment track to produce a complete song. The proposed system is evaluated via subjective listening tests as well as in comparison to an available alternate system which also aims to produce synthetic singing voice from read-only training data. Results show that the proposed approach can produce high quality rapping/singing voice with increased naturalness.

* Proceedings of 11th ISCA Speech Synthesis Workshop (SSW 11)

Via

Access Paper or Ask Questions

Cross-lingual Low Resource Speaker Adaptation Using Phonological Features

Nov 17, 2021

Georgia Maniati, Nikolaos Ellinas, Konstantinos Markopoulos, Georgios Vamvoukakis, June Sig Sung, Hyoungmin Park, Aimilios Chalamandaris, Pirros Tsiakoulis

Figure 1 for Cross-lingual Low Resource Speaker Adaptation Using Phonological Features

Figure 2 for Cross-lingual Low Resource Speaker Adaptation Using Phonological Features

Figure 3 for Cross-lingual Low Resource Speaker Adaptation Using Phonological Features

Abstract:The idea of using phonological features instead of phonemes as input to sequence-to-sequence TTS has been recently proposed for zero-shot multilingual speech synthesis. This approach is useful for code-switching, as it facilitates the seamless uttering of foreign text embedded in a stream of native text. In our work, we train a language-agnostic multispeaker model conditioned on a set of phonologically derived features common across different languages, with the goal of achieving cross-lingual speaker adaptation. We first experiment with the effect of language phonological similarity on cross-lingual TTS of several source-target language combinations. Subsequently, we fine-tune the model with very limited data of a new speaker's voice in either a seen or an unseen language, and achieve synthetic speech of equal quality, while preserving the target speaker's identity. With as few as 32 and 8 utterances of target speaker data, we obtain high speaker similarity scores and naturalness comparable to the corresponding literature. In the extreme case of only 2 available adaptation utterances, we find that our model behaves as a few-shot learner, as the performance is similar in both the seen and unseen adaptation language scenarios.

* Proceedings of INTERSPEECH 2021

Via

Access Paper or Ask Questions