Nicola Pia

SEFGAN: Harvesting the Power of Normalizing Flows and GANs for Efficient High-Quality Speech Enhancement

Dec 04, 2023
Martin Strauss, Nicola Pia, Nagashree K. S. Rao, Bernd Edler

This paper proposes SEFGAN, a Deep Neural Network (DNN) combining maximum likelihood training and Generative Adversarial Networks (GANs) for efficient speech enhancement (SE). To this end, a DNN is trained to synthesize the enhanced speech conditioned on noisy speech, using a Normalizing Flow (NF) as the generator in a GAN framework. While combining likelihood models and GANs is not trivial, SEFGAN demonstrates that a hybrid adversarial and maximum likelihood training approach enables the model to maintain high-quality audio generation and log-likelihood estimation. Our experiments indicate that this approach strongly outperforms the baseline NF-based model without introducing additional complexity to the enhancement network. A comparison using computational metrics and a listening experiment reveals that SEFGAN is competitive with other state-of-the-art models.
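
The hybrid objective can be illustrated with a minimal PyTorch sketch: an affine-coupling flow maps clean speech to a Gaussian latent given noisy-speech conditioning (the likelihood term), and running the flow in reverse synthesizes speech from noise (the adversarial term). The coupling design, the discriminator interface, and the weight `lambda_adv` are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class CouplingFlow(nn.Module):
    """One affine coupling step conditioned on noisy-speech features;
    `channels` must be even. A real model would stack many such steps."""
    def __init__(self, channels: int, cond_channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels // 2 + cond_channels, 64, 3, padding=1),
            nn.ReLU(),
            nn.Conv1d(64, channels, 3, padding=1),  # predicts log_s and t
        )

    def forward(self, x, cond):
        xa, xb = x.chunk(2, dim=1)
        log_s, t = self.net(torch.cat([xa, cond], dim=1)).chunk(2, dim=1)
        z = torch.cat([xa, xb * log_s.exp() + t], dim=1)
        return z, log_s.sum(dim=[1, 2])  # latent, log|det J|

    def inverse(self, z, cond):
        za, zb = z.chunk(2, dim=1)
        log_s, t = self.net(torch.cat([za, cond], dim=1)).chunk(2, dim=1)
        return torch.cat([za, (zb - t) * (-log_s).exp()], dim=1)

def hybrid_loss(flow, disc, clean, cond, lambda_adv=1.0):
    # Maximum-likelihood term: clean speech -> latent under a N(0, I)
    # prior (negative log-likelihood up to an additive constant).
    z, logdet = flow(clean, cond)
    nll = (0.5 * z.pow(2).sum(dim=[1, 2]) - logdet).mean()
    # Adversarial term: sample the flow backwards and try to fool the
    # discriminator (hinge-style generator loss).
    fake = flow.inverse(torch.randn_like(z), cond)
    adv = -disc(fake).mean()
    return nll + lambda_adv * adv
```

The key property this preserves is that the same invertible network serves both terms, so adversarial training cannot silently break the model's ability to report log-likelihoods.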

* Preprint. Accepted to IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) 2023 

Low-Resource Text-to-Speech Using Specific Data and Noise Augmentation

Jun 16, 2023
Kishor Kayyar Lakshminarayana, Christian Dittmar, Nicola Pia, Emanuël Habets

Many neural text-to-speech architectures can synthesize nearly natural speech from text inputs. These architectures must be trained with tens of hours of annotated, high-quality speech data, and compiling such large databases for every new voice requires considerable time and effort. In this paper, we describe a method to extend the popular Tacotron-2 architecture and its training with data augmentation to enable single-speaker synthesis using a limited amount of specific training data. In contrast to elaborate augmentation methods proposed in the literature, we use simple stationary noises for data augmentation. Our extension is easy to implement and adds almost no computational overhead during training and inference. Using only two hours of training data, our approach was rated by human listeners to be on par with the baseline Tacotron-2 trained on 23.5 hours of LJSpeech data. In addition, we tested our model with a semantically unpredictable sentences test, which showed that both models exhibit similar intelligibility levels.
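
The augmentation itself is simple enough to sketch: mix stationary noise into each training utterance at a randomly drawn SNR. White noise and the SNR range below are illustrative assumptions; the paper's exact noise types and levels are not reproduced here.

```python
import torch

def augment_with_noise(wave: torch.Tensor,
                       snr_db_range=(5.0, 20.0)) -> torch.Tensor:
    """wave: (num_samples,) mono waveform. Returns wave + scaled noise."""
    snr_db = torch.empty(1).uniform_(*snr_db_range)
    noise = torch.randn_like(wave)          # stationary white noise
    sig_pow = wave.pow(2).mean()
    noise_pow = noise.pow(2).mean()
    # Choose gain g so that 10*log10(sig_pow / (g^2 * noise_pow)) = snr_db.
    gain = torch.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
    return wave + gain * noise
```

Because the noise is stationary, the augmentation adds essentially no preprocessing cost, which is consistent with the near-zero overhead claimed above.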

* Accepted for publication at EUSIPCO-2023, Helsinki 

NESC: Robust Neural End-2-End Speech Coding with GANs

Jul 07, 2022
Nicola Pia, Kishan Gupta, Srikanth Korse, Markus Multrus, Guillaume Fuchs

Neural networks have proven to be a formidable tool for tackling the problem of speech coding at very low bit rates. However, designing a neural coder that operates robustly under real-world conditions remains a major challenge. We therefore present the Neural End-2-End Speech Codec (NESC), a robust, scalable end-to-end neural speech codec for high-quality wideband speech coding at 3 kbps. The encoder uses a new architecture configuration that relies on our proposed Dual-PathConvRNN (DPCRNN) layer, while the decoder architecture is based on our previous work, Streamwise-StyleMelGAN. Our subjective listening tests on clean and noisy speech show that NESC is particularly robust to unseen conditions and signal perturbations.
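
The DPCRNN layer is not specified in this abstract, so the sketch below is only a generic dual-path convolution/RNN block in the spirit of the name: a depthwise convolution models short-term (intra-chunk) structure and a GRU models long-term (inter-chunk) structure. Chunk size, widths, and the residual wiring are assumptions.

```python
import torch
import torch.nn as nn

class DualPathConvRNN(nn.Module):
    def __init__(self, channels: int, chunk: int = 16):
        super().__init__()
        self.chunk = chunk
        self.intra = nn.Conv1d(channels, channels, 3, padding=1,
                               groups=channels)                    # local path
        self.inter = nn.GRU(channels, channels, batch_first=True)  # global path

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); time must be a multiple of `chunk`.
        b, c, t = x.shape
        n = t // self.chunk
        y = x.reshape(b, c, n, self.chunk)
        # Intra-chunk path: depthwise conv within each chunk.
        intra = y.permute(0, 2, 1, 3).reshape(b * n, c, self.chunk)
        intra = self.intra(intra).reshape(b, n, c, self.chunk)
        y = y + intra.permute(0, 2, 1, 3)                          # residual
        # Inter-chunk path: GRU across chunks at each intra position.
        inter = y.permute(0, 3, 2, 1).reshape(b * self.chunk, n, c)
        inter, _ = self.inter(inter)
        inter = inter.reshape(b, self.chunk, n, c).permute(0, 3, 2, 1)
        y = y + inter                                              # residual
        return y.reshape(b, c, t)
```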

* Accepted to Interspeech 2022. Please check our demo at: https://fhgspco.github.io/nesc/ 

PostGAN: A GAN-Based Post-Processor to Enhance the Quality of Coded Speech

Jan 31, 2022
Srikanth Korse, Nicola Pia, Kishan Gupta, Guillaume Fuchs

The quality of speech coded by transform coding is affected by various artefacts, especially when the bitrates used to quantize the frequency components become too low. To mitigate these coding artefacts and enhance the quality of coded speech, a post-processor that relies on a priori information transmitted from the encoder is traditionally employed at the decoder side. In recent years, several data-driven post-processors have been proposed and shown to outperform traditional approaches. In this paper, we propose PostGAN, a GAN-based neural post-processor that operates in the sub-band domain and relies on the U-Net architecture and a learned affine transform. It has been tested on the recently standardized low-complexity, low-delay Bluetooth codec (LC3) for wideband speech at the lowest bitrate (16 kbit/s). Subjective evaluations and objective scores show that the newly introduced post-processor surpasses previously published methods and can improve the quality of coded speech by around 20 MUSHRA points.
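
The "learned affine transform" can be read as FiLM-style conditioning: features derived at the decoder predict a per-channel, per-frame scale and shift that modulate activations inside the U-Net. The module below is an assumption-laden illustration of that idea, not the published design.

```python
import torch
import torch.nn as nn

class LearnedAffine(nn.Module):
    """Feature-conditioned affine modulation of intermediate activations."""
    def __init__(self, feat_channels: int, hidden_channels: int):
        super().__init__()
        # A single 1x1 conv predicts both scale (gamma) and shift (beta).
        self.proj = nn.Conv1d(feat_channels, 2 * hidden_channels, 1)

    def forward(self, h: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden_channels, frames) U-Net activations;
        # feats: (batch, feat_channels, frames), time-aligned conditioning.
        gamma, beta = self.proj(feats).chunk(2, dim=1)
        return h * (1 + gamma) + beta   # identity when gamma = beta = 0
```

Writing the scale as `1 + gamma` keeps the module close to a pass-through at initialization, a common choice for post-processors that should learn only a correction to the coded signal.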

* Accepted to ICASSP 2022 

A Streamwise GAN Vocoder for Wideband Speech Coding at Very Low Bit Rate

Aug 09, 2021
Ahmed Mustafa, Jan Büthe, Srikanth Korse, Kishan Gupta, Guillaume Fuchs, Nicola Pia

Recently, GAN vocoders have seen rapid progress in speech synthesis, starting to outperform autoregressive models in perceptual quality with much higher generation speed. However, autoregressive vocoders are still the common choice for the neural generation of speech signals coded at very low bit rates. In this paper, we present a GAN vocoder that generates wideband speech waveforms from parameters coded at 1.6 kbit/s. The proposed model is a modified version of the StyleMelGAN vocoder that can run in a frame-by-frame manner, making it suitable for streaming applications. The experimental results show that the proposed model significantly outperforms prior autoregressive vocoders like LPCNet for very low bit rate speech coding, with a computational complexity of about 5 GMACs, providing a new state of the art in this domain. Moreover, this streamwise adversarial vocoder delivers quality competitive with advanced speech codecs such as EVS at 5.9 kbit/s on clean speech, which motivates further use of feed-forward fully-convolutional models for low bit rate speech coding.
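
Frame-by-frame operation means the generator keeps state across calls and emits a fixed hop of samples per conditioning frame, so audio can be streamed with low latency. The toy model below only illustrates that interface; the GRU cell, 80-dimensional features, and 80-sample hop are assumptions, not the actual StyleMelGAN modification.

```python
import torch
import torch.nn as nn

class StreamVocoder(nn.Module):
    """Toy causal generator: one conditioning frame -> `hop` samples."""
    def __init__(self, feat_dim: int = 80, hop: int = 80):
        super().__init__()
        self.rnn = nn.GRUCell(feat_dim, 256)  # state carried across frames
        self.out = nn.Linear(256, hop)
        self.state = None

    def step(self, frame: torch.Tensor) -> torch.Tensor:
        # frame: (batch, feat_dim) -> (batch, hop) waveform samples
        self.state = self.rnn(frame, self.state)
        return torch.tanh(self.out(self.state))

voc = StreamVocoder()
features = torch.randn(100, 1, 80)   # 100 frames arriving one at a time
audio = torch.cat([voc.step(f) for f in features], dim=1)
print(audio.shape)                   # (1, 8000): 0.5 s at 16 kHz
```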

* Accepted to the 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA 2021) 

StyleMelGAN: An Efficient High-Fidelity Adversarial Vocoder with Temporal Adaptive Normalization

Nov 03, 2020
Ahmed Mustafa, Nicola Pia, Guillaume Fuchs

In recent years, neural vocoders have surpassed classical speech generation approaches in the naturalness and perceptual quality of the synthesized speech. Computationally heavy models like WaveNet and WaveGlow achieve the best results, while lightweight GAN models, e.g. MelGAN and Parallel WaveGAN, remain inferior in terms of perceptual quality. We therefore propose StyleMelGAN, a lightweight neural vocoder allowing synthesis of high-fidelity speech with low computational complexity. StyleMelGAN employs temporal adaptive normalization to style a low-dimensional noise vector with the acoustic features of the target speech. For efficient training, multiple random-window discriminators adversarially evaluate the speech signal analyzed by a filter bank, with regularization provided by a multi-scale spectral reconstruction loss. The highly parallelizable speech generation is several times faster than real time on CPUs and GPUs. MUSHRA and P.800 listening tests show that StyleMelGAN outperforms prior neural vocoders in copy-synthesis and Text-to-Speech scenarios.
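
Temporal adaptive normalization can be sketched as: normalize the activations, then apply a time-varying scale and shift predicted from the (upsampled) conditioning features. The kernel size, the InstanceNorm choice, and nearest-neighbor upsampling below are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAdaptiveNorm(nn.Module):
    def __init__(self, channels: int, feat_channels: int):
        super().__init__()
        self.norm = nn.InstanceNorm1d(channels, affine=False)
        # Conditioning features predict a style (gamma, beta) per time step.
        self.to_style = nn.Conv1d(feat_channels, 2 * channels, 3, padding=1)

    def forward(self, x: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, T) activations;
        # feats: (batch, feat_channels, T_feat) acoustic features.
        feats = F.interpolate(feats, size=x.shape[-1], mode="nearest")
        gamma, beta = self.to_style(feats).chunk(2, dim=1)
        return gamma * self.norm(x) + beta   # time-varying style modulation
```

Because gamma and beta vary along time, the acoustic features can re-style the activations frame by frame, which is the "styling" of the low-dimensional noise vector described above.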

* Submitted to ICASSP 2021 