Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"speech": models, code, and papers

Neural Dubber: Dubbing for Silent Videos According to Scripts

Oct 15, 2021
Chenxu Hu, Qiao Tian, Tingle Li, Yuping Wang, Yuxuan Wang, Hang Zhao

Dubbing is a post-production process of re-recording actors' dialogues, which is extensively used in filmmaking and video production. It is usually performed manually by professional voice actors who read lines with proper prosody, and in synchronization with the pre-recorded videos. In this work, we propose Neural Dubber, the first neural network model to solve a novel automatic video dubbing (AVD) task: synthesizing human speech synchronized with the given silent video from the text. Neural Dubber is a multi-modal text-to-speech (TTS) model that utilizes the lip movement in the video to control the prosody of the generated speech. Furthermore, an image-based speaker embedding (ISE) module is developed for the multi-speaker setting, which enables Neural Dubber to generate speech with a reasonable timbre according to the speaker's face. Experiments on the chemistry lecture single-speaker dataset and LRS2 multi-speaker dataset show that Neural Dubber can generate speech audios on par with state-of-the-art TTS models in terms of speech quality. Most importantly, both qualitative and quantitative evaluations show that Neural Dubber can control the prosody of synthesized speech by the video, and generate high-fidelity speech temporally synchronized with the video.

* Accepted by NeurIPS 2021 

  Access Paper or Ask Questions

Separation Guided Speaker Diarization in Realistic Mismatched Conditions

Jul 06, 2021
Shu-Tong Niu, Jun Du, Lei Sun, Chin-Hui Lee

We propose a separation guided speaker diarization (SGSD) approach by fully utilizing a complementarity of speech separation and speaker clustering. Since the conventional clustering-based speaker diarization (CSD) approach cannot well handle overlapping speech segments, we investigate, in this study, separation-based speaker diarization (SSD) which inherently has the potential to handle the speaker overlap regions. Our preliminary analysis shows that the state-of-the-art Conv-TasNet based speech separation, which works quite well on the simulation data, is unstable in realistic conversational speech due to the high mismatch speaking styles in simulated training speech and read speech. In doing so, separation-based processing can assist CSD in handling the overlapping speech segments under the realistic mismatched conditions. Specifically, several strategies are designed to select between the results of SSD and CSD systems based on an analysis of the instability of the SSD system performances. Experiments on the conversational telephone speech (CTS) data from DIHARD-III Challenge show that the proposed SGSD system can significantly improve the performance of state-of-the-art CSD systems, yielding relative diarization error rate reductions of 20.2% and 20.8% on the development set and evaluation set, respectively.

  Access Paper or Ask Questions

PeriodNet: A non-autoregressive waveform generation model with a structure separating periodic and aperiodic components

Feb 15, 2021
Yukiya Hono, Shinji Takaki, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, Keiichi Tokuda

We propose PeriodNet, a non-autoregressive (non-AR) waveform generation model with a new model structure for modeling periodic and aperiodic components in speech waveforms. The non-AR waveform generation models can generate speech waveforms parallelly and can be used as a speech vocoder by conditioning an acoustic feature. Since a speech waveform contains periodic and aperiodic components, both components should be appropriately modeled to generate a high-quality speech waveform. However, it is difficult to decompose the components from a natural speech waveform in advance. To address this issue, we propose a parallel model and a series model structure separating periodic and aperiodic components. The features of our proposed models are that explicit periodic and aperiodic signals are taken as input, and external periodic/aperiodic decomposition is not needed in training. Experiments using a singing voice corpus show that our proposed structure improves the naturalness of the generated waveform. We also show that the speech waveforms with a pitch outside of the training data range can be generated with more naturalness.

* 5 pages, accepted to ICASSP 2021 

  Access Paper or Ask Questions

Improving Cross-Lingual Transfer Learning for End-to-End Speech Recognition with Speech Translation

Jun 09, 2020
Changhan Wang, Juan Pino, Jiatao Gu

Transfer learning from high-resource languages is known to be an efficient way to improve end-to-end automatic speech recognition (ASR) for low-resource languages. Pre-trained or jointly trained encoder-decoder models, however, do not share the language modeling (decoder) for the same language, which is likely to be inefficient for distant target languages. We introduce speech-to-text translation (ST) as an auxiliary task to incorporate additional knowledge of the target language and enable transferring from that target language. Specifically, we first translate high-resource ASR transcripts into a target low-resource language, with which a ST model is trained. Both ST and target ASR share the same attention-based encoder-decoder architecture and vocabulary. The former task then provides a fully pre-trained model for the latter, bringing up to 24.6% word error rate (WER) reduction to the baseline (direct transfer from high-resource ASR). We show that training ST with human translations is not necessary. ST trained with machine translation (MT) pseudo-labels brings consistent gains. It can even outperform those using human labels when transferred to target ASR by leveraging only 500K MT examples. Even with pseudo-labels from low-resource MT (200K examples), ST-enhanced transfer brings up to 8.9% WER reduction to direct transfer.

* Submitted to INTERSPEECH 2020 

  Access Paper or Ask Questions

Cloning one's voice using very limited data in the wild

Oct 08, 2021
Dongyang Dai, Yuanzhe Chen, Li Chen, Ming Tu, Lu Liu, Rui Xia, Qiao Tian, Yuping Wang, Yuxuan Wang

With the increasing popularity of speech synthesis products, the industry has put forward more requirements for personalized speech synthesis: (1) How to use low-resource, easily accessible data to clone a person's voice. (2) How to clone a person's voice while controlling the style and prosody. To solve the above two problems, we proposed the Hieratron model framework in which the prosody and timbre are modeled separately using two modules, therefore, the independent control of timbre and the other characteristics of audio can be achieved while generating speech. The practice shows that, for very limited target speaker data in the wild, Hieratron has obvious advantages over the traditional method, in addition to controlling the style and language of the generated speech, the mean opinion score on speech quality of the generated speech has also been improved by more than 0.2 points.

  Access Paper or Ask Questions

KSoF: The Kassel State of Fluency Dataset -- A Therapy Centered Dataset of Stuttering

Mar 10, 2022
Sebastian P. Bayerl, Alexander Wolff von Gudenberg, Florian Hönig, Elmar Nöth, Korbinian Riedhammer

Stuttering is a complex speech disorder that negatively affects an individual's ability to communicate effectively. Persons who stutter (PWS) often suffer considerably under the condition and seek help through therapy. Fluency shaping is a therapy approach where PWSs learn to modify their speech to help them to overcome their stutter. Mastering such speech techniques takes time and practice, even after therapy. Shortly after therapy, success is evaluated highly, but relapse rates are high. To be able to monitor speech behavior over a long time, the ability to detect stuttering events and modifications in speech could help PWSs and speech pathologists to track the level of fluency. Monitoring could create the ability to intervene early by detecting lapses in fluency. To the best of our knowledge, no public dataset is available that contains speech from people who underwent stuttering therapy that changed the style of speaking. This work introduces the Kassel State of Fluency (KSoF), a therapy-based dataset containing over 5500 clips of PWSs. The clips were labeled with six stuttering-related event types: blocks, prolongations, sound repetitions, word repetitions, interjections, and - specific to therapy - speech modifications. The audio was recorded during therapy sessions at the Institut der Kasseler Stottertherapie. The data will be made available for research purposes upon request.

* Submitted to LREC 2022 Conference on Language Resources and Evaluation 

  Access Paper or Ask Questions

StyleMelGAN: An Efficient High-Fidelity Adversarial Vocoder with Temporal Adaptive Normalization

Nov 03, 2020
Ahmed Mustafa, Nicola Pia, Guillaume Fuchs

In recent years, neural vocoders have surpassed classical speech generation approaches in naturalness and perceptual quality of the synthesized speech. Computationally heavy models like WaveNet and WaveGlow achieve best results, while lightweight GAN models, e.g. MelGAN and Parallel WaveGAN, remain inferior in terms of perceptual quality. We therefore propose StyleMelGAN, a lightweight neural vocoder allowing synthesis of high-fidelity speech with low computational complexity. StyleMelGAN employs temporal adaptive normalization to style a low-dimensional noise vector with the acoustic features of the target speech. For efficient training, multiple random-window discriminators adversarially evaluate the speech signal analyzed by a filter bank, with regularization provided by a multi-scale spectral reconstruction loss. The highly parallelizable speech generation is several times faster than real-time on CPUs and GPUs. MUSHRA and P.800 listening tests show that StyleMelGAN outperforms prior neural vocoders in copy-synthesis and Text-to-Speech scenarios.

* Submitted to ICASSP 2021 

  Access Paper or Ask Questions

USEV: Universal Speaker Extraction with Visual Cue

Sep 30, 2021
Zexu Pan, Meng Ge, Haizhou Li

A speaker extraction algorithm seeks to extract the target speaker's voice from a multi-talker speech mixture. An auxiliary reference, such as a video recording or a pre-recorded speech, is usually used as a cue to form a top-down auditory attention. The prior studies are focused mostly on speaker extraction from a multi-talker speech mixture with highly overlapping speakers. However, a multi-talker speech mixture is often sparsely overlapped, furthermore, the target speaker could even be absent sometimes. In this paper, we propose a universal speaker extraction network that works for all multi-talker scenarios, where the target speaker can be either absent or present. When the target speaker is present, the network performs over a wide range of target-interference speaker overlapping ratios, from 0% to 100%. The speech in such universal multi-talker scenarios is generally described as sparsely overlapped speech. We advocate that a visual cue, i.e. lips movement, is more informative to serve as the auxiliary reference than an audio cue, i.e. pre-recorded speech. In addition, we propose a scenario-aware differentiated loss function for network training. The experimental results show that our proposed network outperforms various competitive baselines in disentangling sparsely overlapped speech in terms of signal fidelity and perceptual evaluations.

  Access Paper or Ask Questions

Acoustic-to-Word Models with Conversational Context Information

May 21, 2019
Suyoun Kim, Florian Metze

Conversational context information, higher-level knowledge that spans across sentences, can help to recognize a long conversation. However, existing speech recognition models are typically built at a sentence level, and thus it may not capture important conversational context information. The recent progress in end-to-end speech recognition enables integrating context with other available information (e.g., acoustic, linguistic resources) and directly recognizing words from speech. In this work, we present a direct acoustic-to-word, end-to-end speech recognition model capable of utilizing the conversational context to better process long conversations. We evaluate our proposed approach on the Switchboard conversational speech corpus and show that our system outperforms a standard end-to-end speech recognition system.

* NAACL 2019. arXiv admin note: text overlap with arXiv:1808.02171 

  Access Paper or Ask Questions

Towards Automated Assessment of Stuttering and Stuttering Therapy

Jun 16, 2020
Sebastian P. Bayerl, Florian Hönig, Joelle Reister, Korbinian Riedhammer

Stuttering is a complex speech disorder that can be identified by repetitions, prolongations of sounds, syllables or words, and blocks while speaking. Severity assessment is usually done by a speech therapist. While attempts at automated assessment were made, it is rarely used in therapy. Common methods for the assessment of stuttering severity include percent stuttered syllables (% SS), the average of the three longest stuttering symptoms during a speech task, or the recently introduced Speech Efficiency Score (SES). This paper introduces the Speech Control Index (SCI), a new method to evaluate the severity of stuttering. Unlike SES, it can also be used to assess therapy success for fluency shaping. We evaluate both SES and SCI on a new comprehensively labeled dataset containing stuttered German speech of clients prior to, during, and after undergoing stuttering therapy. Phone alignments of an automatic speech recognition system are statistically evaluated in relation to their relative position to labeled stuttering events. The results indicate that phone length distributions differ with respect to their position in and around labeled stuttering events

* 10 pages, 3 figures, 1 table Accepted at TSD 2020, 23rd International Conference on Text, Speech and Dialogue 

  Access Paper or Ask Questions