"speech": models, code, and papers

Interactive spatial speech recognition maps based on simulated speech recognition experiments

Apr 01, 2021
Marc René Schädler

In everyday life, the speech recognition performance of human listeners is influenced by diverse factors, such as the acoustic environment, the talker and listener positions, possibly impaired hearing, and optional hearing devices. Prediction models come closer to considering all required factors simultaneously to predict the individual speech recognition performance in complex acoustic environments. While such predictions may still not be sufficiently accurate for serious applications, they can already be performed and demand an accessible representation. In this contribution, an interactive representation of speech recognition performance is proposed, which focuses on the listener's head orientation and the spatial dimensions of an acoustic scene. An exemplary modeling toolchain, including an acoustic rendering model, a hearing device model, and a listener model, was used to generate a data set for demonstration purposes. Using the spatial speech recognition maps to explore this data set demonstrated the suitability of the approach to observe possibly relevant behavior. The proposed representation provides a suitable target to compare and validate different modeling approaches in ecologically relevant contexts. Eventually, it may serve as a tool to use validated prediction models in the design of spaces and devices which take speech communication into account.

* 16 pages, 11 figures, related code 

Modified Mel Filter Bank to Compute MFCC of Subsampled Speech

Oct 25, 2014
Kiran Kumar Bhuvanagiri, Sunil Kumar Kopparapu

Mel Frequency Cepstral Coefficients (MFCCs) are the most widely used speech features in speech and speaker recognition applications. In this work, we propose a modified Mel filter bank to extract MFCCs from subsampled speech. We also propose a stronger metric which effectively captures the correlation between the MFCCs of original speech and the MFCCs of resampled speech. It is found that the proposed method of filter bank construction performs distinguishably well and gives recognition performance on resampled speech close to the recognition accuracies on original speech.

* arXiv admin note: substantial text overlap with arXiv:1410.6903 
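
To make the filter bank idea concrete, here is a minimal sketch of a standard triangular Mel filter bank layout and of what happens to it when speech is subsampled. It is not the modified filter bank proposed in the paper, whose construction the abstract does not describe. The helper names, the 26-filter count, and the 16 kHz to 8 kHz scenario are all assumptions chosen for illustration.

```python
import math


def hz_to_mel(f_hz):
    # Common O'Shaughnessy-style Mel formula (one of several conventions).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filter_edges(num_filters, f_max_hz, f_min_hz=0.0):
    """Return num_filters + 2 edge frequencies (Hz), equally spaced in Mel."""
    m_lo, m_hi = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (m_hi - m_lo) / (num_filters + 1)
    return [mel_to_hz(m_lo + i * step) for i in range(num_filters + 2)]


def triangular_weight(f_hz, lo, centre, hi):
    """Weight of one triangular filter at frequency f_hz."""
    if f_hz <= lo or f_hz >= hi:
        return 0.0
    if f_hz <= centre:
        return (f_hz - lo) / (centre - lo)
    return (hi - f_hz) / (hi - centre)


def filters_below_nyquist(edges, nyquist_hz):
    """Indices of filters that lie entirely below the given Nyquist frequency."""
    return [i for i in range(len(edges) - 2) if edges[i + 2] <= nyquist_hz]


# Hypothetical scenario: filter bank designed for 16 kHz speech (Nyquist 8 kHz),
# then applied to the same speech subsampled to 8 kHz (Nyquist 4 kHz).
edges_16k = mel_filter_edges(num_filters=26, f_max_hz=8000.0)
usable_at_8k = filters_below_nyquist(edges_16k, nyquist_hz=4000.0)
print(len(usable_at_8k), "of 26 filters fit below the 4 kHz Nyquist")
```

Only part of the original filter bank fits below the reduced Nyquist frequency, which is the kind of mismatch a filter bank redesigned for subsampled speech has to address.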

Sampling-based speech parameter generation using moment-matching networks

Apr 12, 2017
Shinnosuke Takamichi, Tomoki Koriyama, Hiroshi Saruwatari

This paper presents sampling-based speech parameter generation using moment-matching networks for Deep Neural Network (DNN)-based speech synthesis. Although people never produce exactly the same speech even when they try to express the same linguistic and para-linguistic information, typical statistical speech synthesis produces exactly the same speech, i.e., there is no inter-utterance variation in synthetic speech. To give synthetic speech natural inter-utterance variation, this paper builds DNN acoustic models that make it possible to randomly sample speech parameters. The DNNs are trained so that they make the moments of generated speech parameters close to those of natural speech parameters. Since the variation of speech parameters is compressed into a low-dimensional simple prior noise vector, our algorithm has lower computation cost than direct sampling of speech parameters. As the first step towards generating synthetic speech that has natural inter-utterance variation, this paper investigates whether or not the proposed sampling-based generation deteriorates synthetic speech quality. In the evaluation, we compare the speech quality of conventional maximum likelihood-based generation and the proposed sampling-based generation. The result demonstrates that the proposed generation causes no degradation in speech quality.

* Submitted to INTERSPEECH 2017 
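
One common way to train a network so that the moments of generated samples match those of real data is the kernel maximum mean discrepancy (MMD). The sketch below computes a biased squared-MMD estimate with a Gaussian kernel on toy vectors. The abstract does not state the exact criterion the paper uses, so treat this as an illustration of moment matching in general. The function names, the kernel width, and the toy data are assumptions.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd_squared(xs, ys, sigma=1.0):
    """Biased estimate of the squared MMD between two sets of vectors."""
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(p, q, sigma) for p in a for q in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Toy one-dimensional "speech parameters" (hypothetical values).
natural = [[0.0], [1.0], [2.0]]
generated_close = [[0.1], [1.1], [1.9]]
generated_far = [[5.0], [6.0], [7.0]]

print(mmd_squared(natural, generated_close))  # small: distributions are similar
print(mmd_squared(natural, generated_far))    # larger: distributions differ
```

Used as a training loss, a quantity like this would be minimized with respect to the generator's parameters, with the generated set produced by feeding random noise vectors through the network.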

A Bengali HMM Based Speech Synthesis System

Jun 16, 2014
Sankar Mukherjee, Shyamal Kumar Das Mandal

The paper presents the capability of an HMM-based TTS system to produce Bengali speech. In this synthesis method, trajectories of speech parameters are generated from the trained Hidden Markov Models. A final speech waveform is synthesized from those speech parameters. In our experiments, spectral properties were represented by Mel Cepstrum Coefficients. Both the training and synthesis issues are investigated in this paper using an annotated Bengali speech database. Experimental evaluation shows that the developed text-to-speech system is capable of producing adequately natural speech in terms of intelligibility and intonation for Bengali.

* Oriental COCOSDA 2012, pp.225 259 

Creating Speech-to-Speech Corpus from Dubbed Series

Mar 07, 2022
Massa Baali, Wassim El-Hajj, Ahmed Ali

Dubbed series have been gaining a lot of popularity in recent years with strong support from major media service providers. Such popularity is fueled by studies that showed that dubbed versions of TV shows are more popular than their subtitled equivalents. We propose an unsupervised approach to construct a speech-to-speech corpus, aligned on short segment levels, to produce a parallel speech corpus in the source and target languages. Our methodology exploits video frames, speech recognition, machine translation, and noisy frames removal algorithms to match segments in both languages. To verify the performance of the proposed method, we apply it on long and short dubbed clips. Out of 36 hours of TR-AR dubbed series, our pipeline was able to generate 17 hours of paired segments, which is about 47% of the corpus. We applied our method on another language pair, EN-AR, to ensure it is robust enough and not tuned for a specific language or a specific corpus. Regardless of the language pairs, the accuracy of the paired segments was around 70% when evaluated using human subjective evaluation. The corpus will be freely available for the research community.
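
The reported yield can be checked with a line of arithmetic. The variable names below are illustrative; the two hour counts are the ones quoted in the abstract.

```python
source_hours = 36.0   # TR-AR dubbed series processed
paired_hours = 17.0   # paired segments produced by the pipeline

yield_percent = paired_hours / source_hours * 100.0
print(f"{yield_percent:.1f}% of the corpus was paired")
```

This gives roughly 47%, matching the figure stated above.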

Unsupervised Speech Domain Adaptation Based on Disentangled Representation Learning for Robust Speech Recognition

Apr 12, 2019
Jong-Hyeon Park, Myungwoo Oh, Hyung-Min Park

In general, the performance of automatic speech recognition (ASR) systems is significantly degraded due to the mismatch between training and test environments. Recently, a deep-learning-based image-to-image translation technique to translate an image from a source domain to a desired domain was presented, and cycle-consistent adversarial network (CycleGAN) was applied to learn a mapping for speech-to-speech conversion from a speaker to a target speaker. However, this method might not be adequate to remove corrupting noise components for robust ASR because it was designed to convert speech itself. In this paper, we propose a domain adaptation method based on generative adversarial nets (GANs) with disentangled representation learning to achieve robustness in ASR systems. In particular, two separated encoders, context and domain encoders, are introduced to learn distinct latent variables. The latent variables allow us to convert the domain of speech according to its context and domain representation. We improved word accuracies by 6.55% to 15.70% for the CHiME4 challenge corpus by applying a noisy-to-clean environment adaptation for robust ASR. In addition, similar to the method based on the CycleGAN, this method can be used for gender adaptation in gender-mismatched recognition.

* Submitted to Interspeech 2019 

VoiceFixer: A Unified Framework for High-Fidelity Speech Restoration

Apr 17, 2022
Haohe Liu, Xubo Liu, Qiuqiang Kong, Qiao Tian, Yan Zhao, DeLiang Wang, Chuanzeng Huang, Yuxuan Wang

Speech restoration aims to remove distortions in speech signals. Prior methods mainly focus on a single type of distortion, such as speech denoising or dereverberation. However, speech signals can be degraded by several different distortions simultaneously in the real world. It is thus important to extend speech restoration models to deal with multiple distortions. In this paper, we introduce VoiceFixer, a unified framework for high-fidelity speech restoration. VoiceFixer restores speech from multiple distortions (e.g., noise, reverberation, and clipping) and can expand degraded speech (e.g., noisy speech) with a low bandwidth to 44.1 kHz full-bandwidth high-fidelity speech. We design VoiceFixer based on (1) an analysis stage that predicts intermediate-level features from the degraded speech, and (2) a synthesis stage that generates waveform using a neural vocoder. Both objective and subjective evaluations show that VoiceFixer is effective on severely degraded speech, such as real-world historical speech recordings. Samples of VoiceFixer are available at

* Submitted to INTERSPEECH 2022 

Enhancing Speech Intelligibility in Text-To-Speech Synthesis using Speaking Style Conversion

Aug 13, 2020
Dipjyoti Paul, Muhammed PV Shifas, Yannis Pantazis, Yannis Stylianou

The increased adoption of digital assistants makes text-to-speech (TTS) synthesis systems an indispensable feature of modern mobile devices. It is hence desirable to build a system capable of generating highly intelligible speech in the presence of noise. Past studies have investigated style conversion in TTS synthesis, yet degraded synthesized quality often leads to worse intelligibility. To overcome such limitations, we propose a novel transfer learning approach using Tacotron and WaveRNN based TTS synthesis. The proposed speech system exploits two modification strategies: (a) Lombard speaking style data and (b) Spectral Shaping and Dynamic Range Compression (SSDRC), which has been shown to provide high intelligibility gains by redistributing the signal energy in the time-frequency domain. We refer to this extension as the Lombard-SSDRC TTS system. Intelligibility enhancement as quantified by the Intelligibility in Bits (SIIB-Gauss) measure shows that the proposed Lombard-SSDRC TTS system achieves a significant relative improvement of between 110% and 130% in speech-shaped noise (SSN), and 47% to 140% in competing-speaker noise (CSN), against the state-of-the-art TTS approach. Additional subjective evaluation shows that Lombard-SSDRC TTS successfully increases the speech intelligibility, with a relative improvement of 455% for SSN and 104% for CSN in median keyword correction rate compared to the baseline TTS method.

* Accepted in INTERSPEECH 2020 

A$^3$T: Alignment-Aware Acoustic and Text Pretraining for Speech Synthesis and Editing

Mar 18, 2022
He Bai, Renjie Zheng, Junkun Chen, Xintong Li, Mingbo Ma, Liang Huang

Recently, speech representation learning has improved many speech-related tasks such as speech recognition, speech classification, and speech-to-text translation. However, all the above tasks are in the direction of speech understanding; for the inverse direction, speech synthesis, the potential of representation learning is yet to be realized, due to the challenging nature of generating high-quality speech. To address this problem, we propose our framework, Alignment-Aware Acoustic-Text Pretraining (A$^3$T), which reconstructs masked acoustic signals with text input and acoustic-text alignment during training. In this way, the pretrained model can generate high-quality reconstructed spectrograms, which can be applied directly to speech editing and unseen-speaker TTS. Experiments show that A$^3$T outperforms SOTA models on speech editing, and improves multi-speaker speech synthesis without an external speaker verification model.

* under review, 12 pages, 10 figures 