Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"speech": models, code, and papers

Attention-gated convolutional neural networks for off-resonance correction of spiral real-time MRI

Feb 14, 2021
Yongwan Lim, Shrikanth S. Narayanan, Krishna S. Nayak

Spiral acquisitions are preferred in real-time MRI because of their efficiency, which has made it possible to capture vocal tract dynamics during natural speech. A fundamental limitation of spirals is blurring and signal loss due to off-resonance, which degrades image quality at air-tissue boundaries. Here, we present a new CNN-based off-resonance correction method that incorporates an attention-gate mechanism. This leverages spatial and channel relationships of filtered outputs and improves the expressiveness of the networks. We demonstrate improved performance with the attention-gate, on 1.5 Tesla spiral speech RT-MRI, compared to existing off-resonance correction methods.

* 28th Int. Soc. Magn. Reson. Med. (ISMRM) Scientific Sessions, 2020, p.1005 
* 8 pages, 4 figures, 1 table 

  Access Paper or Ask Questions

Challenges of Computational Processing of Code-Switching

Oct 07, 2016
Özlem Çetinoğlu, Sarah Schulz, Ngoc Thang Vu

This paper addresses challenges of Natural Language Processing (NLP) on non-canonical multilingual data in which two or more languages are mixed. It refers to code-switching which has become more popular in our daily life and therefore obtains an increasing amount of attention from the research community. We report our experience that cov- ers not only core NLP tasks such as normalisation, language identification, language modelling, part-of-speech tagging and dependency parsing but also more downstream ones such as machine translation and automatic speech recognition. We highlight and discuss the key problems for each of the tasks with supporting examples from different language pairs and relevant previous work.

* Will appear in the Proceedings of the 2nd Workshop on Computational Approaches to Linguistic Code Switching @EMNLP, 2016 

  Access Paper or Ask Questions

CycleTransGAN-EVC: A CycleGAN-based Emotional Voice Conversion Model with Transformer

Nov 30, 2021
Changzeng Fu, Chaoran Liu, Carlos Toshinori Ishi, Hiroshi Ishiguro

In this study, we explore the transformer's ability to capture intra-relations among frames by augmenting the receptive field of models. Concretely, we propose a CycleGAN-based model with the transformer and investigate its ability in the emotional voice conversion task. In the training procedure, we adopt curriculum learning to gradually increase the frame length so that the model can see from the short segment till the entire speech. The proposed method was evaluated on the Japanese emotional speech dataset and compared to several baselines (ACVAE, CycleGAN) with objective and subjective evaluations. The results show that our proposed model is able to convert emotion with higher strength and quality.

  Access Paper or Ask Questions

A Robust Maximum Likelihood Distortionless Response Beamformer based on a Complex Generalized Gaussian Distribution

Feb 19, 2021
Weixin Meng, Chengshi Zheng, Xiaodong Li

For multichannel speech enhancement, this letter derives a robust maximum likelihood distortionless response beamformer by modeling speech sparse priors with a complex generalized Gaussian distribution, where we refer to as the CGGD-MLDR beamformer. The proposed beamformer can be regarded as a generalization of the minimum power distortionless response beamformer and its improved variations. For narrowband applications, we also reveal that the proposed beamformer reduces to the minimum dispersion distortionless response beamformer, which has been derived with the ${{\ell}_{p}}$-norm minimization. The mechanisms of the proposed beamformer in improving the robustness are clearly pointed out and experimental results show its better performance in PESQ improvement.

  Access Paper or Ask Questions

Show and Speak: Directly Synthesize Spoken Description of Images

Oct 23, 2020
Xinsheng Wang, Siyuan Feng, Jihua Zhu, Mark Hasegawa-Johnson, Odette Scharenborg

This paper proposes a new model, referred to as the show and speak (SAS) model that, for the first time, is able to directly synthesize spoken descriptions of images, bypassing the need for any text or phonemes. The basic structure of SAS is an encoder-decoder architecture that takes an image as input and predicts the spectrogram of speech that describes this image. The final speech audio is obtained from the predicted spectrogram via WaveNet. Extensive experiments on the public benchmark database Flickr8k demonstrate that the proposed SAS is able to synthesize natural spoken descriptions for images, indicating that synthesizing spoken descriptions for images while bypassing text and phonemes is feasible.

  Access Paper or Ask Questions

Bayesian Subspace HMM for the Zerospeech 2020 Challenge

May 19, 2020
Bolaji Yusuf, Lucas Ondel

In this paper we describe our submission to the Zerospeech 2020 challenge, where the participants are required to discover latent representations from unannotated speech, and to use those representations to perform speech synthesis, with synthesis quality used as a proxy metric for the unit quality. In our system, we use the Bayesian Subspace Hidden Markov Model (SHMM) for unit discovery. The SHMM models each unit as an HMM whose parameters are constrained to lie in a low dimensional subspace of the total parameter space which is trained to model phonetic variability. Our system compares favorably with the baseline on the human-evaluated character error rate while maintaining significantly lower unit bitrate.

* Submitted to INTERSPEECH 2020 

  Access Paper or Ask Questions

Chirp Complex Cepstrum-based Decomposition for Asynchronous Glottal Analysis

May 10, 2020
Thomas Drugman, Thierry Dutoit

It was recently shown that complex cepstrum can be effectively used for glottal flow estimation by separating the causal and anticausal components of speech. In order to guarantee a correct estimation, some constraints on the window have been derived. Among these, the window has to be synchronized on a Glottal Closure Instant. This paper proposes an extension of the complex cepstrum-based decomposition by incorporating a chirp analysis. The resulting method is shown to give a reliable estimation of the glottal flow wherever the window is located. This technique is then suited for its integration in usual speech processing systems, which generally operate in an asynchronous way. Besides its potential for automatic voice quality analysis is highlighted.

  Access Paper or Ask Questions

Dual-label Deep LSTM Dereverberation For Speaker Verification

Sep 08, 2018
Hao Zhang, Stephen Zahorian, Xiao Chen, Peter Guzewich, Xiaoyu Liu

In this paper, we present a reverberation removal approach for speaker verification, utilizing dual-label deep neural networks (DNNs). The networks perform feature mapping between the spectral features of reverberant and clean speech. Long short term memory recurrent neural networks (LSTMs) are trained to map corrupted Mel filterbank (MFB) features to two sets of labels: i) the clean MFB features, and ii) either estimated pitch tracks or the fast Fourier transform (FFT) spectrogram of clean speech. The performance of reverberation removal is evaluated by equal error rates (EERs) of speaker verification experiments.

* 4 pages, 3 figures, submitted to Interspeech 2018 

  Access Paper or Ask Questions

The Emotional Voices Database: Towards Controlling the Emotion Dimension in Voice Generation Systems

Jun 25, 2018
Adaeze Adigwe, Noé Tits, Kevin El Haddad, Sarah Ostadabbas, Thierry Dutoit

In this paper, we present a database of emotional speech intended to be open-sourced and used for synthesis and generation purpose. It contains data for male and female actors in English and a male actor in French. The database covers 5 emotion classes so it could be suitable to build synthesis and voice transformation systems with the potential to control the emotional dimension in a continuous way. We show the data's efficiency by building a simple MLP system converting neutral to angry speech style and evaluate it via a CMOS perception test. Even though the system is a very simple one, the test show the efficiency of the data which is promising for future work.

* Submitted to SLSP 2018 

  Access Paper or Ask Questions

Fitting New Speakers Based on a Short Untranscribed Sample

Feb 20, 2018
Eliya Nachmani, Adam Polyak, Yaniv Taigman, Lior Wolf

Learning-based Text To Speech systems have the potential to generalize from one speaker to the next and thus require a relatively short sample of any new voice. However, this promise is currently largely unrealized. We present a method that is designed to capture a new speaker from a short untranscribed audio sample. This is done by employing an additional network that given an audio sample, places the speaker in the embedding space. This network is trained as part of the speech synthesis system using various consistency losses. Our results demonstrate a greatly improved performance on both the dataset speakers, and, more importantly, when fitting new voices, even from very short samples.

  Access Paper or Ask Questions