Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"speech": models, code, and papers

Supervised Attention in Sequence-to-Sequence Models for Speech Recognition

Apr 25, 2022
Gene-Ping Yang, Hao Tang

Attention mechanism in sequence-to-sequence models is designed to model the alignments between acoustic features and output tokens in speech recognition. However, attention weights produced by models trained end to end do not always correspond well with actual alignments, and several studies have further argued that attention weights might not even correspond well with the relevance attribution of frames. Regardless, visual similarity between attention weights and alignments is widely used during training as an indicator of the models quality. In this paper, we treat the correspondence between attention weights and alignments as a learning problem by imposing a supervised attention loss. Experiments have shown significant improved performance, suggesting that learning the alignments well during training critically determines the performance of sequence-to-sequence models.

* Accepted at ICASSP 2022 

  Access Paper or Ask Questions

An Asynchronous WFST-Based Decoder For Automatic Speech Recognition

Mar 16, 2021
Hang Lv, Zhehuai Chen, Hainan Xu, Daniel Povey, Lei Xie, Sanjeev Khudanpur

We introduce asynchronous dynamic decoder, which adopts an efficient A* algorithm to incorporate big language models in the one-pass decoding for large vocabulary continuous speech recognition. Unlike standard one-pass decoding with on-the-fly composition decoder which might induce a significant computation overhead, the asynchronous dynamic decoder has a novel design where it has two fronts, with one performing "exploration" and the other "backfill". The computation of the two fronts alternates in the decoding process, resulting in more effective pruning than the standard one-pass decoding with an on-the-fly composition decoder. Experiments show that the proposed decoder works notably faster than the standard one-pass decoding with on-the-fly composition decoder, while the acceleration will be more obvious with the increment of data complexity.

* 5 pages, 5 figures, icassp 

  Access Paper or Ask Questions

Do End-to-End Speech Recognition Models Care About Context?

Feb 17, 2021
Lasse Borgholt, Jakob Drachmann Havtorn, Željko Agić, Anders Søgaard, Lars Maaløe, Christian Igel

The two most common paradigms for end-to-end speech recognition are connectionist temporal classification (CTC) and attention-based encoder-decoder (AED) models. It has been argued that the latter is better suited for learning an implicit language model. We test this hypothesis by measuring temporal context sensitivity and evaluate how the models perform when we constrain the amount of contextual information in the audio input. We find that the AED model is indeed more context sensitive, but that the gap can be closed by adding self-attention to the CTC model. Furthermore, the two models perform similarly when contextual information is constrained. Finally, in contrast to previous research, our results show that the CTC model is highly competitive on WSJ and LibriSpeech without the help of an external language model.

* Published in the proceedings of INTERSPEECH 2020, pp. 4352-4356 

  Access Paper or Ask Questions

WaveGlow: A Flow-based Generative Network for Speech Synthesis

Oct 31, 2018
Ryan Prenger, Rafael Valle, Bryan Catanzaro

In this paper we propose WaveGlow: a flow-based network capable of generating high quality speech from mel-spectrograms. WaveGlow combines insights from Glow and WaveNet in order to provide fast, efficient and high-quality audio synthesis, without the need for auto-regression. WaveGlow is implemented using only a single network, trained using only a single cost function: maximizing the likelihood of the training data, which makes the training procedure simple and stable. Our PyTorch implementation produces audio samples at a rate of more than 500 kHz on an NVIDIA V100 GPU. Mean Opinion Scores show that it delivers audio quality as good as the best publicly available WaveNet implementation. All code will be made publicly available online.

* 5 pages, 1 figure, 1 table, 13 equations 

  Access Paper or Ask Questions

ZR-2021VG: Zero-Resource Speech Challenge, Visually-Grounded Language Modelling track, 2021 edition

Jul 14, 2021
Afra Alishahia, Grzegorz Chrupała, Alejandrina Cristia, Emmanuel Dupoux, Bertrand Higy, Marvin Lavechin, Okko Räsänen, Chen Yu

We present the visually-grounded language modelling track that was introduced in the Zero-Resource Speech challenge, 2021 edition, 2nd round. We motivate the new track and discuss participation rules in detail. We also present the two baseline systems that were developed for this track.

  Access Paper or Ask Questions

Improving Stability of LS-GANs for Audio and Speech Signals

Aug 12, 2020
Mohammad Esmaeilpour, Raymel Alfonso Sallo, Olivier St-Georges, Patrick Cardinal, Alessandro Lameiras Koerich

In this paper we address the instability issue of generative adversarial network (GAN) by proposing a new similarity metric in unitary space of Schur decomposition for 2D representations of audio and speech signals. We show that encoding departure from normality computed in this vector space into the generator optimization formulation helps to craft more comprehensive spectrograms. We demonstrate the effectiveness of binding this metric for enhancing stability in training with less mode collapse compared to baseline GANs. Experimental results on subsets of UrbanSound8k and Mozilla common voice datasets have shown considerable improvements on the quality of the generated samples measured by the Fr\'echet inception distance. Moreover, reconstructed signals from these samples, have achieved higher signal to noise ratio compared to regular LS-GANs.

* 10 pages 

  Access Paper or Ask Questions

Machine Learning Approaches for Amharic Parts-of-speech Tagging

Jan 10, 2020
Ibrahim Gashaw, H L. Shashirekha

Part-of-speech (POS) tagging is considered as one of the basic but necessary tools which are required for many Natural Language Processing (NLP) applications such as word sense disambiguation, information retrieval, information processing, parsing, question answering, and machine translation. Performance of the current POS taggers in Amharic is not as good as that of the contemporary POS taggers available for English and other European languages. The aim of this work is to improve POS tagging performance for the Amharic language, which was never above 91%. Usage of morphological knowledge, an extension of the existing annotated data, feature extraction, parameter tuning by applying grid search and the tagging algorithms have been examined and obtained significant performance difference from the previous works. We have used three different datasets for POS experiments.

* 15th International Conference on Natural Language Processing (ICON-2018) 

  Access Paper or Ask Questions

Training Augmentation with Adversarial Examples for Robust Speech Recognition

Jun 17, 2018
Sining Sun, Ching-Feng Yeh, Mari Ostendorf, Mei-Yuh Hwang, Lei Xie

This paper explores the use of adversarial examples in training speech recognition systems to increase robustness of deep neural network acoustic models. During training, the fast gradient sign method is used to generate adversarial examples augmenting the original training data. Different from conventional data augmentation based on data transformations, the examples are dynamically generated based on current acoustic model parameters. We assess the impact of adversarial data augmentation in experiments on the Aurora-4 and CHiME-4 single-channel tasks, showing improved robustness against noise and channel variation. Further improvement is obtained when combining adversarial examples with teacher/student training, leading to a 23% relative word error rate reduction on Aurora-4.

  Access Paper or Ask Questions

Part-of-Speech Relevance Weights for Learning Word Embeddings

Mar 24, 2016
Quan Liu, Zhen-Hua Ling, Hui Jiang, Yu Hu

This paper proposes a model to learn word embeddings with weighted contexts based on part-of-speech (POS) relevance weights. POS is a fundamental element in natural language. However, state-of-the-art word embedding models fail to consider it. This paper proposes to use position-dependent POS relevance weighting matrices to model the inherent syntactic relationship among words within a context window. We utilize the POS relevance weights to model each word-context pairs during the word embedding training process. The model proposed in this paper paper jointly optimizes word vectors and the POS relevance matrices. Experiments conducted on popular word analogy and word similarity tasks all demonstrated the effectiveness of the proposed method.

* Word embeddings 

  Access Paper or Ask Questions

Unsupervised Quantized Prosody Representation for Controllable Speech Synthesis

Apr 07, 2022
Yutian Wang, Yuankun Xie, Kun Zhao, Hui Wang, Qin Zhang

In this paper, we propose a novel prosody disentangle method for prosodic Text-to-Speech (TTS) model, which introduces the vector quantization (VQ) method to the auxiliary prosody encoder to obtain the decomposed prosody representations in an unsupervised manner. Rely on its advantages, the speaking styles, such as pitch, speaking velocity, local pitch variance, etc., are decomposed automatically into the latent quantize vectors. We also investigate the internal mechanism of VQ disentangle process by means of a latent variables counter and find that higher value dimensions usually represent prosody information. Experiments show that our model can control the speaking styles of synthesis results by directly manipulating the latent variables. The objective and subjective evaluations illustrated that our model outperforms the popular models.

* accepted by IEEE International Conference on Multimedia and Expo 2022 (ICME2022) 

  Access Paper or Ask Questions