Najim Dehak

Focus on the present: a regularization method for the ASR source-target attention layer

Nov 02, 2020

Nanxin Chen, Piotr Żelasko, Jesús Villalba, Najim Dehak

Abstract: This paper introduces a novel method to diagnose the source-target attention in state-of-the-art end-to-end speech recognition models with joint connectionist temporal classification (CTC) and attention training. Our method is based on the fact that both CTC and source-target attention act on the same encoder representations. To understand the functionality of the attention, CTC is applied to compute the token posteriors given the attention outputs. We found that the source-target attention heads are able to predict several tokens ahead of the current one. Inspired by this observation, we propose a new regularization method that leverages CTC to make source-target attention more focused on the frames corresponding to the output token being predicted by the decoder. Experiments reveal stable relative improvements of up to 7% and 13% with the proposed regularization on TED-LIUM 2 and LibriSpeech.

* submitted to ICASSP2021. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

CopyPaste: An Augmentation Method for Speech Emotion Recognition

Oct 27, 2020

Raghavendra Pappagari, Jesús Villalba, Piotr Żelasko, Laureano Moro-Velazquez, Najim Dehak

Abstract: Data augmentation is a widely used strategy for training robust machine learning models. It partially alleviates the problem of limited data for tasks like speech emotion recognition (SER), where collecting data is expensive and challenging. This study proposes CopyPaste, a novel, perceptually motivated augmentation procedure for SER. Assuming that the presence of emotions other than neutral dictates a speaker's overall perceived emotion in a recording, the concatenation of an emotional (emotion E) and a neutral utterance can still be labeled with emotion E. We hypothesize that SER performance can be improved using these concatenated utterances in model training. To verify this, three CopyPaste schemes are tested on two deep learning models: one trained independently and another using transfer learning from an x-vector model, a speaker recognition model. We observed that all three CopyPaste schemes improve SER performance on all three datasets considered: MSP-Podcast, Crema-D, and IEMOCAP. Additionally, CopyPaste performs better than noise augmentation, and using them together improves SER performance further. Our experiments on noisy test sets suggest that CopyPaste is effective even in noisy test conditions.
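
As a rough illustration of the concatenation idea described in this abstract, the sketch below joins an emotional and a neutral waveform and keeps the emotional label. The function name `copy_paste`, the `emotional_first` flag, and the toy arrays are assumptions made for illustration; the abstract does not spell out the three CopyPaste schemes, and this is not the authors' implementation.

```python
import numpy as np


def copy_paste(emotional, neutral, label, emotional_first=True):
    """Concatenate an emotional and a neutral waveform (1-D arrays).

    The result keeps the label of the emotional utterance, following the
    assumption stated in the abstract. `emotional_first` is a hypothetical
    switch that controls the order of the two segments.
    """
    parts = (emotional, neutral) if emotional_first else (neutral, emotional)
    return np.concatenate(parts), label


# Toy stand-ins for real audio: 4 "emotional" samples, 3 "neutral" samples.
emotional = np.ones(4, dtype=np.float32)
neutral = np.zeros(3, dtype=np.float32)
augmented, augmented_label = copy_paste(emotional, neutral, "angry")
```

In practice both waveforms would need the same sampling rate before being joined; that check is left out of the sketch.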

* Under ICASSP2021 peer-review

How Phonotactics Affect Multilingual and Zero-shot ASR Performance

Oct 22, 2020

Siyuan Feng, Piotr Żelasko, Laureano Moro-Velázquez, Ali Abavisani, Mark Hasegawa-Johnson, Odette Scharenborg, Najim Dehak

Abstract: The idea of combining multiple languages' recordings to train a single automatic speech recognition (ASR) model brings the promise of the emergence of universal speech representation. Recently, a Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training. However, the representations it learned were not successful in zero-shot transfer to unseen languages. Because that model lacks an explicit factorization of the acoustic model (AM) and language model (LM), it is unclear to what degree the performance suffered from differences in pronunciation or the mismatch in phonotactics. To gain more insight into the factors limiting zero-shot ASR transfer, we replace the encoder-decoder with a hybrid ASR system consisting of a separate AM and LM. Then, we perform an extensive evaluation of monolingual, multilingual, and crosslingual (zero-shot) acoustic and language models on a set of 13 phonetically diverse languages. We show that the gain from modeling crosslingual phonotactics is limited, and imposing too strong a model can hurt the zero-shot transfer. Furthermore, we find that a multilingual LM hurts a multilingual ASR system's performance, and retaining only the target language's phonotactic data in LM training is preferable.

* Submitted to ICASSP 2021. The first 2 authors contributed equally to this work

Learning Speaker Embedding from Text-to-Speech

Oct 21, 2020

Jaejin Cho, Piotr Zelasko, Jesus Villalba, Shinji Watanabe, Najim Dehak

Abstract: Zero-shot multi-speaker Text-to-Speech (TTS) generates target speaker voices given an input text and the corresponding speaker embedding. In this work, we investigate the effectiveness of the TTS reconstruction objective to improve representation learning for speaker verification. We jointly trained end-to-end Tacotron 2 TTS and speaker embedding networks in a self-supervised fashion. We hypothesize that the embeddings will contain minimal phonetic information since the TTS decoder will obtain that information from the textual input. TTS reconstruction can also be combined with speaker classification to enhance these embeddings further. Once trained, the speaker encoder computes representations for the speaker verification task, while the rest of the TTS blocks are discarded. We investigated training TTS from either manual or ASR-generated transcripts. The latter allows us to train embeddings on datasets without manual transcripts. We compared ASR transcripts and Kaldi phone alignments as TTS inputs, showing that the latter performed better due to their finer resolution. Unsupervised TTS embeddings improved EER by 2.06% absolute compared with i-vectors on the LibriTTS dataset. TTS with speaker classification loss improved EER by 0.28% and 0.73% absolute over a model using only speaker classification loss on LibriTTS and Voxceleb1, respectively.

Self-Expressing Autoencoders for Unsupervised Spoken Term Discovery

Jul 26, 2020

Saurabhchand Bhati, Jesús Villalba, Piotr Żelasko, Najim Dehak

Abstract: Unsupervised spoken term discovery consists of two tasks: finding the acoustic segment boundaries and labeling acoustically similar segments with the same labels. We perform segmentation based on the assumption that the frame feature vectors are more similar within a segment than across the segments. Therefore, for strong segmentation performance, it is crucial that the features represent the phonetic properties of a frame more than other factors of variability. We achieve this via a self-expressing autoencoder framework. It consists of a single encoder and two decoders with shared weights. The encoder projects the input features into a latent representation. One of the decoders tries to reconstruct the input from these latent representations and the other from the self-expressed version of them. We use the obtained features to segment and cluster the speech data. We evaluate the performance of the proposed method in the Zero Resource 2020 challenge unit discovery task. The proposed system consistently outperforms the baseline, demonstrating the usefulness of the method in learning representations.

That Sounds Familiar: an Analysis of Phonetic Representations Transfer Across Languages

May 16, 2020

Piotr Żelasko, Laureano Moro-Velázquez, Mark Hasegawa-Johnson, Odette Scharenborg, Najim Dehak

Abstract: Only a handful of the world's languages are abundant with the resources that enable practical applications of speech processing technologies. One of the methods to overcome this problem is to use the resources existing in other languages to train a multilingual automatic speech recognition (ASR) model, which, intuitively, should learn some universal phonetic representations. In this work, we focus on gaining a deeper understanding of how general these representations might be, and how the recognition of individual phones improves in a multilingual setting. To that end, we select a phonetically diverse set of languages, and perform a series of monolingual, multilingual and crosslingual (zero-shot) experiments. The ASR is trained to recognize the International Phonetic Alphabet (IPA) token sequences. We observe significant improvements across all languages in the multilingual setting, and stark degradation in the crosslingual setting, where the model, among other errors, considers Javanese as a tone language. Notably, as little as 10 hours of the target language training data tremendously reduces ASR error rates. Our analysis uncovered that even the phones that are unique to a single language can benefit greatly from adding training data from other languages, an encouraging result for the low-resource speech community.

* Submitted to Interspeech 2020. For some reason, the ArXiv Latex engine rendered it in more than 4 pages

Punctuation Prediction in Spontaneous Conversations: Can We Mitigate ASR Errors with Retrofitted Word Embeddings?

Apr 13, 2020

Łukasz Augustyniak, Piotr Szymanski, Mikołaj Morzy, Piotr Zelasko, Adrian Szymczak, Jan Mizgajski, Yishay Carmiel, Najim Dehak

Abstract: Automatic Speech Recognition (ASR) systems introduce word errors, which often confuse punctuation prediction models, turning punctuation restoration into a challenging task. These errors usually take the form of homonyms. We show how retrofitting of the word embeddings on the domain-specific data can mitigate ASR errors. Our main contribution is a method for better alignment of homonym embeddings and the validation of the presented method on the punctuation prediction task. We record an absolute improvement in punctuation prediction accuracy ranging from 6.2% (for question marks) to 9% (for periods) when compared with the state-of-the-art model.

* submitted to INTERSPEECH'20

x-vectors meet emotions: A study on dependencies between emotion and speaker recognition

Feb 12, 2020

Raghavendra Pappagari, Tianzi Wang, Jesus Villalba, Nanxin Chen, Najim Dehak

Abstract: In this work, we explore the dependencies between speaker recognition and emotion recognition. We first show that knowledge learned for speaker recognition can be reused for emotion recognition through transfer learning. Then, we show the effect of emotion on speaker recognition. For emotion recognition, we show that using a simple linear model is enough to obtain good performance on the features extracted from pre-trained models such as the x-vector model. Then, we improve emotion recognition performance by fine-tuning for emotion classification. We evaluated our experiments on three different types of datasets: IEMOCAP, MSP-Podcast, and Crema-D. By fine-tuning, we obtained 30.40%, 7.99%, and 8.61% absolute improvement on IEMOCAP, MSP-Podcast, and Crema-D, respectively, over a baseline model with no pre-training. Finally, we present results on the effect of emotion on speaker verification. We observed that speaker verification performance is prone to changes in test speaker emotions. We found that trials with angry utterances performed worst in all three datasets. We hope our analysis will initiate a new line of research in the speaker recognition community.

* 45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020

Non-Autoregressive Transformer Automatic Speech Recognition

Nov 10, 2019

Nanxin Chen, Shinji Watanabe, Jesús Villalba, Najim Dehak

Abstract: Recently, very deep transformers have started to outperform traditional bi-directional long short-term memory networks by a large margin. However, for production use, inference computation cost and latency remain serious concerns in real scenarios. In this paper, we study a novel non-autoregressive transformer structure for speech recognition, which was originally introduced in machine translation. During training, input tokens fed to the decoder are randomly replaced by a special mask token. The network is required to predict those mask tokens by taking both context and input speech into consideration. During inference, we start from all mask tokens and the network gradually predicts all tokens based on partial results. We show this framework can support different decoding strategies, including traditional left-to-right. A new decoding strategy is proposed as an example, which proceeds from the easiest predictions to the most difficult ones. Some preliminary results on the Aishell and CSJ benchmarks show that it is possible to train such a non-autoregressive network for ASR. On Aishell in particular, the proposed method outperformed the Kaldi nnet3 and chain model setup and is quite close to the performance of the state-of-the-art end-to-end model.
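
The training-time masking mentioned in this abstract can be sketched roughly as follows. The mask symbol `MASK`, the function name `mask_tokens`, and the per-token masking probability are all assumptions for illustration; the abstract does not say how the masked positions are sampled, so this is only one plausible reading, not the authors' procedure.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol, not taken from the paper


def mask_tokens(tokens, mask_prob, rng):
    """Replace each token with MASK independently with probability mask_prob.

    Returns the masked sequence and the positions that were masked, which
    are the positions a network would be asked to predict.
    """
    positions = [i for i in range(len(tokens)) if rng.random() < mask_prob]
    masked = list(tokens)
    for i in positions:
        masked[i] = MASK
    return masked, positions


rng = random.Random(0)
tokens = ["he", "llo", "wor", "ld"]
masked, positions = mask_tokens(tokens, 0.5, rng)
```

Setting `mask_prob` to 1.0 yields the all-mask sequence that the abstract describes as the starting point for inference.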

Hierarchical Transformers for Long Document Classification

Oct 23, 2019

Raghavendra Pappagari, Piotr Żelasko, Jesús Villalba, Yishay Carmiel, Najim Dehak

Abstract: BERT, which stands for Bidirectional Encoder Representations from Transformers, is a recently introduced language representation model based upon the transfer learning paradigm. We extend its fine-tuning procedure to address one of its major limitations: applicability to inputs longer than a few hundred words, such as transcripts of human call conversations. Our method is conceptually simple. We segment the input into smaller chunks and feed each of them into the base model. Then, we propagate each output through a single recurrent layer, or another transformer, followed by a softmax activation. We obtain the final classification decision after the last segment has been consumed. We show that both BERT extensions are quick to fine-tune and converge after as little as 1 epoch of training on a small, domain-specific data set. We successfully apply them in three different tasks involving customer call satisfaction prediction and topic classification, and obtain a significant improvement over the baseline models in two of them.
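
The segmentation step in this abstract might look roughly like the sketch below, which splits a token sequence into consecutive fixed-size chunks. The function name `split_into_chunks` and the chunk size of 4 are assumptions for illustration; the abstract gives no chunk size, and the base model and the recurrent or transformer layer on top are not shown.

```python
def split_into_chunks(tokens, chunk_size):
    """Split a token list into consecutive chunks of at most chunk_size tokens.

    The last chunk may be shorter. Each chunk would then be fed to the base
    model separately, as the abstract describes.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


# Toy "document" of 10 token ids split into chunks of 4.
chunks = split_into_chunks(list(range(10)), 4)
```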

* Automatic Speech Recognition and Understanding Workshop, 2019
* 4 figures, 7 pages
