"speech": models, code, and papers

Effective Sentence Scoring Method using Bidirectional Language Model for Speech Recognition

May 16, 2019
Joongbo Shin, Yoonhyung Lee, Kyomin Jung

In automatic speech recognition, many studies have shown performance improvements using language models (LMs). Recent studies have tried to use bidirectional LMs (biLMs) instead of conventional unidirectional LMs (uniLMs) for rescoring the $N$-best list decoded from the acoustic model. In spite of their theoretical benefits, the biLMs have not given notable improvements compared to the uniLMs in their experiments. This is because their biLMs do not consider the interaction between the two directions. In this paper, we propose a novel sentence scoring method considering the interaction between the past and the future words on the biLM. Our experimental results on the LibriSpeech corpus show that the biLM with the proposed sentence scoring outperforms the uniLM for the $N$-best list rescoring, consistently and significantly in all experimental conditions. The analysis of WERs by word position demonstrates that the biLM is more robust than the uniLM especially when a recognized sentence is short or a misrecognized word is at the beginning of the sentence.

* submitted to INTERSPEECH 2019, 5 pages 
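
The rescoring idea in the abstract above can be illustrated with a minimal sketch. It assumes each hypothesis in the $N$-best list carries an acoustic log-score, and that a bidirectional LM exposes a function returning log P(word | past words, future words). The names `sentence_score`, `rescore`, `toy_log_prob`, and the `lm_weight` parameter are hypothetical, and the toy scorer is only a stand-in for a trained biLM, not the authors' model.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: log P(word | left context, right context).
BiScorer = Callable[[Sequence[str], str, Sequence[str]], float]


def sentence_score(words: Sequence[str], log_prob: BiScorer) -> float:
    """Sum per-word log-probabilities, each conditioned on past AND future words."""
    return sum(
        log_prob(words[:i], words[i], words[i + 1:]) for i in range(len(words))
    )


def rescore(
    nbest: List[Tuple[str, float]], log_prob: BiScorer, lm_weight: float = 0.5
) -> List[Tuple[str, float]]:
    """Sort hypotheses by acoustic score + lm_weight * LM score, best first."""
    scored = []
    for text, acoustic in nbest:
        total = acoustic + lm_weight * sentence_score(text.split(), log_prob)
        scored.append((text, total))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy stand-in for a trained biLM (assumption): a word is more likely when it
# forms a known pair with its left and/or right neighbor.
KNOWN_PAIRS = {
    ("recognize", "speech"),
    ("wreck", "a"),
    ("a", "nice"),
    ("nice", "beach"),
}


def toy_log_prob(left: Sequence[str], word: str, right: Sequence[str]) -> float:
    hits = 0
    if left and (left[-1], word) in KNOWN_PAIRS:
        hits += 1
    if right and (word, right[0]) in KNOWN_PAIRS:
        hits += 1
    return math.log((1 + hits) / 4)  # more contextual support, higher log-prob


if __name__ == "__main__":
    nbest = [("wreck a nice beach", -10.0), ("recognize speech", -10.2)]
    for text, score in rescore(nbest, toy_log_prob):
        print(f"{score:8.3f}  {text}")
```

A unidirectional scorer would ignore the `right` argument; the sketch only shows where the future context enters the sentence score.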

Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation

Aug 16, 2021
Sarala Padi, Seyed Omid Sadjadi, Dinesh Manocha, Ram D. Sriram

Automatic speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction. One of the main challenges in SER is data scarcity, i.e., insufficient amounts of carefully labeled data to build and fully explore complex deep learning models for emotion classification. This paper aims to address this challenge using a transfer learning strategy combined with spectrogram augmentation. Specifically, we propose a transfer learning approach that leverages a pre-trained residual network (ResNet) model including a statistics pooling layer from speaker recognition trained using large amounts of speaker-labeled data. The statistics pooling layer enables the model to efficiently process variable-length input, thereby eliminating the need for sequence truncation which is commonly used in SER systems. In addition, we adopt a spectrogram augmentation technique to generate additional training data samples by applying random time-frequency masks to log-mel spectrograms to mitigate overfitting and improve the generalization of emotion recognition models. We evaluate the effectiveness of our proposed approach on the interactive emotional dyadic motion capture (IEMOCAP) dataset. Experimental results indicate that the transfer learning and spectrogram augmentation approaches improve the SER performance, and when combined achieve state-of-the-art results.

* Accepted at ACM/SIGCHI ICMI'21 
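
Two ingredients named in the abstract above, random time-frequency masking and statistics pooling, can be sketched in plain Python. This is an illustrative sketch under assumptions: the function names, the mask-width parameters, and the fill value are hypothetical, and the pooling shown (per-dimension mean and standard deviation over time) is a common formulation rather than a confirmed detail of the authors' network.

```python
import random
import statistics
from typing import List

Spectrogram = List[List[float]]  # rows = mel bins, columns = time frames


def mask_spectrogram(
    spec: Spectrogram,
    max_freq_width: int,
    max_time_width: int,
    rng: random.Random,
    fill: float = 0.0,
) -> Spectrogram:
    """Return a copy with one random frequency band and one time span masked."""
    n_freq, n_time = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # copy so the input is left untouched

    f_width = rng.randint(0, min(max_freq_width, n_freq))
    f_start = rng.randint(0, n_freq - f_width)
    t_width = rng.randint(0, min(max_time_width, n_time))
    t_start = rng.randint(0, n_time - t_width)

    for f in range(f_start, f_start + f_width):  # mask whole mel bins
        out[f] = [fill] * n_time
    for row in out:  # mask whole time frames
        for t in range(t_start, t_start + t_width):
            row[t] = fill
    return out


def statistics_pooling(frames: List[List[float]]) -> List[float]:
    """Pool any number of frame-level feature vectors (time x dims) into one
    fixed-length vector: per-dimension means followed by standard deviations."""
    dims = list(zip(*frames))
    means = [statistics.fmean(d) for d in dims]
    stds = [statistics.pstdev(d) for d in dims]
    return means + stds


if __name__ == "__main__":
    rng = random.Random(0)
    spec = [[1.0] * 6 for _ in range(4)]
    for row in mask_spectrogram(spec, max_freq_width=2, max_time_width=3, rng=rng):
        print(row)
    print(statistics_pooling([[1.0, 2.0], [3.0, 4.0]]))
```

Because the pooled vector has twice the feature dimension regardless of how many frames go in, utterances of different lengths need no truncation.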

Reliable Part-of-Speech Tagging of Historical Corpora through Set-Valued Prediction

Aug 04, 2020
Stefan Heid, Marcel Wever, Eyke Hüllermeier

Syntactic annotation of corpora in the form of part-of-speech (POS) tags is a key requirement for both linguistic research and subsequent automated natural language processing (NLP) tasks. This problem is commonly tackled using machine learning methods, i.e., by training a POS tagger on a sufficiently large corpus of labeled data. While the problem of POS tagging can essentially be considered as solved for modern languages, historical corpora turn out to be much more difficult, especially due to the lack of native speakers and sparsity of training data. Moreover, most texts have no sentences as we know them today, nor a common orthography. These irregularities render the task of automated POS tagging more difficult and error-prone. Under these circumstances, instead of forcing the POS tagger to predict and commit to a single tag, it should be enabled to express its uncertainty. In this paper, we consider POS tagging within the framework of set-valued prediction, which allows the POS tagger to express its uncertainty via predicting a set of candidate POS tags instead of guessing a single one. The goal is to guarantee a high confidence that the correct POS tag is included while keeping the number of candidates small. In our experimental study, we find that extending state-of-the-art POS taggers to set-valued prediction yields more precise and robust taggings, especially for unknown words, i.e., words not occurring in the training data.

* 14 pages, 8 figures 
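
The trade-off described in the abstract above (include the correct tag with high confidence while keeping the candidate set small) can be illustrated with one simple rule: take tags in order of decreasing probability until their cumulative probability reaches a threshold. This rule is an assumption chosen for illustration, not necessarily the set-valued predictor used in the paper, and `candidate_tags` and `confidence` are hypothetical names.

```python
from typing import Dict, List


def candidate_tags(tag_probs: Dict[str, float], confidence: float = 0.9) -> List[str]:
    """Return the smallest set of most-probable tags whose total probability
    reaches `confidence` (or all tags if the threshold is never reached)."""
    ranked = sorted(tag_probs.items(), key=lambda kv: kv[1], reverse=True)
    chosen: List[str] = []
    mass = 0.0
    for tag, prob in ranked:
        chosen.append(tag)
        mass += prob
        if mass >= confidence:
            break
    return chosen


if __name__ == "__main__":
    # A confident tagger commits to one tag; an uncertain one returns several.
    print(candidate_tags({"NOUN": 0.95, "VERB": 0.03, "ADJ": 0.02}))
    print(candidate_tags({"NOUN": 0.50, "VERB": 0.45, "ADJ": 0.05}))
```

For a word unseen in training, the tagger's distribution is typically flatter, so the returned set grows instead of a single, possibly wrong, tag being forced.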

Separate but Together: Unsupervised Federated Learning for Speech Enhancement from Non-IID Data

May 11, 2021
Efthymios Tzinis, Jonah Casebeer, Zhepei Wang, Paris Smaragdis

We propose FEDENHANCE, an unsupervised federated learning (FL) approach for speech enhancement and separation with non-IID distributed data across multiple clients. We simulate a real-world scenario where each client only has access to a few noisy recordings from a limited and disjoint number of speakers (hence non-IID). Each client trains their model in isolation using mixture invariant training while periodically providing updates to a central server. Our experiments show that our approach achieves competitive enhancement performance compared to IID training on a single device and that we can further improve convergence speed and overall performance using transfer learning on the server side. Moreover, we show that we can effectively combine updates from clients trained locally with supervised and unsupervised losses. We also release a new dataset, LibriFSD50K, and its creation recipe in order to facilitate FL research for source separation problems.

* Submitted to WASPAA 21 
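
The server-side step mentioned in the abstract above, combining periodic updates from clients, can be sketched as a weighted parameter average. This is the generic federated averaging rule, used here as an assumption; the abstract does not state the exact aggregation rule. The names `federated_average`, `Params`, and `client_weights` are hypothetical.

```python
from typing import Dict, List

Params = Dict[str, List[float]]  # parameter name -> flattened values


def federated_average(
    client_params: List[Params], client_weights: List[float]
) -> Params:
    """Average each parameter across clients, weighting each client by
    `client_weights` (for example, its number of local training examples)."""
    total = sum(client_weights)
    averaged: Params = {}
    for name, reference in client_params[0].items():
        averaged[name] = [
            sum(w * p[name][i] for p, w in zip(client_params, client_weights)) / total
            for i in range(len(reference))
        ]
    return averaged


if __name__ == "__main__":
    clients = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
    print(federated_average(clients, [1.0, 1.0]))  # equal weighting
    print(federated_average(clients, [1.0, 3.0]))  # second client counts more
```

Only model parameters travel to the server in this scheme; each client's noisy recordings stay on the device.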

Emotion Recognition System from Speech and Visual Information based on Convolutional Neural Networks

Feb 29, 2020
Nicolae-Catalin Ristea, Liviu Cristian Dutu, Anamaria Radoi

Emotion recognition has become an important field of research in the human-computer interaction domain. The latest advancements in the field show that combining visual with audio information leads to better results than using a single source of information. From a visual point of view, a human emotion can be recognized by analyzing the facial expression of the person. More precisely, the human emotion can be described through a combination of several Facial Action Units. In this paper, we propose a system that is able to recognize emotions with a high accuracy rate and in real time, based on deep Convolutional Neural Networks. In order to increase the accuracy of the recognition system, we also analyze the speech data and fuse the information coming from both sources, i.e., visual and audio. Experimental results show the effectiveness of the proposed scheme for emotion recognition and the importance of combining visual with audio data.

M3ER: Multiplicative Multimodal Emotion Recognition Using Facial, Textual, and Speech Cues

Nov 22, 2019
Trisha Mittal, Uttaran Bhattacharya, Rohan Chandra, Aniket Bera, Dinesh Manocha

We present M3ER, a learning-based method for emotion recognition from multiple input modalities. Our approach combines cues from multiple co-occurring modalities (such as face, text, and speech) and is also more robust than other methods to sensor noise in any of the individual modalities. M3ER models a novel, data-driven multiplicative fusion method to combine the modalities, which learns to emphasize the more reliable cues and suppress others on a per-sample basis. By introducing a check step which uses Canonical Correlational Analysis to differentiate between ineffective and effective modalities, M3ER is robust to sensor noise. M3ER also generates proxy features in place of the ineffectual modalities. We demonstrate the efficiency of our network through experimentation on two benchmark datasets, IEMOCAP and CMU-MOSEI. We report a mean accuracy of 82.7% on IEMOCAP and 89.0% on CMU-MOSEI, which, collectively, is an improvement of about 5% over prior work.
