"speech": models, code, and papers

Does Speech enhancement of publicly available data help build robust Speech Recognition Systems?

Oct 29, 2019
Bhavya Ghai, Buvana Ramanan, Klaus Mueller

Automatic speech recognition (ASR) systems play a key role in many commercial products, including voice assistants. Typically, they require large amounts of clean speech data for training, which gives an undue advantage to large organizations that have tons of private data. In this paper, we first curated a fairly big dataset using publicly available data sources. Thereafter, we investigated whether publicly available noisy data can be used to train robust ASR systems. We used speech enhancement to clean the noisy data first and then used it together with its cleaned version to train ASR systems. We found that using speech enhancement gives a 9.5% better word error rate than training on just noisy data and 9% better than training on just clean data. Its performance is also comparable to the ideal-case scenario of training on noisy data and its clean version.

* Accepted to the AAAI Conference on Artificial Intelligence 2020 (abstract)
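
The abstract above reports its results as word error rate (WER). As a reminder of what that metric measures, here is a minimal sketch of the standard definition: the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The function name and the whitespace tokenization are assumptions made for illustration; this is not the authors' scoring tool.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One of six reference words is missing from the hypothesis.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many inserted words.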

A Technical Report: BUT Speech Translation Systems

Oct 22, 2020
Hari Krishna Vydana, Lukas Burget, Jan Cernocky

The paper describes BUT's speech translation systems. The systems are English-to-German offline speech translation systems. The systems are based on our previous works \cite{Jointly_trained_transformers}. Though End-to-End and cascade (ASR-MT) spoken language translation (SLT) systems are reaching comparable performances, a large degradation is observed when translating ASR hypotheses compared to the oracle input text. To reduce this performance degradation, we have jointly trained ASR and MT modules with the ASR objective as an auxiliary loss. Both networks are connected through the neural hidden representations. This model has an End-to-End differentiable path with respect to the final objective function and also utilizes the ASR objective for better optimization. During inference, both modules (i.e., ASR and MT) are connected through the hidden representations corresponding to the n-best hypotheses. Ensembling with independently trained ASR and MT models has further improved the performance of the system.

Continuous Speech Recognition using EEG and Video

Dec 24, 2019
Gautam Krishna, Mason Carnahan, Co Tran, Ahmed H Tewfik

In this paper we investigate whether electroencephalography (EEG) features can be used to improve the performance of continuous visual speech recognition systems. We implemented a connectionist temporal classification (CTC) based end-to-end automatic speech recognition (ASR) model for performing recognition. Our results demonstrate that EEG features are helpful in enhancing the performance of continuous visual speech recognition systems.

* In preparation for submission to EUSIPCO 2020. arXiv admin note: text overlap with arXiv:1911.11610, arXiv:1911.04261
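
The abstract above mentions a connectionist temporal classification (CTC) based model. For readers unfamiliar with CTC, the following sketch shows the usual collapse rule applied to a per-frame label sequence in greedy decoding: merge consecutive repeats, then drop the blank symbol. The function name and the `"_"` blank symbol are assumptions for illustration and are not taken from the paper.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_labels, blank="_"):
    """Merge consecutive repeated labels, then remove blank symbols."""
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]


# A blank between the two "l" runs keeps the double letter in the output.
frames = list("__hh_e_ll_ll_oo__")
print("".join(ctc_greedy_collapse(frames)))  # prints "hello"
```

The blank symbol is what lets the output contain genuinely repeated labels: without a blank between them, two adjacent runs of the same label would be merged into one.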

Disentangling Active and Passive Cosponsorship in the U.S. Congress

May 19, 2022
Giuseppe Russo, Christoph Gote, Laurence Brandenberger, Sophia Schlosser, Frank Schweitzer

In the U.S. Congress, legislators can use active and passive cosponsorship to support bills. We show that these two types of cosponsorship are driven by two different motivations: the backing of political colleagues and the backing of the bill's content. To this end, we develop an Encoder+RGCN based model that learns legislator representations from bill texts and speech transcripts. These representations predict active and passive cosponsorship with an F1-score of 0.88. Applying our representations to predict voting decisions, we show that they are interpretable and generalize to unseen tasks.

* 20 pages, 10 figures, 6 tables

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

Oct 12, 2020
Jungil Kong, Jaehyeon Kim, Jaekyoung Bae

Several recent studies on speech synthesis have employed generative adversarial networks (GANs) to produce raw waveforms. Although such methods improve the sampling efficiency and memory usage, their sample quality has not yet reached that of autoregressive and flow-based generative models. In this study, we propose HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis. As speech audio consists of sinusoidal signals with various periods, we demonstrate that modeling periodic patterns of audio is crucial for enhancing sample quality. A subjective human evaluation (mean opinion score, MOS) of a single-speaker dataset indicates that our proposed method demonstrates similarity to human quality while generating 22.05 kHz high-fidelity audio 167.9 times faster than real time on a single V100 GPU. We further show the generality of HiFi-GAN to the mel-spectrogram inversion of unseen speakers and end-to-end speech synthesis. Finally, a small-footprint version of HiFi-GAN generates samples 13.4 times faster than real time on CPU with comparable quality to an autoregressive counterpart.
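
To make the speed figures in the abstract above concrete, the short sketch below converts a "times faster than real time" factor into samples generated per second and into wall-clock time for a clip. Only the sample rate and the two speed-up factors come from the abstract; the helper names and the 10-second clip length are assumptions for illustration.

```python
def samples_per_second(sample_rate_hz: float, speedup: float) -> float:
    """Audio samples generated per wall-clock second at a given speed-up."""
    return sample_rate_hz * speedup


def synthesis_time_seconds(audio_seconds: float, speedup: float) -> float:
    """Wall-clock time needed to synthesize `audio_seconds` of audio."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


SAMPLE_RATE_HZ = 22_050  # 22.05 kHz, as stated in the abstract
CLIP_SECONDS = 10.0      # hypothetical clip length, chosen for illustration

for name, speedup in [("V100 GPU", 167.9), ("CPU, small version", 13.4)]:
    rate = samples_per_second(SAMPLE_RATE_HZ, speedup)
    millis = synthesis_time_seconds(CLIP_SECONDS, speedup) * 1000
    print(f"{name}: {rate:,.0f} samples/s, {millis:.1f} ms per {CLIP_SECONDS:.0f} s clip")
```

Under these figures, the GPU result corresponds to roughly 3.7 million samples per second, or about 60 ms to synthesize a 10-second clip.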

Bias-Aware Loss for Training Image and Speech Quality Prediction Models from Multiple Datasets

Apr 20, 2021
Gabriel Mittag, Saman Zadtootaghaj, Thilo Michael, Babak Naderi, Sebastian Möller

The ground truth used for training image, video, or speech quality prediction models is based on the Mean Opinion Scores (MOS) obtained from subjective experiments. Usually, it is necessary to conduct multiple experiments, mostly with different test participants, to obtain enough data to train quality models based on machine learning. Each of these experiments is subject to an experiment-specific bias, where the rating of the same file may be substantially different in two experiments (e.g. depending on the overall quality distribution). These different ratings for the same distortion levels confuse neural networks during training and lead to lower performance. To overcome this problem, we propose a bias-aware loss function that estimates each dataset's biases during training with a linear function and considers it while optimising the network weights. We prove the efficiency of the proposed method by training and validating quality prediction models on synthetic and subjective image and speech quality datasets.

* Accepted at QoMEX 2021
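
The abstract above describes a loss that estimates each dataset's bias with a linear function and accounts for it during optimisation. The sketch below is one plausible reading of that idea, not the authors' implementation: for each dataset it fits a slope and offset that map the model's predictions onto that dataset's MOS ratings by ordinary least squares, then measures the squared error after applying the mapping. All function names, and the choice to fit the mapping inside the loss call, are assumptions.

```python
def fit_linear_bias(predictions, ratings):
    """Least-squares slope and offset mapping predictions onto ratings."""
    n = len(predictions)
    mean_p = sum(predictions) / n
    mean_r = sum(ratings) / n
    var_p = sum((p - mean_p) ** 2 for p in predictions)
    if var_p == 0:
        # All predictions identical: fall back to a pure offset correction.
        return 1.0, mean_r - mean_p
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predictions, ratings))
    slope = cov / var_p
    return slope, mean_r - slope * mean_p


def bias_aware_mse(predictions, ratings, dataset_ids):
    """Mean squared error after per-dataset linear bias correction."""
    groups = {}
    for p, r, d in zip(predictions, ratings, dataset_ids):
        preds, rates = groups.setdefault(d, ([], []))
        preds.append(p)
        rates.append(r)

    total = 0.0
    for preds, rates in groups.values():
        slope, offset = fit_linear_bias(preds, rates)
        total += sum((slope * p + offset - r) ** 2 for p, r in zip(preds, rates))
    return total / len(predictions)


# Dataset "B" rates every file 0.5 points higher than dataset "A".
preds = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0]
mos = [1.0, 2.0, 3.0, 4.0, 1.5, 2.5, 3.5, 4.5]
ids = ["A"] * 4 + ["B"] * 4
plain_mse = sum((p - r) ** 2 for p, r in zip(preds, mos)) / len(preds)
print(plain_mse, bias_aware_mse(preds, mos, ids))
```

In this toy case the plain mean squared error penalises the model for the constant offset in dataset "B", while the bias-corrected loss does not, which is the kind of experiment-specific disagreement the abstract says confuses training.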

Location, Location: Enhancing the Evaluation of Text-to-Speech Synthesis Using the Rapid Prosody Transcription Paradigm

Jul 06, 2021
Elijah Gutierrez, Pilar Oplustil-Gallegos, Catherine Lai

Text-to-Speech synthesis systems are generally evaluated using Mean Opinion Score (MOS) tests, where listeners score samples of synthetic speech on a Likert scale. A major drawback of MOS tests is that they only offer a general measure of overall quality (i.e., the naturalness of an utterance) and so cannot tell us where exactly synthesis errors occur. This can make evaluation of the appropriateness of prosodic variation within utterances inconclusive. To address this, we propose a novel evaluation method based on the Rapid Prosody Transcription paradigm. This allows listeners to mark the locations of errors in an utterance in real time, providing a probabilistic representation of the perceptual errors that occur in the synthetic signal. We conduct experiments that confirm that the fine-grained evaluation can be mapped to system rankings of standard MOS tests, but the error marking gives a much more comprehensive assessment of synthesized prosody. In particular, for standard audiobook test set samples, we see that error marks consistently cluster around words at major prosodic boundaries indicated by punctuation. However, for question-answer based stimuli, where we control information structure, we see differences emerge in the ability of neural TTS systems to generate context-appropriate prosodic prominence.

* Accepted to Speech Synthesis Workshop 2019: https://ssw11.hte.hu/en/

Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation

Aug 16, 2021
Sarala Padi, Seyed Omid Sadjadi, Dinesh Manocha, Ram D. Sriram

Automatic speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction. One of the main challenges in SER is data scarcity, i.e., insufficient amounts of carefully labeled data to build and fully explore complex deep learning models for emotion classification. This paper aims to address this challenge using a transfer learning strategy combined with spectrogram augmentation. Specifically, we propose a transfer learning approach that leverages a pre-trained residual network (ResNet) model including a statistics pooling layer from speaker recognition trained using large amounts of speaker-labeled data. The statistics pooling layer enables the model to efficiently process variable-length input, thereby eliminating the need for sequence truncation which is commonly used in SER systems. In addition, we adopt a spectrogram augmentation technique to generate additional training data samples by applying random time-frequency masks to log-mel spectrograms to mitigate overfitting and improve the generalization of emotion recognition models. We evaluate the effectiveness of our proposed approach on the interactive emotional dyadic motion capture (IEMOCAP) dataset. Experimental results indicate that the transfer learning and spectrogram augmentation approaches improve the SER performance, and when combined achieve state-of-the-art results.

* Accepted at ACM/SIGCHI ICMI'21
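
The abstract above describes generating extra training samples by applying random time-frequency masks to log-mel spectrograms. A minimal sketch of that kind of masking follows, assuming one frequency mask and one time mask per call, a spectrogram stored as a list of mel-band rows, and masked cells set to a fixed fill value. The function name, parameters, and mask policy are illustrative assumptions, not the paper's configuration.

```python
import random


def mask_spectrogram(spec, max_freq_width, max_time_width, rng, fill=0.0):
    """Return a copy of `spec` with one random frequency band and one
    random time span overwritten by `fill`.

    `spec` is a list of rows, one per mel band, each holding one value
    per time frame.
    """
    if not spec or not spec[0]:
        raise ValueError("spectrogram must be non-empty")
    n_mels, n_frames = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # leave the input untouched

    # Frequency mask: blank out f_width consecutive mel bands.
    f_width = rng.randint(0, min(max_freq_width, n_mels))
    f_start = rng.randint(0, n_mels - f_width)
    for f in range(f_start, f_start + f_width):
        out[f] = [fill] * n_frames

    # Time mask: blank out t_width consecutive frames in every band.
    t_width = rng.randint(0, min(max_time_width, n_frames))
    t_start = rng.randint(0, n_frames - t_width)
    for row in out:
        for t in range(t_start, t_start + t_width):
            row[t] = fill
    return out


rng = random.Random(0)  # seeded so the example is reproducible
spectrogram = [[1.0] * 10 for _ in range(8)]  # 8 mel bands, 10 frames
augmented = mask_spectrogram(spectrogram, max_freq_width=2, max_time_width=3, rng=rng)
for row in augmented:
    print(row)
```

Because the mask positions and widths are drawn afresh on every call, each pass over the training data can present a differently masked version of the same utterance.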

Measuring Forgetting of Memorized Training Examples

Jun 30, 2022
Matthew Jagielski, Om Thakkar, Florian Tramèr, Daphne Ippolito, Katherine Lee, Nicholas Carlini, Eric Wallace, Shuang Song, Abhradeep Thakurta, Nicolas Papernot, Chiyuan Zhang

Machine learning models exhibit two seemingly contradictory phenomena: training data memorization and various forms of forgetting. In memorization, models overfit specific training examples and become susceptible to privacy attacks. In forgetting, examples which appeared early in training are forgotten by the end. In this work, we connect these phenomena. We propose a technique to measure to what extent models "forget" the specifics of training examples, becoming less susceptible to privacy attacks on examples they have not seen recently. We show that, while non-convexity can prevent forgetting from happening in the worst case, standard image and speech models empirically do forget examples over time. We identify nondeterminism as a potential explanation, showing that deterministically trained models do not forget. Our results suggest that examples seen early when training with extremely large datasets (for instance, those examples used to pre-train a model) may observe privacy benefits at the expense of examples seen later.

* 19 pages, 7 figures
