Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

"speech": models, code, and papers

EEG based Continuous Speech Recognition using Transformers

Dec 31, 2019
Gautam Krishna, Co Tran, Mason Carnahan, Ahmed H Tewfik

Figure 1 for EEG based Continuous Speech Recognition using Transformers

Figure 2 for EEG based Continuous Speech Recognition using Transformers

Figure 3 for EEG based Continuous Speech Recognition using Transformers

Figure 4 for EEG based Continuous Speech Recognition using Transformers

In this paper we investigate continuous speech recognition using electroencephalography (EEG) features using recently introduced end-to-end transformer based automatic speech recognition (ASR) model. Our results show that transformer based model demonstrate faster inference and training compared to recurrent neural network (RNN) based sequence-to-sequence EEG models but performance of the RNN based models were better than transformer based model during test time on a limited English vocabulary.

* Work in progress for submission to EUSIPCO 2020

Via

Access Paper or Ask Questions

Nonwords Pronunciation Classification in Language Development Tests for Preschool Children

Jun 17, 2022
Ilja Baumann, Dominik Wagner, Sebastian Bayerl, Tobias Bocklet

Figure 1 for Nonwords Pronunciation Classification in Language Development Tests for Preschool Children

Figure 2 for Nonwords Pronunciation Classification in Language Development Tests for Preschool Children

Figure 3 for Nonwords Pronunciation Classification in Language Development Tests for Preschool Children

Figure 4 for Nonwords Pronunciation Classification in Language Development Tests for Preschool Children

This work aims to automatically evaluate whether the language development of children is age-appropriate. Validated speech and language tests are used for this purpose to test the auditory memory. In this work, the task is to determine whether spoken nonwords have been uttered correctly. We compare different approaches that are motivated to model specific language structures: Low-level features (FFT), speaker embeddings (ECAPA-TDNN), grapheme-motivated embeddings (wav2vec 2.0), and phonetic embeddings in form of senones (ASR acoustic model). Each of the approaches provides input for VGG-like 5-layer CNN classifiers. We also examine the adaptation per nonword. The evaluation of the proposed systems was performed using recordings from different kindergartens of spoken nonwords. ECAPA-TDNN and low-level FFT features do not explicitly model phonetic information; wav2vec2.0 is trained on grapheme labels, our ASR acoustic model features contain (sub-)phonetic information. We found that the more granular the phonetic modeling is, the higher are the achieved recognition rates. The best system trained on ASR acoustic model features with VTLN achieved an accuracy of 89.4% and an area under the ROC (Receiver Operating Characteristic) curve (AUC) of 0.923. This corresponds to an improvement in accuracy of 20.2% and AUC of 0.309 relative compared to the FFT-baseline.

* Accepted at Interspeech 2022

Via

Access Paper or Ask Questions

A Unified Speaker Adaptation Method for Speech Synthesis using Transcribed and Untranscribed Speech with Backpropagation

Jun 18, 2019
Hieu-Thi Luong, Junichi Yamagishi

Figure 1 for A Unified Speaker Adaptation Method for Speech Synthesis using Transcribed and Untranscribed Speech with Backpropagation

Figure 2 for A Unified Speaker Adaptation Method for Speech Synthesis using Transcribed and Untranscribed Speech with Backpropagation

Figure 3 for A Unified Speaker Adaptation Method for Speech Synthesis using Transcribed and Untranscribed Speech with Backpropagation

Figure 4 for A Unified Speaker Adaptation Method for Speech Synthesis using Transcribed and Untranscribed Speech with Backpropagation

By representing speaker characteristic as a single fixed-length vector extracted solely from speech, we can train a neural multi-speaker speech synthesis model by conditioning the model on those vectors. This model can also be adapted to unseen speakers regardless of whether the transcript of adaptation data is available or not. However, this setup restricts the speaker component to just a single bias vector, which in turn limits the performance of adaptation process. In this study, we propose a novel speech synthesis model, which can be adapted to unseen speakers by fine-tuning part of or all of the network using either transcribed or untranscribed speech. Our methodology essentially consists of two steps: first, we split the conventional acoustic model into a speaker-independent (SI) linguistic encoder and a speaker-adaptive (SA) acoustic decoder; second, we train an auxiliary acoustic encoder that can be used as a substitute for the linguistic encoder whenever linguistic features are unobtainable. The results of objective and subjective evaluations show that adaptation using either transcribed or untranscribed speech with our methodology achieved a reasonable level of performance with an extremely limited amount of data and greatly improved performance with more data. Surprisingly, adaptation with untranscribed speech surpassed the transcribed counterpart in the subjective test, which reveals the limitations of the conventional acoustic model and hints at potential directions for improvements.

* Submitted to IEEE/ACM TASLP

Via

Access Paper or Ask Questions

Bayesian Transformer Language Models for Speech Recognition

Feb 09, 2021
Boyang Xue, Jianwei Yu, Junhao Xu, Shansong Liu, Shoukang Hu, Zi Ye, Mengzhe Geng, Xunying Liu, Helen Meng

Figure 1 for Bayesian Transformer Language Models for Speech Recognition

Figure 2 for Bayesian Transformer Language Models for Speech Recognition

Figure 3 for Bayesian Transformer Language Models for Speech Recognition

Figure 4 for Bayesian Transformer Language Models for Speech Recognition

State-of-the-art neural language models (LMs) represented by Transformers are highly complex. Their use of fixed, deterministic parameter estimates fail to account for model uncertainty and lead to over-fitting and poor generalization when given limited training data. In order to address these issues, this paper proposes a full Bayesian learning framework for Transformer LM estimation. Efficient variational inference based approaches are used to estimate the latent parameter posterior distributions associated with different parts of the Transformer model architecture including multi-head self-attention, feed forward and embedding layers. Statistically significant word error rate (WER) reductions up to 0.5\% absolute (3.18\% relative) and consistent perplexity gains were obtained over the baseline Transformer LMs on state-of-the-art Switchboard corpus trained LF-MMI factored TDNN systems with i-Vector speaker adaptation. Performance improvements were also obtained on a cross domain LM adaptation task requiring porting a Transformer LM trained on the Switchboard and Fisher data to a low-resource DementiaBank elderly speech corpus.

Via

Access Paper or Ask Questions

NLP-CUET@LT-EDI-EACL2021: Multilingual Code-Mixed Hope Speech Detection using Cross-lingual Representation Learner

Feb 28, 2021
Eftekhar Hossain, Omar Sharif, Mohammed Moshiul Hoque

Figure 1 for NLP-CUET@LT-EDI-EACL2021: Multilingual Code-Mixed Hope Speech Detection using Cross-lingual Representation Learner

Figure 2 for NLP-CUET@LT-EDI-EACL2021: Multilingual Code-Mixed Hope Speech Detection using Cross-lingual Representation Learner

Figure 3 for NLP-CUET@LT-EDI-EACL2021: Multilingual Code-Mixed Hope Speech Detection using Cross-lingual Representation Learner

Figure 4 for NLP-CUET@LT-EDI-EACL2021: Multilingual Code-Mixed Hope Speech Detection using Cross-lingual Representation Learner

In recent years, several systems have been developed to regulate the spread of negativity and eliminate aggressive, offensive or abusive contents from the online platforms. Nevertheless, a limited number of researches carried out to identify positive, encouraging and supportive contents. In this work, our goal is to identify whether a social media post/comment contains hope speech or not. We propose three distinct models to identify hope speech in English, Tamil and Malayalam language to serve this purpose. To attain this goal, we employed various machine learning (support vector machine, logistic regression, ensemble), deep learning (convolutional neural network + long short term memory) and transformer (m-BERT, Indic-BERT, XLNet, XLM-Roberta) based methods. Results indicate that XLM-Roberta outdoes all other techniques by gaining a weighted $f_1$-score of $0.93$, $0.60$ and $0.85$ respectively for English, Tamil and Malayalam language. Our team has achieved $1^{st}$, $2^{nd}$ and $1^{st}$ rank in these three tasks respectively.

* Winner LT-EDI workshop EACL-2021, 7 pages

Via

Access Paper or Ask Questions

Biometric Signature Verification Using Recurrent Neural Networks

May 03, 2022
Ruben Tolosana, Ruben Vera-Rodriguez, Julian Fierrez, Javier Ortega-Garcia

Figure 1 for Biometric Signature Verification Using Recurrent Neural Networks

Figure 2 for Biometric Signature Verification Using Recurrent Neural Networks

Figure 3 for Biometric Signature Verification Using Recurrent Neural Networks

Figure 4 for Biometric Signature Verification Using Recurrent Neural Networks

Architectures based on Recurrent Neural Networks (RNNs) have been successfully applied to many different tasks such as speech or handwriting recognition with state-of-the-art results. The main contribution of this work is to analyse the feasibility of RNNs for on-line signature verification in real practical scenarios. We have considered a system based on Long Short-Term Memory (LSTM) with a Siamese architecture whose goal is to learn a similarity metric from pairs of signatures. For the experimental work, the BiosecurID database comprised of 400 users and 4 separated acquisition sessions are considered. Our proposed LSTM RNN system has outperformed the results of recent published works on the BiosecurID benchmark in figures ranging from 17.76% to 28.00% relative verification performance improvement for skilled forgeries.

* Proc. The 14th IAPR International Conference on Document Analysis and Recognition, 2017

Via

Access Paper or Ask Questions

Filtered Noise Shaping for Time Domain Room Impulse Response Estimation From Reverberant Speech

Jul 15, 2021
Christian J. Steinmetz, Vamsi Krishna Ithapu, Paul Calamia

Figure 1 for Filtered Noise Shaping for Time Domain Room Impulse Response Estimation From Reverberant Speech

Figure 2 for Filtered Noise Shaping for Time Domain Room Impulse Response Estimation From Reverberant Speech

Figure 3 for Filtered Noise Shaping for Time Domain Room Impulse Response Estimation From Reverberant Speech

Figure 4 for Filtered Noise Shaping for Time Domain Room Impulse Response Estimation From Reverberant Speech

Deep learning approaches have emerged that aim to transform an audio signal so that it sounds as if it was recorded in the same room as a reference recording, with applications both in audio post-production and augmented reality. In this work, we propose FiNS, a Filtered Noise Shaping network that directly estimates the time domain room impulse response (RIR) from reverberant speech. Our domain-inspired architecture features a time domain encoder and a filtered noise shaping decoder that models the RIR as a summation of decaying filtered noise signals, along with direct sound and early reflection components. Previous methods for acoustic matching utilize either large models to transform audio to match the target room or predict parameters for algorithmic reverberators. Instead, blind estimation of the RIR enables efficient and realistic transformation with a single convolution. An evaluation demonstrates our model not only synthesizes RIRs that match parameters of the target room, such as the $T_{60}$ and DRR, but also more accurately reproduces perceptual characteristics of the target room, as shown in a listening test when compared to deep learning baselines.

* Accepted to WASPAA 2021. See details at https://facebookresearch.github.io/FiNS/

Via

Access Paper or Ask Questions

Adversarial Attacks and Defenses for Speech Recognition Systems

Mar 31, 2021
Piotr Żelasko, Sonal Joshi, Yiwen Shao, Jesus Villalba, Jan Trmal, Najim Dehak, Sanjeev Khudanpur

Figure 1 for Adversarial Attacks and Defenses for Speech Recognition Systems

The ubiquitous presence of machine learning systems in our lives necessitates research into their vulnerabilities and appropriate countermeasures. In particular, we investigate the effectiveness of adversarial attacks and defenses against automatic speech recognition (ASR) systems. We select two ASR models - a thoroughly studied DeepSpeech model and a more recent Espresso framework Transformer encoder-decoder model. We investigate two threat models: a denial-of-service scenario where fast gradient-sign method (FGSM) or weak projected gradient descent (PGD) attacks are used to degrade the model's word error rate (WER); and a targeted scenario where a more potent imperceptible attack forces the system to recognize a specific phrase. We find that the attack transferability across the investigated ASR systems is limited. To defend the model, we use two preprocessing defenses: randomized smoothing and WaveGAN-based vocoder, and find that they significantly improve the model's adversarial robustness. We show that a WaveGAN vocoder can be a useful countermeasure to adversarial attacks on ASR systems - even when it is jointly attacked with the ASR, the target phrases' word error rate is high.

* This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Via

Access Paper or Ask Questions

Large scale evaluation of importance maps in automatic speech recognition

May 21, 2020
Viet Anh Trinh, Michael I Mandel

Figure 1 for Large scale evaluation of importance maps in automatic speech recognition

Figure 2 for Large scale evaluation of importance maps in automatic speech recognition

Figure 3 for Large scale evaluation of importance maps in automatic speech recognition

Figure 4 for Large scale evaluation of importance maps in automatic speech recognition

In this paper, we propose a metric that we call the structured saliency benchmark (SSBM) to evaluate importance maps computed for automatic speech recognizers on individual utterances. These maps indicate time-frequency points of the utterance that are most important for correct recognition of a target word. Our evaluation technique is not only suitable for standard classification tasks, but is also appropriate for structured prediction tasks like sequence-to-sequence models. Additionally, we use this approach to perform a large scale comparison of the importance maps created by our previously introduced technique using "bubble noise" to identify important points through correlation with a baseline approach based on smoothed speech energy and forced alignment. Our results show that the bubble analysis approach is better at identifying important speech regions than this baseline on 100 sentences from the AMI corpus.

* submitted to INTERSPEECH 2020

Via

Access Paper or Ask Questions

Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset

Oct 28, 2020
Kun Zhou, Berrak Sisman, Rui Liu, Haizhou Li

Figure 1 for Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset

Figure 2 for Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset

Figure 3 for Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset

Figure 4 for Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset

Emotional voice conversion aims to transform emotional prosody in speech while preserving the linguistic content and speaker identity. Prior studies show that it is possible to disentangle emotional prosody using an encoder-decoder network conditioned on discrete representation, such as one-hot emotion labels. Such networks learn to remember a fixed set of emotional styles. In this paper, we propose a novel framework based on variational auto-encoding Wasserstein generative adversarial network (VAW-GAN), which makes use of a pre-trained speech emotion recognition (SER) model to transfer emotional style during training and at run-time inference. In this way, the network is able to transfer both seen and unseen emotional style to a new utterance. We show that the proposed framework achieves remarkable performance by consistently outperforming the baseline framework. This paper also marks the release of an emotional speech dataset (ESD) for voice conversion, which has multiple speakers and languages.

* Submitted to ICASSP 2021

Via

Access Paper or Ask Questions