Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"speech": models, code, and papers

Domain Adaptation and Autoencoder Based Unsupervised Speech Enhancement

Dec 09, 2021
Yi Li, Yang Sun, Kirill Horoshenkov, Syed Mohsen Naqvi

As a category of transfer learning, domain adaptation plays an important role in generalizing the model trained in one task and applying it to other similar tasks or settings. In speech enhancement, a well-trained acoustic model can be exploited to obtain the speech signal in the context of other languages, speakers, and environments. Recent domain adaptation research was developed more effectively with various neural networks and high-level abstract features. However, the related studies are more likely to transfer the well-trained model from a rich and more diverse domain to a limited and similar domain. Therefore, in this study, the domain adaptation method is proposed in unsupervised speech enhancement for the opposite circumstance that transferring to a larger and richer domain. On the one hand, the importance-weighting (IW) approach is exploited with a variance constrained autoencoder to reduce the shift of shared weights between the source and target domains. On the other hand, in order to train the classifier with the worst-case weights and minimize the risk, the minimax method is proposed. Both the proposed IW and minimax methods are evaluated from the VOICE BANK and IEEE datasets to the TIMIT dataset. The experiment results show that the proposed methods outperform the state-of-the-art approaches.

* IEEE Transactions on Artificial Intelligence. (2021) 

  Access Paper or Ask Questions

nnSpeech: Speaker-Guided Conditional Variational Autoencoder for Zero-shot Multi-speaker Text-to-Speech

Feb 22, 2022
Botao Zhao, Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao

Multi-speaker text-to-speech (TTS) using a few adaption data is a challenge in practical applications. To address that, we propose a zero-shot multi-speaker TTS, named nnSpeech, that could synthesis a new speaker voice without fine-tuning and using only one adaption utterance. Compared with using a speaker representation module to extract the characteristics of new speakers, our method bases on a speaker-guided conditional variational autoencoder and can generate a variable Z, which contains both speaker characteristics and content information. The latent variable Z distribution is approximated by another variable conditioned on reference mel-spectrogram and phoneme. Experiments on the English corpus, Mandarin corpus, and cross-dataset proves that our model could generate natural and similar speech with only one adaption speech.

  Access Paper or Ask Questions

Wav2vec-C: A Self-supervised Model for Speech Representation Learning

Mar 09, 2021
Samik Sadhu, Di He, Che-Wei Huang, Sri Harish Mallidi, Minhua Wu, Ariya Rastrow, Andreas Stolcke, Jasha Droppo, Roland Maas

Wav2vec-C introduces a novel representation learning technique combining elements from wav2vec 2.0 and VQ-VAE. Our model learns to reproduce quantized representations from partially masked speech encoding using a contrastive loss in a way similar to Wav2vec 2.0. However, the quantization process is regularized by an additional consistency network that learns to reconstruct the input features to the wav2vec 2.0 network from the quantized representations in a way similar to a VQ-VAE model. The proposed self-supervised model is trained on 10k hours of unlabeled data and subsequently used as the speech encoder in a RNN-T ASR model and fine-tuned with 1k hours of labeled data. This work is one of only a few studies of self-supervised learning on speech tasks with a large volume of real far-field labeled data. The Wav2vec-C encoded representations achieves, on average, twice the error reduction over baseline and a higher codebook utilization in comparison to wav2vec 2.0

  Access Paper or Ask Questions

Are Chess Discussions Racist? An Adversarial Hate Speech Data Set

Nov 20, 2020
Rupak Sarkar, Ashiqur R. KhudaBukhsh

On June 28, 2020, while presenting a chess podcast on Grandmaster Hikaru Nakamura, Antonio Radi\'c's YouTube handle got blocked because it contained "harmful and dangerous" content. YouTube did not give further specific reason, and the channel got reinstated within 24 hours. However, Radi\'c speculated that given the current political situation, a referral to "black against white", albeit in the context of chess, earned him this temporary ban. In this paper, via a substantial corpus of 681,995 comments, on 8,818 YouTube videos hosted by five highly popular chess-focused YouTube channels, we ask the following research question: \emph{how robust are off-the-shelf hate-speech classifiers to out-of-domain adversarial examples?} We release a data set of 1,000 annotated comments where existing hate speech classifiers misclassified benign chess discussions as hate speech. We conclude with an intriguing analogy result on racial bias with our findings pointing out to the broader challenge of color polysemy.

  Access Paper or Ask Questions

TSTNN: Two-stage Transformer based Neural Network for Speech Enhancement in the Time Domain

Mar 18, 2021
Kai Wang, Bengbeng He, Wei-Ping Zhu

In this paper, we propose a transformer-based architecture, called two-stage transformer neural network (TSTNN) for end-to-end speech denoising in the time domain. The proposed model is composed of an encoder, a two-stage transformer module (TSTM), a masking module and a decoder. The encoder maps input noisy speech into feature representation. The TSTM exploits four stacked two-stage transformer blocks to efficiently extract local and global information from the encoder output stage by stage. The masking module creates a mask which will be multiplied with the encoder output. Finally, the decoder uses the masked encoder feature to reconstruct the enhanced speech. Experimental results on the benchmark dataset show that the TSTNN outperforms most state-of-the-art models in time or frequency domain while having significantly lower model complexity.

* 5 pages, 4 figures, accepted by IEEE ICASSP 2021 

  Access Paper or Ask Questions

Speaker Adapted Beamforming for Multi-Channel Automatic Speech Recognition

Jun 19, 2018
Tobias Menne, Ralf Schlüter, Hermann Ney

This paper presents, in the context of multi-channel ASR, a method to adapt a mask based, statistically optimal beamforming approach to a speaker of interest. The beamforming vector of the statistically optimal beamformer is computed by utilizing speech and noise masks, which are estimated by a neural network. The proposed adaptation approach is based on the integration of the beamformer, which includes the mask estimation network, and the acoustic model of the ASR system. This allows for the propagation of the training error, from the acoustic modeling cost function, all the way through the beamforming operation and through the mask estimation network. By using the results of a first pass recognition and by keeping all other parameters fixed, the mask estimation network can therefore be fine tuned by retraining. Utterances of a speaker of interest can thus be used in a two pass approach, to optimize the beamforming for the speech characteristics of that specific speaker. It is shown that this approach improves the ASR performance of a state-of-the-art multi-channel ASR system on the CHiME-4 data. Furthermore the effect of the adaptation on the estimated speech masks is discussed.

* submitted to IEEE SLT 2018 

  Access Paper or Ask Questions

Should we hard-code the recurrence concept or learn it instead ? Exploring the Transformer architecture for Audio-Visual Speech Recognition

May 19, 2020
George Sterpu, Christian Saam, Naomi Harte

The audio-visual speech fusion strategy AV Align has shown significant performance improvements in audio-visual speech recognition (AVSR) on the challenging LRS2 dataset. Performance improvements range between 7% and 30% depending on the noise level when leveraging the visual modality of speech in addition to the auditory one. This work presents a variant of AV Align where the recurrent Long Short-term Memory (LSTM) computation block is replaced by the more recently proposed Transformer block. We compare the two methods, discussing in greater detail their strengths and weaknesses. We find that Transformers also learn cross-modal monotonic alignments, but suffer from the same visual convergence problems as the LSTM model, calling for a deeper investigation into the dominant modality problem in machine learning.

* Submitted to INTERSPEECH 2020 

  Access Paper or Ask Questions

Optimal Transport-based Adaptation in Dysarthric Speech Tasks

Apr 06, 2021
Rosanna Turrisi, Leonardo Badino

In many real-world applications, the mismatch between distributions of training data (source) and test data (target) significantly degrades the performance of machine learning algorithms. In speech data, causes of this mismatch include different acoustic environments or speaker characteristics. In this paper, we address this issue in the challenging context of dysarthric speech, by multi-source domain/speaker adaptation (MSDA/MSSA). Specifically, we propose the use of an optimal-transport based approach, called MSDA via Weighted Joint Optimal Transport (MSDA-WDJOT). We confront the mismatch problem in dysarthria detection for which the proposed approach outperforms both the Baseline and the state-of-the-art MSDA models, improving the detection accuracy of 0.9% over the best competitor method. We then employ MSDA-WJDOT for dysarthric speaker adaptation in command speech recognition. This provides a Command Error Rate relative reduction of 16% and 7% over the baseline and the best competitor model, respectively. Interestingly, MSDA-WJDOT provides a similarity score between the source and the target, i.e. between speakers in this case. We leverage this similarity measure to define a Dysarthric and Healthy score of the target speaker and diagnose the dysarthria with an accuracy of 95%.

  Access Paper or Ask Questions

Emotionless: Privacy-Preserving Speech Analysis for Voice Assistants

Aug 09, 2019
Ranya Aloufi, Hamed Haddadi, David Boyle

Voice-enabled interactions provide more human-like experiences in many popular IoT systems. Cloud-based speech analysis services extract useful information from voice input using speech recognition techniques. The voice signal is a rich resource that discloses several possible states of a speaker, such as emotional state, confidence and stress levels, physical condition, age, gender, and personal traits. Service providers can build a very accurate profile of a user's demographic category, personal preferences, and may compromise privacy. To address this problem, a privacy-preserving intermediate layer between users and cloud services is proposed to sanitize the voice input. It aims to maintain utility while preserving user privacy. It achieves this by collecting real time speech data and analyzes the signal to ensure privacy protection prior to sharing of this data with services providers. Precisely, the sensitive representations are extracted from the raw signal by using transformation functions and then wrapped it via voice conversion technology. Experimental evaluation based on emotion recognition to assess the efficacy of the proposed method shows that identification of sensitive emotional state of the speaker is reduced by ~96 %.

* 5 pages, 4 figures, privacy Preserving Machine Learning Workshop, CCS 2019 

  Access Paper or Ask Questions

One to rule them all: Towards Joint Indic Language Hate Speech Detection

Sep 28, 2021
Mehar Bhatia, Tenzin Singhay Bhotia, Akshat Agarwal, Prakash Ramesh, Shubham Gupta, Kumar Shridhar, Felix Laumann, Ayushman Dash

This paper is a contribution to the Hate Speech and Offensive Content Identification in Indo-European Languages (HASOC) 2021 shared task. Social media today is a hotbed of toxic and hateful conversations, in various languages. Recent news reports have shown that current models struggle to automatically identify hate posted in minority languages. Therefore, efficiently curbing hate speech is a critical challenge and problem of interest. We present a multilingual architecture using state-of-the-art transformer language models to jointly learn hate and offensive speech detection across three languages namely, English, Hindi, and Marathi. On the provided testing corpora, we achieve Macro F1 scores of 0.7996, 0.7748, 0.8651 for sub-task 1A and 0.6268, 0.5603 during the fine-grained classification of sub-task 1B. These results show the efficacy of exploiting a multilingual training scheme.

* submitted to FIRE 2021 in the HASOC-FIRE shared task on hate speech and offensive language detection 

  Access Paper or Ask Questions