Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"speech": models, code, and papers

Wav2vec-S: Semi-Supervised Pre-Training for Speech Recognition

Oct 09, 2021
Han Zhu, Li Wang, Ying Hou, Jindong Wang, Gaofeng Cheng, Pengyuan Zhang, Yonghong Yan

Self-supervised pre-training has dramatically improved the performance of automatic speech recognition (ASR). However, most existing self-supervised pre-training approaches are task-agnostic, i.e., could be applied to various downstream tasks. And there is a gap between the task-agnostic pre-training and the task-specific downstream fine-tuning, which may degrade the downstream performance. In this work, we propose a novel pre-training paradigm called wav2vec-S, where we use task-specific semi-supervised pre-training to bridge this gap. Specifically, the semi-supervised pre-training is conducted on the basis of self-supervised pre-training such as wav2vec 2.0. Experiments on ASR show that compared to wav2vec 2.0, wav2vec-S only requires marginal increment of pre-training time but could significantly improve ASR performance on in-domain, cross-domain and cross-lingual datasets. The average relative WER reductions are 26.3% and 6.3% for 1h and 10h fine-tuning, respectively.

  Access Paper or Ask Questions

Efficient Low-Latency Speech Enhancement with Mobile Audio Streaming Networks

Aug 17, 2020
Michał Romaniuk, Piotr Masztalski, Karol Piaskowski, Mateusz Matuszewski

We propose Mobile Audio Streaming Networks (MASnet) for efficient low-latency speech enhancement, which is particularly suitable for mobile devices and other applications where computational capacity is a limitation. MASnet processes linear-scale spectrograms, transforming successive noisy frames into complex-valued ratio masks which are then applied to the respective noisy frames. MASnet can operate in a low-latency incremental inference mode which matches the complexity of layer-by-layer batch mode. Compared to a similar fully-convolutional architecture, MASnet incorporates depthwise and pointwise convolutions for a large reduction in fused multiply-accumulate operations per second (FMA/s), at the cost of some reduction in SNR.

* Accepted for INTERSPEECH 2020 

  Access Paper or Ask Questions

Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron

Mar 24, 2018
RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron J. Weiss, Rob Clark, Rif A. Saurous

We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the prosody of the reference signal with fine time detail even when the reference and synthesis speakers are different. Additionally, we show that a reference prosody embedding can be used to synthesize text that is different from that of the reference utterance. We define several quantitative and subjective metrics for evaluating prosody transfer, and report results with accompanying audio samples from single-speaker and 44-speaker Tacotron models on a prosody transfer task.

  Access Paper or Ask Questions

Comparing heterogeneous visual gestures for measuring the diversity of visual speech signals

May 08, 2018
Helen L Bear, Richard Harvey

Visual lip gestures observed whilst lipreading have a few working definitions, the most common two are; `the visual equivalent of a phoneme' and `phonemes which are indistinguishable on the lips'. To date there is no formal definition, in part because to date we have not established a two-way relationship or mapping between visemes and phonemes. Some evidence suggests that visual speech is highly dependent upon the speaker. So here, we use a phoneme-clustering method to form new phoneme-to-viseme maps for both individual and multiple speakers. We test these phoneme to viseme maps to examine how similarly speakers talk visually and we use signed rank tests to measure the distance between individuals. We conclude that broadly speaking, speakers have the same repertoire of mouth gestures, where they differ is in the use of the gestures.

* Computer Speech and Language, May 2018 

  Access Paper or Ask Questions

Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings

Aug 09, 2019
Michael Wray, Diane Larlus, Gabriela Csurka, Dima Damen

We address the problem of cross-modal fine-grained action retrieval between text and video. Cross-modal retrieval is commonly achieved through learning a shared embedding space, that can indifferently embed modalities. In this paper, we propose to enrich the embedding by disentangling parts-of-speech (PoS) in the accompanying captions. We build a separate multi-modal embedding space for each PoS tag. The outputs of multiple PoS embeddings are then used as input to an integrated multi-modal space, where we perform action retrieval. All embeddings are trained jointly through a combination of PoS-aware and PoS-agnostic losses. Our proposal enables learning specialised embedding spaces that offer multiple views of the same embedded entities. We report the first retrieval results on fine-grained actions for the large-scale EPIC dataset, in a generalised zero-shot setting. Results show the advantage of our approach for both video-to-text and text-to-video action retrieval. We also demonstrate the benefit of disentangling the PoS for the generic task of cross-modal video retrieval on the MSR-VTT dataset.

* Accepted for presentation at ICCV. Project Page: 

  Access Paper or Ask Questions

Hard Sample Mining for the Improved Retraining of Automatic Speech Recognition

Apr 17, 2019
Jiabin Xue, Jiqing Han, Tieran Zheng, Jiaxing Guo, Boyong Wu

It is an effective way that improves the performance of the existing Automatic Speech Recognition (ASR) systems by retraining with more and more new training data in the target domain. Recently, Deep Neural Network (DNN) has become a successful model in the ASR field. In the training process of the DNN based methods, a back propagation of error between the transcription and the corresponding annotated text is used to update and optimize the parameters. Thus, the parameters are more influenced by the training samples with a big propagation error than the samples with a small one. In this paper, we define the samples with significant error as the hard samples and try to improve the performance of the ASR system by adding many of them. Unfortunately, the hard samples are sparse in the training data of the target domain, and manually label them is expensive. Therefore, we propose a hard samples mining method based on an enhanced deep multiple instance learning, which can find the hard samples from unlabeled training data by using a small subset of the dataset with manual labeling in the target domain. We applied our method to an End2End ASR task and obtained the best performance.

* Submitted to Interspeech 2019; 

  Access Paper or Ask Questions

Hate Speech Detection from Code-mixed Hindi-English Tweets Using Deep Learning Models

Nov 13, 2018
Satyajit Kamble, Aditya Joshi

This paper reports an increment to the state-of-the-art in hate speech detection for English-Hindi code-mixed tweets. We compare three typical deep learning models using domain-specific embeddings. On experimenting with a benchmark dataset of English-Hindi code-mixed tweets, we observe that using domain-specific embeddings results in an improved representation of target groups, and an improved F-score.

* This paper will appear at the 15th International Conference on Natural Language Processing (ICON-2018) in India in December 2018. ICON is a premier NLP conference in India 

  Access Paper or Ask Questions

Definition of Visual Speech Element and Research on a Method of Extracting Feature Vector for Korean Lip-Reading

Nov 15, 2014
Ha Jong Won, Li Gwang Chol, Kim Hyok Chol, Li Kum Song

In this paper, we defined the viseme (visual speech element) and described about the method of extracting visual feature vector. We defined the 10 visemes based on vowel by analyzing of Korean utterance and proposed the method of extracting the 20-dimensional visual feature vector, combination of static features and dynamic features. Lastly, we took an experiment in recognizing words based on 3-viseme HMM and evaluated the efficiency.

  Access Paper or Ask Questions

More for Less: Non-Intrusive Speech Quality Assessment with Limited Annotations

Aug 19, 2021
Alessandro Ragano, Emmanouil Benetos, Andrew Hines

Non-intrusive speech quality assessment is a crucial operation in multimedia applications. The scarcity of annotated data and the lack of a reference signal represent some of the main challenges for designing efficient quality assessment metrics. In this paper, we propose two multi-task models to tackle the problems above. In the first model, we first learn a feature representation with a degradation classifier on a large dataset. Then we perform MOS prediction and degradation classification simultaneously on a small dataset annotated with MOS. In the second approach, the initial stage consists of learning features with a deep clustering-based unsupervised feature representation on the large dataset. Next, we perform MOS prediction and cluster label classification simultaneously on a small dataset. The results show that the deep clustering-based model outperforms the degradation classifier-based model and the 3 baselines (autoencoder features, P.563, and SRMRnorm) on TCD-VoIP. This paper indicates that multi-task learning combined with feature representations from unlabelled data is a promising approach to deal with the lack of large MOS annotated datasets.

* Published in 2021 13th International Conference on Quality of Multimedia Experience (QoMEX) 

  Access Paper or Ask Questions

Feature selection using Fisher's ratio technique for automatic speech recognition

May 13, 2015
Sarika Hegde, K. K. Achary, Surendra Shetty

Automatic Speech Recognition involves mainly two steps; feature extraction and classification . Mel Frequency Cepstral Coefficient is used as one of the prominent feature extraction techniques in ASR. Usually, the set of all 12 MFCC coefficients is used as the feature vector in the classification step. But the question is whether the same or improved classification accuracy can be achieved by using a subset of 12 MFCC as feature vector. In this paper, Fisher's ratio technique is used for selecting a subset of 12 MFCC coefficients that contribute more in discriminating a pattern. The selected coefficients are used in classification with Hidden Markov Model algorithm. The classification accuracies that we get by using 12 coefficients and by using the selected coefficients are compared.

* in International Journal on Cybernetics & Informatics (IJCI) Vol. 4, No. 2, April 2015 

  Access Paper or Ask Questions