
"speech": models, code, and papers

TuGeBiC: A Turkish German Bilingual Code-Switching Corpus

May 02, 2022
Jeanine Treffers-Daller, Ozlem Çetinoğlu

In this paper we describe the process of collection, transcription, and annotation of recordings of spontaneous speech samples from Turkish-German bilinguals, and the compilation of a corpus called TuGeBiC. Participants in the study were adult Turkish-German bilinguals living in Germany or Turkey at the time of recording in the first half of the 1990s. The data were manually tokenised and normalised, and all proper names (names of participants and places mentioned in the conversations) were replaced with pseudonyms. Token-level automatic language identification was performed, which made it possible to establish the proportions of words from each language. The corpus is roughly balanced between both languages. We also present quantitative information about the number of code-switches, and give examples of different types of code-switching found in the data. The resulting corpus has been made freely available to the research community.
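The token-level language identification step described above, which yields the per-language word proportions, can be sketched as follows. This is a minimal illustration, not the corpus authors' pipeline: the tiny Turkish and German lexicons and the lookup-based labelling are placeholder assumptions, whereas a real system would use trained identification models.

```python
from collections import Counter

# Hypothetical miniature lexicons; a real system would use trained
# language-ID models or full Turkish and German word lists.
TURKISH = {"ben", "ev", "geldim", "okula"}
GERMAN = {"ich", "haus", "schule", "dann"}

def identify_token(token):
    """Assign a language label to a single token by lexicon lookup."""
    t = token.lower()
    if t in TURKISH:
        return "tr"
    if t in GERMAN:
        return "de"
    return "other"

def language_proportions(tokens):
    """Return the share of tokens per language label."""
    counts = Counter(identify_token(t) for t in tokens)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}

# A toy code-switched utterance: German and Turkish tokens mixed.
props = language_proportions(["ich", "geldim", "dann", "okula"])
```

Summing these proportions over the whole corpus is what lets the authors report that TuGeBiC is roughly balanced between the two languages.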

VScript: Controllable Script Generation with Audio-Visual Presentation

Mar 01, 2022
Ziwei Ji, Yan Xu, I-Tsun Cheng, Samuel Cahyawijaya, Rita Frieske, Etsuko Ishii, Min Zeng, Andrea Madotto, Pascale Fung

Automatic script generation could save a considerable amount of resources and offer inspiration to professional scriptwriters. We present VScript, a controllable pipeline that generates complete scripts, including dialogues and scene descriptions, and presents them visually using video retrieval and aurally using text-to-speech for spoken dialogue. With an interactive interface, our system allows users to select genres and input starting words that control the theme and development of the generated script. We adopt a hierarchical structure that generates the plot first, then the script and its audio-visual presentation. We also introduce a novel approach to plot-guided dialogue generation by treating it as inverse dialogue summarization. Experimental results show that our approach outperforms the baselines on both automatic and human evaluations, especially in terms of genre control.

L3DAS22 Challenge: Learning 3D Audio Sources in a Real Office Environment

Feb 21, 2022
Eric Guizzo, Christian Marinoni, Marco Pennese, Xinlei Ren, Xiguang Zheng, Chen Zhang, Bruno Masiero, Aurelio Uncini, Danilo Comminiello

The L3DAS22 Challenge is aimed at encouraging the development of machine learning strategies for 3D speech enhancement and 3D sound localization and detection in office-like environments. This challenge improves and extends the tasks of the L3DAS21 edition. We generated a new dataset, which maintains the same general characteristics as the L3DAS21 datasets but contains more data points and adds constraints that improve the baseline model's efficiency and overcome the major difficulties encountered by participants in the previous challenge. We updated the baseline model of Task 1, using the architecture that ranked first in the previous challenge edition. We wrote a new supporting API, improving its clarity and ease of use. Finally, we present and discuss the results submitted by all participants. L3DAS22 Challenge website:

* Accepted to 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2022). arXiv admin note: substantial text overlap with arXiv:2104.05499 

Group Gated Fusion on Attention-based Bidirectional Alignment for Multimodal Emotion Recognition

Jan 17, 2022
Pengfei Liu, Kun Li, Helen Meng

Emotion recognition is a challenging and actively studied research area that plays a critical role in emotion-aware human-computer interaction systems. In a multimodal setting, temporal alignment between different modalities has not yet been well investigated. This paper presents a new model named the Gated Bidirectional Alignment Network (GBAN), which consists of an attention-based bidirectional alignment network over LSTM hidden states to explicitly capture the alignment relationship between speech and text, and a novel group gated fusion (GGF) layer to integrate the representations of the different modalities. We empirically show that the attention-aligned representations significantly outperform the last hidden states of the LSTM, and that the proposed GBAN model outperforms existing state-of-the-art multimodal approaches on the IEMOCAP dataset.
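The two ingredients named in the abstract, attention pooling over LSTM hidden states and a gated fusion of the resulting modality representations, can be sketched in miniature. This is a hand-rolled toy with made-up two-dimensional vectors, not the paper's learned architecture: GBAN's attention and group gates are trained parameters, whereas here the query vectors and the mean-based gate are illustrative stand-ins.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attention_pool(query, states):
    """Dot-product attention: weight each hidden state by its
    similarity to the query, then return the weighted sum."""
    scores = [sum(q * h for q, h in zip(query, s)) for s in states]
    weights = softmax(scores)
    dim = len(states[0])
    return [sum(w * s[i] for w, s in zip(weights, states)) for i in range(dim)]

def gated_fusion(reps):
    """Toy gate: sigmoid of each representation's mean scales it
    before summation; a stand-in for the learned GGF layer."""
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    gates = [sigmoid(sum(r) / len(r)) for r in reps]
    dim = len(reps[0])
    return [sum(g * r[i] for g, r in zip(gates, reps)) for i in range(dim)]

# Two tiny "modality" sequences of 2-D hidden states.
speech = attention_pool([1.0, 0.0], [[0.2, 0.1], [0.9, 0.4]])
text = attention_pool([0.0, 1.0], [[0.3, 0.8], [0.1, 0.2]])
fused = gated_fusion([speech, text])
```

The attention output is a convex combination of the hidden states, which is why it can land anywhere within their span rather than being stuck at the final time step.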

* Published in INTERSPEECH-2020 

Dataset of Spatial Room Impulse Responses in a Variable Acoustics Room for Six Degrees-of-Freedom Rendering and Analysis

Nov 23, 2021
Thomas McKenzie, Leo McCormack, Christoph Hold

Room acoustics measurements are used in many areas of audio research, from physical acoustics modelling and speech enhancement to virtual reality applications. This paper documents the technical specifications and choices made in the measurement of a dataset of spatial room impulse responses (SRIRs) in a variable acoustics room. Two spherical microphone arrays are used: the mh Acoustics Eigenmike em32 and the Zylia ZM-1, capable of up to fourth- and third-order Ambisonic capture, respectively. The dataset consists of three source and seven receiver positions, repeated with five configurations of the room's acoustics with varying levels of reverberation. Possible applications of the dataset include six degrees-of-freedom (6DoF) analysis and rendering, SRIR interpolation methods, and spatial dereverberation techniques.

* 3 pages, 3 figures, 2 tables 

Vowel-based Meeteilon dialect identification using a Random Forest classifier

Jul 26, 2021
Thangjam Clarinda Devi, Kabita Thaoroijam

This paper presents a vowel-based dialect identification system for Meeteilon. For this work, a vowel dataset is created using the Meeteilon Speech Corpora available at the Linguistic Data Consortium for Indian Languages (LDC-IL). Spectral features such as formant frequencies (F1, F2 and F3) and prosodic features such as pitch (F0), energy, intensity and segment duration are extracted from monophthong vowel sounds. A random forest classifier, a decision-tree-based ensemble algorithm, is used to classify three major dialects of Meeteilon, namely Imphal, Kakching and Sekmai. The model achieves an average dialect identification accuracy of around 61.57%. Both spectral and prosodic features are found to play a significant role in Meeteilon dialect classification.
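The classification setup described above can be sketched with scikit-learn. The feature values and the query vector below are fabricated for illustration; real inputs would be per-vowel measurements extracted from the LDC-IL Meeteilon Speech Corpora, and the hyperparameters are arbitrary, not the paper's.

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical per-vowel feature vectors:
# [F1 (Hz), F2 (Hz), F3 (Hz), F0 (Hz), energy, duration (s)]
X = [
    [300, 2200, 2900, 190, 0.6, 0.11],  # Imphal
    [310, 2150, 2950, 185, 0.5, 0.12],  # Imphal
    [520, 1500, 2600, 210, 0.7, 0.09],  # Kakching
    [540, 1450, 2650, 205, 0.8, 0.10],  # Kakching
    [700, 1100, 2500, 230, 0.4, 0.14],  # Sekmai
    [690, 1150, 2450, 235, 0.5, 0.13],  # Sekmai
]
y = ["Imphal", "Imphal", "Kakching", "Kakching", "Sekmai", "Sekmai"]

# A small decision-tree ensemble; majority vote across trees gives the label.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)
pred = clf.predict([[305, 2180, 2920, 188, 0.55, 0.115]])
```

With only a handful of well-separated toy points the forest fits the training data perfectly; the paper's ~61.57% accuracy reflects the much harder real task.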

* 5 pages, double column, 8 figures, 1 table. Already presented as a poster at OCOCOSDA 2020 but not yet published 

NLP is Not enough -- Contextualization of User Input in Chatbots

May 13, 2021
Nathan Dolbir, Triyasha Dastidar, Kaushik Roy

AI chatbots have made vast strides in technology improvement in recent years and are already operational in many industries. Advanced natural language processing techniques, based on deep networks, efficiently process user requests to carry out their functions. As chatbots gain traction, their applicability in healthcare is an attractive proposition because they reduce the economic and personnel costs of an overburdened system. However, healthcare bots require safe and medically accurate information capture, which deep networks cannot yet guarantee given variations in user text and speech. Knowledge in symbolic structures is better suited for accurate reasoning but cannot handle natural language processing directly. Thus, in this paper, we study the effects of combining knowledge and neural representations on chatbot safety, accuracy, and understanding.

Personalized Keyphrase Detection using Speaker and Environment Information

Apr 28, 2021
Rajeev Rikhye, Quan Wang, Qiao Liang, Yanzhang He, Ding Zhao, Yiteng Huang, Arun Narayanan, Ian McGraw

In this paper, we introduce a streaming keyphrase detection system that can be easily customized to accurately detect any phrase composed of words from a large vocabulary. The system is implemented with an end-to-end trained automatic speech recognition (ASR) model and a text-independent speaker verification model. To address the challenge of detecting these keyphrases under various noisy conditions, a speaker separation model is added to the feature frontend of the speaker verification model, and an adaptive noise cancellation (ANC) algorithm is included to exploit cross-microphone noise coherence. Our experiments show that the text-independent speaker verification model largely reduces the false triggering rate of the keyphrase detection, while the speaker separation model and adaptive noise cancellation largely reduce false rejections.

Data Quality as Predictor of Voice Anti-Spoofing Generalization

Mar 26, 2021
Bhusan Chettri, Rosa González Hautamäki, Md Sahidullah, Tomi Kinnunen

Voice anti-spoofing aims at classifying a given speech input either as a bonafide human sample, or a spoofing attack (e.g. a synthetic or replayed sample). Numerous voice anti-spoofing methods have been proposed, but most of them fail to generalize across domains (corpora) -- and we do not know \emph{why}. We outline a novel interpretative framework for gauging the impact of data quality upon anti-spoofing performance. Our within- and between-domain experiments pool data from seven public corpora and three anti-spoofing methods based on Gaussian mixture and convolutional neural network models. We assess the impacts of long-term spectral information, speaker population (through x-vector speaker embeddings), signal-to-noise ratio, and selected voice quality features.

* Submitted to INTERSPEECH 2021 

Commonsense Knowledge Mining from Term Definitions

Feb 01, 2021
Zhicheng Liang, Deborah L. McGuinness

Commonsense knowledge has proven to be beneficial to a variety of application areas, including question answering and natural language understanding. Previous work explored collecting commonsense knowledge triples automatically from text to increase the coverage of current commonsense knowledge graphs. We investigate several machine learning approaches to mining commonsense knowledge triples using dictionary term definitions as inputs and provide an initial evaluation of the results. We start by extracting candidate triples from text using part-of-speech tag patterns, and then compare the performance of three existing models for triple scoring. Our experiments show that term definitions contain some valid and novel commonsense knowledge triples for some semantic relations, and also reveal some challenges with using existing triple scoring models.
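The candidate-extraction step described above, matching part-of-speech tag patterns over a definition, can be sketched as follows. This is an illustrative simplification, not the paper's exact patterns: the input is hand-tagged here (a real pipeline would run a POS tagger), and the single NOUN-VERB-NOUN window is an assumed example pattern.

```python
def extract_triples(tagged):
    """Scan (token, tag) pairs for NOUN-VERB-NOUN windows and emit
    each match as a candidate (head, relation, tail) triple."""
    triples = []
    for i in range(len(tagged) - 2):
        (w1, t1), (w2, t2), (w3, t3) = tagged[i:i + 3]
        if t1 == "NOUN" and t2 == "VERB" and t3 == "NOUN":
            triples.append((w1, w2, w3))
    return triples

# A toy pre-tagged definition fragment.
definition = [("knife", "NOUN"), ("cuts", "VERB"), ("bread", "NOUN"),
              ("in", "ADP"), ("kitchens", "NOUN")]
triples = extract_triples(definition)
```

Candidates produced this way are noisy, which is exactly why the paper passes them to separate triple-scoring models rather than accepting every match.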

* In the Commonsense Knowledge Graphs (CSKGs) Workshop of the 35th AAAI Conference on Artificial Intelligence (AAAI 2021) 
