Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

"speech": models, code, and papers

Improving Transformer-based Conversational ASR by Inter-Sentential Attention Mechanism

Jul 02, 2022
Kun Wei, Pengcheng Guo, Ning Jiang

Figure 1 for Improving Transformer-based Conversational ASR by Inter-Sentential Attention Mechanism

Figure 2 for Improving Transformer-based Conversational ASR by Inter-Sentential Attention Mechanism

Figure 3 for Improving Transformer-based Conversational ASR by Inter-Sentential Attention Mechanism

Figure 4 for Improving Transformer-based Conversational ASR by Inter-Sentential Attention Mechanism

Transformer-based models have demonstrated their effectiveness in automatic speech recognition (ASR) tasks and even shown superior performance over the conventional hybrid framework. The main idea of Transformers is to capture the long-range global context within an utterance by self-attention layers. However, for scenarios like conversational speech, such utterance-level modeling will neglect contextual dependencies that span across utterances. In this paper, we propose to explicitly model the inter-sentential information in a Transformer based end-to-end architecture for conversational speech recognition. Specifically, for the encoder network, we capture the contexts of previous speech and incorporate such historic information into current input by a context-aware residual attention mechanism. For the decoder, the prediction of current utterance is also conditioned on the historic linguistic information through a conditional decoder framework. We show the effectiveness of our proposed method on several open-source dialogue corpora and the proposed method consistently improved the performance from the utterance-level Transformer-based ASR models.

* Accepted by Interspeech2022

Via

Access Paper or Ask Questions

SUPERB: Speech processing Universal PERformance Benchmark

May 03, 2021
Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y. Lin, Andy T. Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, Tzu-Hsien Huang, Wei-Cheng Tseng, Ko-tik Lee, Da-Rong Liu, Zili Huang, Shuyan Dong, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, Hung-yi Lee

Figure 1 for SUPERB: Speech processing Universal PERformance Benchmark

Figure 2 for SUPERB: Speech processing Universal PERformance Benchmark

Self-supervised learning (SSL) has proven vital for advancing research in natural language processing (NLP) and computer vision (CV). The paradigm pretrains a shared model on large volumes of unlabeled data and achieves state-of-the-art (SOTA) for various tasks with minimal adaptation. However, the speech processing community lacks a similar setup to systematically explore the paradigm. To bridge this gap, we introduce Speech processing Universal PERformance Benchmark (SUPERB). SUPERB is a leaderboard to benchmark the performance of a shared model across a wide range of speech processing tasks with minimal architecture changes and labeled data. Among multiple usages of the shared model, we especially focus on extracting the representation learned from SSL due to its preferable re-usability. We present a simple framework to solve SUPERB tasks by learning task-specialized lightweight prediction heads on top of the frozen shared model. Our results demonstrate that the framework is promising as SSL representations show competitive generalizability and accessibility across SUPERB tasks. We release SUPERB as a challenge with a leaderboard and a benchmark toolkit to fuel the research in representation learning and general speech processing.

* Submitted to Interspeech 2021

Via

Access Paper or Ask Questions

Quantifying Bias in Automatic Speech Recognition

Apr 01, 2021
Siyuan Feng, Olya Kudina, Bence Mark Halpern, Odette Scharenborg

Figure 1 for Quantifying Bias in Automatic Speech Recognition

Figure 2 for Quantifying Bias in Automatic Speech Recognition

Figure 3 for Quantifying Bias in Automatic Speech Recognition

Figure 4 for Quantifying Bias in Automatic Speech Recognition

Automatic speech recognition (ASR) systems promise to deliver objective interpretation of human speech. Practice and recent evidence suggests that the state-of-the-art (SotA) ASRs struggle with the large variation in speech due to e.g., gender, age, speech impairment, race, and accents. Many factors can cause the bias of an ASR system. Our overarching goal is to uncover bias in ASR systems to work towards proactive bias mitigation in ASR. This paper is a first step towards this goal and systematically quantifies the bias of a Dutch SotA ASR system against gender, age, regional accents and non-native accents. Word error rates are compared, and an in-depth phoneme-level error analysis is conducted to understand where bias is occurring. We primarily focus on bias due to articulation differences in the dataset. Based on our findings, we suggest bias mitigation strategies for ASR development.

* Submitted to INTERSPEECH (IS) 2021. This preprint version differs slightly from the version submitted to IS 2021: Figure 1 is not included in IS 2021

Via

Access Paper or Ask Questions

BirdSoundsDenoising: Deep Visual Audio Denoising for Bird Sounds

Oct 18, 2022
Youshan Zhang, Jialu Li

Figure 1 for BirdSoundsDenoising: Deep Visual Audio Denoising for Bird Sounds

Figure 2 for BirdSoundsDenoising: Deep Visual Audio Denoising for Bird Sounds

Figure 3 for BirdSoundsDenoising: Deep Visual Audio Denoising for Bird Sounds

Figure 4 for BirdSoundsDenoising: Deep Visual Audio Denoising for Bird Sounds

Audio denoising has been explored for decades using both traditional and deep learning-based methods. However, these methods are still limited to either manually added artificial noise or lower denoised audio quality. To overcome these challenges, we collect a large-scale natural noise bird sound dataset. We are the first to transfer the audio denoising problem into an image segmentation problem and propose a deep visual audio denoising (DVAD) model. With a total of 14,120 audio images, we develop an audio ImageMask tool and propose to use a few-shot generalization strategy to label these images. Extensive experimental results demonstrate that the proposed model achieves state-of-the-art performance. We also show that our method can be easily generalized to speech denoising, audio separation, audio enhancement, and noise estimation.

* WACV 2023

Via

Access Paper or Ask Questions

A Study into Pre-training Strategies for Spoken Language Understanding on Dysarthric Speech

Jun 15, 2021
Pu Wang, Bagher BabaAli, Hugo Van hamme

Figure 1 for A Study into Pre-training Strategies for Spoken Language Understanding on Dysarthric Speech

Figure 2 for A Study into Pre-training Strategies for Spoken Language Understanding on Dysarthric Speech

Figure 3 for A Study into Pre-training Strategies for Spoken Language Understanding on Dysarthric Speech

Figure 4 for A Study into Pre-training Strategies for Spoken Language Understanding on Dysarthric Speech

End-to-end (E2E) spoken language understanding (SLU) systems avoid an intermediate textual representation by mapping speech directly into intents with slot values. This approach requires considerable domain-specific training data. In low-resource scenarios this is a major concern, e.g., in the present study dealing with SLU for dysarthric speech. Pretraining part of the SLU model for automatic speech recognition targets helps but no research has shown to which extent SLU on dysarthric speech benefits from knowledge transferred from other dysarthric speech tasks. This paper investigates the efficiency of pre-training strategies for SLU tasks on dysarthric speech. The designed SLU system consists of a TDNN acoustic model for feature encoding and a capsule network for intent and slot decoding. The acoustic model is pre-trained in two stages: initialization with a corpus of normal speech and finetuning on a mixture of dysarthric and normal speech. By introducing the intelligibility score as a metric of the impairment severity, this paper quantitatively analyzes the relation between generalization and pathology severity for dysarthric speech.

* Accepted by Interspeech 2021

Via

Access Paper or Ask Questions

Noisy-target Training: A Training Strategy for DNN-based Speech Enhancement without Clean Speech

Jan 21, 2021
Takuya Fujimura, Yuma Koizumi, Kohei Yatabe, Ryoichi Miyazaki

Figure 1 for Noisy-target Training: A Training Strategy for DNN-based Speech Enhancement without Clean Speech

Figure 2 for Noisy-target Training: A Training Strategy for DNN-based Speech Enhancement without Clean Speech

Figure 3 for Noisy-target Training: A Training Strategy for DNN-based Speech Enhancement without Clean Speech

Figure 4 for Noisy-target Training: A Training Strategy for DNN-based Speech Enhancement without Clean Speech

Deep neural network (DNN)-based speech enhancement ordinarily requires clean speech signals as the training target. However, collecting clean signals is very costly because they must be recorded in a studio. This requirement currently restricts the amount of training data for speech enhancement less than 1/1000 of that of speech recognition which does not need clean signals. Increasing the amount of training data is important for improving the performance, and hence the requirement of clean signals should be relaxed. In this paper, we propose a training strategy that does not require clean signals. The proposed method only utilizes noisy signals for training, which enables us to use a variety of speech signals in the wild. Our experimental results showed that the proposed method can achieve the performance similar to that of a DNN trained with clean signals.

Via

Access Paper or Ask Questions

AECMOS: A speech quality assessment metric for echo impairment

Oct 08, 2021
Marju Purin, Sten Sootla, Mateja Sponza, Ando Saabas, Ross Cutler

Figure 1 for AECMOS: A speech quality assessment metric for echo impairment

Figure 2 for AECMOS: A speech quality assessment metric for echo impairment

Figure 3 for AECMOS: A speech quality assessment metric for echo impairment

Figure 4 for AECMOS: A speech quality assessment metric for echo impairment

Traditionally, the quality of acoustic echo cancellers is evaluated using intrusive speech quality assessment measures such as ERLE \cite{g168} and PESQ \cite{p862}, or by carrying out subjective laboratory tests. Unfortunately, the former are not well correlated with human subjective measures, while the latter are time and resource consuming to carry out. We provide a new tool for speech quality assessment for echo impairment which can be used to evaluate the performance of acoustic echo cancellers. More precisely, we develop a neural network model to evaluate call quality degradations in two separate categories: echo and degradations from other sources. We show that our model is accurate as measured by correlation with human subjective quality ratings. Our tool can be used effectively to stack rank echo cancellation models. AECMOS is being made publicly available as an Azure service.

Via

Access Paper or Ask Questions

UWSpeech: Speech to Speech Translation for Unwritten Languages

Jun 14, 2020
Chen Zhang, Xu Tan, Yi Ren, Tao Qin, Kejun Zhang, Tie-Yan Liu

Figure 1 for UWSpeech: Speech to Speech Translation for Unwritten Languages

Figure 2 for UWSpeech: Speech to Speech Translation for Unwritten Languages

Figure 3 for UWSpeech: Speech to Speech Translation for Unwritten Languages

Figure 4 for UWSpeech: Speech to Speech Translation for Unwritten Languages

Existing speech to speech translation systems heavily rely on the text of target language: they usually translate source language either to target text and then synthesize target speech from text, or directly to target speech with target text for auxiliary training. However, those methods cannot be applied to unwritten target languages, which have no written text or phoneme available. In this paper, we develop a translation system for unwritten languages, named as UWSpeech, which converts target unwritten speech into discrete tokens with a converter, and then translates source-language speech into target discrete tokens with a translator, and finally synthesizes target speech from target discrete tokens with an inverter. We propose a method called XL-VAE, which enhances vector quantized variational autoencoder (VQ-VAE) with cross-lingual (XL) speech recognition, to train the converter and inverter of UWSpeech jointly. Experiments on Fisher Spanish-English conversation translation dataset show that UWSpeech outperforms direct translation and VQ-VAE baseline by about 16 and 10 BLEU points respectively, which demonstrate the advantages and potentials of UWSpeech.

Via

Access Paper or Ask Questions

Is Attention always needed? A Case Study on Language Identification from Speech

Oct 05, 2021
Atanu Mandal, Santanu Pal, Indranil Dutta, Mahidas Bhattacharya, Sudip Kumar Naskar

Figure 1 for Is Attention always needed? A Case Study on Language Identification from Speech

Figure 2 for Is Attention always needed? A Case Study on Language Identification from Speech

Figure 3 for Is Attention always needed? A Case Study on Language Identification from Speech

Figure 4 for Is Attention always needed? A Case Study on Language Identification from Speech

Language Identification (LID), a recommended initial step to Automatic Speech Recognition (ASR), is used to detect a spoken language from audio specimens. In state-of-the-art systems capable of multilingual speech processing, however, users have to explicitly set one or more languages before using them. LID, therefore, plays a very important role in situations where ASR based systems cannot parse the uttered language in multilingual contexts causing failure in speech recognition. We propose an attention based convolutional recurrent neural network (CRNN with Attention) that works on Mel-frequency Cepstral Coefficient (MFCC) features of audio specimens. Additionally, we reproduce some state-of-the-art approaches, namely Convolutional Neural Network (CNN) and Convolutional Recurrent Neural Network (CRNN), and compare them to our proposed method. We performed extensive evaluation on thirteen different Indian languages and our model achieves classification accuracy over 98%. Our LID model is robust to noise and provides 91.2% accuracy in a noisy scenario. The proposed model is easily extensible to new languages.

* Submitted to ACM Transactions on Asian and Low-Resource Language Information Processing

Via

Access Paper or Ask Questions

Enhancing audio quality for expressive Neural Text-to-Speech

Aug 13, 2021
Abdelhamid Ezzerg, Adam Gabrys, Bartosz Putrycz, Daniel Korzekwa, Daniel Saez-Trigueros, David McHardy, Kamil Pokora, Jakub Lachowicz, Jaime Lorenzo-Trueba, Viacheslav Klimkov

Figure 1 for Enhancing audio quality for expressive Neural Text-to-Speech

Figure 2 for Enhancing audio quality for expressive Neural Text-to-Speech

Figure 3 for Enhancing audio quality for expressive Neural Text-to-Speech

Figure 4 for Enhancing audio quality for expressive Neural Text-to-Speech

Artificial speech synthesis has made a great leap in terms of naturalness as recent Text-to-Speech (TTS) systems are capable of producing speech with similar quality to human recordings. However, not all speaking styles are easy to model: highly expressive voices are still challenging even to recent TTS architectures since there seems to be a trade-off between expressiveness in a generated audio and its signal quality. In this paper, we present a set of techniques that can be leveraged to enhance the signal quality of a highly-expressive voice without the use of additional data. The proposed techniques include: tuning the autoregressive loop's granularity during training; using Generative Adversarial Networks in acoustic modelling; and the use of Variational Auto-Encoders in both the acoustic model and the neural vocoder. We show that, when combined, these techniques greatly closed the gap in perceived naturalness between the baseline system and recordings by 39% in terms of MUSHRA scores for an expressive celebrity voice.

* 6 pages, 4 figures, 2 tables, SSW 2021

Via

Access Paper or Ask Questions