Automatic speech recognition (ASR) systems based on deep neural networks are easily attacked by adversarial examples due to the vulnerability of neural networks, a problem that has attracted much attention in recent years. Adversarial examples are harmful to ASR systems; in particular, if a command-dependent ASR system is misled, the consequences can be serious. To improve the robustness and security of ASR systems, defense methods against adversarial examples are needed. Based on this idea, we propose an algorithm for devastating and detecting adversarial examples that attack current advanced ASR systems. We choose advanced text-dependent and command-dependent ASR systems as our targets, generating adversarial examples with the OPT method for the text-dependent system and a GA-based algorithm for the command-dependent system. The main idea of our method is input transformation of the adversarial examples: noise of random intensities and kinds is added to the adversarial examples to devastate the perturbation previously added to the normal examples. The experimental results show that the method performs well: the similarity of the original speech before and after adding noise reaches 99.68%, the similarity of the adversarial examples drops to 0%, and the detection rate of the adversarial examples reaches 94%.
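As an illustration of the input-transformation idea described above, the sketch below adds random noise of a chosen kind and intensity to a waveform and flags the input as adversarial when the transcription changes sharply after the transformation. The `transcribe` callable, the similarity measure, and the threshold are placeholders for illustration, not the paper's exact configuration.

```python
import numpy as np
from difflib import SequenceMatcher

def add_random_noise(waveform, kind="gaussian", intensity=0.002, rng=None):
    """Corrupt the input with random noise to break a potential adversarial perturbation."""
    if rng is None:
        rng = np.random.default_rng()
    if kind == "gaussian":
        noise = rng.normal(0.0, intensity, size=waveform.shape)
    else:  # uniform noise as an alternative kind
        noise = rng.uniform(-intensity, intensity, size=waveform.shape)
    return np.clip(waveform + noise, -1.0, 1.0)

def detect_adversarial(waveform, transcribe, threshold=0.5, **noise_kwargs):
    """Flag the input as adversarial when its transcription changes sharply
    after the input transformation (noise addition)."""
    original = transcribe(waveform)
    noisy = transcribe(add_random_noise(waveform, **noise_kwargs))
    similarity = SequenceMatcher(None, original, noisy).ratio()
    return similarity < threshold, similarity
```

A benign utterance should keep a high transcription similarity after noise is added, whereas a carefully optimised adversarial perturbation tends to be destroyed, so its transcription (and thus the similarity) collapses.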
End-to-end models with auto-regressive decoders have shown impressive results for automatic speech recognition (ASR). These models formulate the sequence-level probability as a product of the conditional probabilities of all individual tokens given their histories. However, the performance of locally normalised models can be sub-optimal because of factors such as exposure bias. Consequently, the model distribution differs from the underlying data distribution. In this paper, the residual energy-based model (R-EBM) is proposed to complement the auto-regressive ASR model and close the gap between the two distributions. Meanwhile, R-EBMs can also be regarded as utterance-level confidence estimators, which may benefit many downstream tasks. Experiments on a 100-hour LibriSpeech dataset show that R-EBMs reduce the word error rates (WERs) by 8.2%/6.7% while improving the areas under precision-recall curves of confidence scores by 12.6%/28.4% on the test-clean/test-other sets. Furthermore, on a state-of-the-art model using self-supervised learning (wav2vec 2.0), R-EBMs still significantly improve both WER and confidence estimation performance.
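A minimal sketch of the residual idea, assuming pooled utterance-level features are available per hypothesis: a small energy head scores each hypothesis, the energy is subtracted from the autoregressive log-probability when rescoring, and the negated energy doubles as an utterance-level confidence. The network shape and the combination rule here are illustrative assumptions, not the paper's exact R-EBM.

```python
import torch
import torch.nn as nn

class ResidualEBM(nn.Module):
    """Tiny residual energy head: maps pooled utterance features for a
    (hypothesis, audio) pair to a scalar energy; lower energy = more plausible."""
    def __init__(self, feat_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, utt_features):              # (batch, feat_dim)
        return self.net(utt_features).squeeze(-1)  # energy E(x, y)

def rescore(ar_logprobs, energies):
    """Combine the autoregressive score with the residual energy:
    log p(y|x) ~ log p_AR(y|x) - E(x, y), up to a normalising constant."""
    return ar_logprobs - energies

def confidence(energies):
    """The negated energy through a sigmoid serves as an utterance-level confidence."""
    return torch.sigmoid(-energies)
```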
Long short-term memory (LSTM) is one of the most robust recurrent neural network architectures for learning sequential data. However, it requires considerable computational power to train and deploy, in both software and hardware. This paper proposes LiteLSTM, a novel architecture that reduces the LSTM's computation components through weight sharing, lowering the overall computational cost while maintaining performance. The proposed LiteLSTM is particularly relevant for processing large amounts of data where processing time is crucial and hardware resources are limited, such as securing IoT devices and processing medical data. The proposed model was evaluated empirically on three datasets from the computer vision, cybersecurity, and speech emotion recognition domains. LiteLSTM achieves accuracy comparable to other state-of-the-art recurrent architectures while using a smaller computation budget.
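The weight-sharing idea can be illustrated with a plain NumPy LSTM cell in which a single input matrix and a single recurrent matrix are reused by the input, forget, and output gates, so the shared pre-activation is computed once per step. The exact sharing scheme of LiteLSTM may differ; the sketch below is one possible variant under that assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SharedGateLSTMCell:
    """Illustrative weight-sharing LSTM cell: one input matrix W and one
    recurrent matrix U are shared by the input, forget, and output gates
    (only the biases differ), instead of separate matrices per gate."""
    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(hidden_dim)
        self.W = rng.uniform(-s, s, (hidden_dim, input_dim))    # shared input weights
        self.U = rng.uniform(-s, s, (hidden_dim, hidden_dim))   # shared recurrent weights
        self.Wc = rng.uniform(-s, s, (hidden_dim, input_dim))   # candidate-cell weights
        self.Uc = rng.uniform(-s, s, (hidden_dim, hidden_dim))
        self.b = {g: np.zeros(hidden_dim) for g in ("i", "f", "o", "c")}

    def step(self, x, h, c):
        shared = self.W @ x + self.U @ h          # computed once, reused by all gates
        i = sigmoid(shared + self.b["i"])
        f = sigmoid(shared + self.b["f"])
        o = sigmoid(shared + self.b["o"])
        c_hat = np.tanh(self.Wc @ x + self.Uc @ h + self.b["c"])
        c = f * c + i * c_hat
        h = o * np.tanh(c)
        return h, c
```

Compared with a standard LSTM, the gate-specific input and recurrent matrices collapse into one pair, which roughly halves the per-step matrix multiplications and the parameter count of the gating path.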
The unified streaming and non-streaming two-pass (U2) end-to-end model for speech recognition has shown great performance in terms of streaming capability, accuracy, real-time factor (RTF), and latency. In this paper, we present U2++, an enhanced version of U2 that further improves accuracy. The core idea of U2++ is to use the forward and backward information of the labeling sequences simultaneously during training to learn richer information, and to combine the forward and backward predictions at decoding time to give more accurate recognition results. We also propose a new data augmentation method called SpecSub to make the U2++ model more accurate and robust. Our experiments show that, compared with U2, U2++ converges faster during training, is more robust to the choice of decoding method, and achieves a consistent 5%-8% word error rate reduction over U2. On AISHELL-1, U2++ achieves a 4.63% character error rate (CER) with a non-streaming setup and 5.05% with a streaming setup at 320 ms latency. To the best of our knowledge, 5.05% is the best published streaming result on the AISHELL-1 test set.
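The combination of forward and backward decoder scores can be sketched as an attention-rescoring pass over CTC n-best hypotheses, as below. The scorer callables, the weights, and the hypothesis format are assumptions made for illustration; they are not taken from the U2++ implementation.

```python
def bidirectional_rescore(hyps, ctc_scores, l2r_scorer, r2l_scorer,
                          ctc_weight=0.5, reverse_weight=0.3):
    """Pick the best hypothesis by fusing CTC prefix scores with left-to-right
    and right-to-left attention-decoder scores, in the spirit of U2++.
    `l2r_scorer`/`r2l_scorer` return log p(y|x) for a token sequence; the
    right-to-left decoder consumes the reversed label sequence."""
    best, best_score = None, float("-inf")
    for hyp, ctc_score in zip(hyps, ctc_scores):
        l2r = l2r_scorer(hyp)
        r2l = r2l_scorer(list(reversed(hyp)))
        att = (1.0 - reverse_weight) * l2r + reverse_weight * r2l
        score = ctc_weight * ctc_score + (1.0 - ctc_weight) * att
        if score > best_score:
            best, best_score = hyp, score
    return best, best_score
```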
Although the UA-Speech and TORGO databases of control and dysarthric speech are invaluable resources made available to the research community with the objective of developing robust automatic speech recognition systems, they have also been used to validate a considerable number of automatic dysarthric speech classification approaches. Such approaches typically rely on the underlying assumption that recordings from control and dysarthric speakers are collected in the same noiseless environment using the same recording setup. In this paper, we show that this assumption is violated for the UA-Speech and TORGO databases. Using voice activity detection to extract speech and non-speech segments, we show that the majority of state-of-the-art dysarthria classification approaches achieve the same or a considerably better performance when using the non-speech segments of these databases than when using the speech segments. These results demonstrate that such approaches trained and validated on the UA-Speech and TORGO databases are potentially learning characteristics of the recording environment or setup rather than dysarthric speech characteristics. We hope that these results raise awareness in the research community about the importance of the quality of recordings when developing and evaluating automatic dysarthria classification approaches.
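A crude energy-based VAD is enough to reproduce the kind of speech/non-speech split used in this analysis. The sketch below, with placeholder frame sizes and an energy threshold rather than the study's actual VAD, returns the two portions of a recording so that a dysarthria classifier can be evaluated on each separately.

```python
import numpy as np

def energy_vad_split(waveform, sr, frame_ms=25, hop_ms=10, threshold_db=-35.0):
    """Split a mono waveform into concatenated speech and non-speech portions
    using a simple frame-energy threshold (illustrative stand-in for a VAD)."""
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    speech, nonspeech = [], []
    for start in range(0, max(len(waveform) - frame, 0) + 1, hop):
        seg = waveform[start:start + frame]
        rms_db = 20 * np.log10(np.sqrt(np.mean(seg ** 2)) + 1e-12)
        target = speech if rms_db > threshold_db else nonspeech
        target.append(waveform[start:start + hop])
    as_array = lambda chunks: np.concatenate(chunks) if chunks else np.array([])
    return as_array(speech), as_array(nonspeech)
```

Training and testing the same classifier once on the speech portion and once on the non-speech portion is the comparison the paper uses to expose recording-environment cues.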
Out-of-distribution (OOD) detection is concerned with identifying data points that do not belong to the same distribution as the model's training data. For the safe deployment of predictive models in a real-world environment, it is critical to avoid making confident predictions on OOD inputs, as this can lead to potentially dangerous consequences. However, OOD detection remains largely under-explored in the audio (and speech) domain, despite the fact that audio is a central modality for many tasks, such as speaker diarization, automatic speech recognition, and sound event detection. To address this, we propose to leverage the feature space of the model with deep k-nearest neighbors to detect OOD samples. We show that this simple and flexible method effectively detects OOD inputs across a broad range of audio (and speech) datasets. Specifically, it improves the false positive rate (FPR@TPR95) by 17% and the AUROC score by 7% compared with prior techniques.
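A minimal version of a deep k-NN detector can be written with scikit-learn: L2-normalise penultimate-layer features, index the in-distribution training features, score a test input by its distance to the k-th nearest neighbour, and pick the threshold so that 95% of in-distribution validation data is accepted. The value of k and the normalisation choice below are assumptions for illustration, not the paper's tuned settings.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

class DeepKNNDetector:
    """Deep k-NN OOD scoring: distance to the k-th nearest neighbour in the
    (L2-normalised) feature space of in-distribution training data."""
    def __init__(self, k=50):
        self.k = k
        self.index = NearestNeighbors(n_neighbors=k)
        self.threshold = None

    @staticmethod
    def _normalise(feats):
        return feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-12)

    def fit(self, train_feats, id_val_feats, tpr=0.95):
        self.index.fit(self._normalise(train_feats))
        val_scores = self.score(id_val_feats)
        # Threshold chosen so that `tpr` of in-distribution validation data is accepted.
        self.threshold = np.quantile(val_scores, tpr)
        return self

    def score(self, feats):
        dists, _ = self.index.kneighbors(self._normalise(feats))
        return dists[:, -1]            # distance to the k-th neighbour

    def is_ood(self, feats):
        return self.score(feats) > self.threshold
```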
Segmentation for continuous Automatic Speech Recognition (ASR) has traditionally used silence timeouts or voice activity detectors (VADs), both of which are limited to acoustic features. This segmentation is often overly aggressive, given that people naturally pause to think as they speak. Consequently, segmentation happens mid-sentence, hindering both punctuation and downstream tasks such as machine translation, for which high-quality segmentation is critical. Model-based segmentation methods that leverage acoustic features are powerful, but without an understanding of the language itself, these approaches are limited. We present a hybrid approach that leverages both acoustic and language information to improve segmentation. Furthermore, we show that including one word of look-ahead boosts segmentation quality. On average, our models improve the segmentation F0.5 score by 9.8% over the baseline. We show that this approach works for multiple languages. For the downstream task of machine translation, it improves the translation BLEU score by an average of 1.05 points.
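The hybrid idea can be illustrated by a simple fusion rule that combines a pause-duration cue, a language-model end-of-segment probability, and a one-word look-ahead. The weights, threshold, and features below are placeholders rather than the paper's trained model.

```python
def should_segment(pause_sec, p_eos_lm, next_word_starts_sentence,
                   pause_weight=0.4, lm_weight=0.5, lookahead_weight=0.1,
                   threshold=0.5):
    """Illustrative fusion of acoustic and language cues for segmentation:
    a long pause, a high LM end-of-segment probability, and a one-word
    look-ahead (does the next word plausibly start a new sentence?) jointly
    decide whether to place a boundary."""
    pause_score = min(pause_sec / 1.0, 1.0)            # saturate at a 1 s pause
    lookahead_score = 1.0 if next_word_starts_sentence else 0.0
    score = (pause_weight * pause_score
             + lm_weight * p_eos_lm
             + lookahead_weight * lookahead_score)
    return score > threshold
```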
Automatic assessment of dysarthric speech is essential for sustained treatment and rehabilitation. However, obtaining atypical speech is challenging, often leading to data scarcity issues. To tackle this problem, we propose a novel automatic severity assessment method for dysarthric speech that uses a self-supervised model in conjunction with multi-task learning. Wav2vec 2.0 XLS-R is jointly trained for two different tasks: severity level classification and an auxiliary automatic speech recognition (ASR) task. For the baseline experiments, we employ hand-crafted features such as eGeMAPS and linguistic features, together with SVM, MLP, and XGBoost classifiers. Evaluated on the Korean dysarthric speech QoLT database, our model outperforms the traditional baseline methods, with a 4.79% relative improvement in classification accuracy. In addition, the proposed model surpasses the model trained without the ASR head, achieving a 10.09% relative improvement. Furthermore, we show how multi-task learning affects severity classification performance by analyzing the latent representations and the regularization effect.
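A compact sketch of the multi-task setup, assuming a generic speech encoder (standing in for the wav2vec 2.0 XLS-R backbone) that returns frame-level features: a pooled classification head predicts the severity level while a CTC head provides the auxiliary ASR loss, and the two losses are combined with a weight chosen here purely for illustration.

```python
import torch
import torch.nn as nn

class SeverityPlusASR(nn.Module):
    """Multi-task sketch: a shared encoder feeds a severity-classification head
    and an auxiliary CTC-based ASR head; the joint loss regularises the encoder."""
    def __init__(self, encoder, feat_dim, n_severity_levels, vocab_size):
        super().__init__()
        self.encoder = encoder                            # returns (B, T, feat_dim)
        self.cls_head = nn.Linear(feat_dim, n_severity_levels)
        self.asr_head = nn.Linear(feat_dim, vocab_size)   # per-frame CTC logits
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, audio, severity, tokens, frame_lens, token_lens, asr_weight=0.3):
        feats = self.encoder(audio)                       # (B, T, D)
        cls_logits = self.cls_head(feats.mean(dim=1))     # mean-pool over time
        cls_loss = nn.functional.cross_entropy(cls_logits, severity)
        log_probs = self.asr_head(feats).log_softmax(-1).transpose(0, 1)  # (T, B, V)
        asr_loss = self.ctc(log_probs, tokens, frame_lens, token_lens)
        return cls_loss + asr_weight * asr_loss, cls_logits
```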
Punctuation and segmentation are key to readability in Automatic Speech Recognition (ASR) output, yet they are often evaluated using F1 scores, which require high-quality human transcripts and do not reflect readability well. Human evaluation is expensive, time-consuming, and suffers from large inter-observer variability, especially for conversational speech devoid of strict grammatical structure. Large pre-trained models capture a notion of grammatical structure. We present TRScore, a novel readability measure that uses the GPT model to evaluate different segmentation and punctuation systems. We validate our approach with human experts. Additionally, our approach enables quantitative assessment of the effect of text post-processing techniques such as capitalization, inverse text normalization (ITN), and disfluency removal on overall readability, which traditional word error rate (WER) and slot error rate (SER) metrics fail to capture. TRScore is strongly correlated with traditional F1 and human readability scores, with Pearson's correlation coefficients of 0.67 and 0.98, respectively. It also eliminates the need for human transcriptions for model selection.
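The abstract does not spell out the scoring recipe, so the snippet below only illustrates the general idea of model-based readability scoring: perplexity of the post-processed transcript under a GPT language model, computed with the Hugging Face transformers API. This proxy is an assumption for illustration, not the TRScore formulation itself.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def gpt_readability_proxy(transcript: str) -> float:
    """Perplexity under a GPT LM as a crude readability proxy for a punctuated,
    segmented transcript (lower = more fluent)."""
    ids = tokenizer(transcript, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss          # mean per-token negative log-likelihood
    return torch.exp(loss).item()

# Compare two post-processing variants of the same ASR output.
print(gpt_readability_proxy("We met at noon. Then we left for the station."))
print(gpt_readability_proxy("we met at noon then we left for the station"))
```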
Streaming voice conversion (VC) is the task of converting one person's voice to another's in real time. Previous streaming VC methods use phonetic posteriorgrams (PPGs) extracted from automatic speech recognition (ASR) systems to represent speaker-independent information. However, PPGs lack the prosody and vocalization information of the source speaker, and streaming PPGs contain undesired leaked timbre of the source speaker. In this paper, we propose to use intermediate bottleneck features (IBFs) to replace PPGs. VC systems trained with IBFs retain more prosody and vocalization information of the source speaker. Furthermore, we propose a non-streaming teacher guidance (TG) framework that addresses the timbre leakage problem. Experiments show that our proposed IBFs and the TG framework achieve a state-of-the-art streaming VC naturalness of 3.85, a content consistency of 3.77, and a timbre similarity of 3.77 under a future receptive field of 160 ms, significantly outperforming previous streaming VC systems.
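Extracting intermediate bottleneck features instead of final posteriorgrams amounts to reading out a mid-level encoder layer. The sketch below does this with a forward hook, assuming a generic Transformer ASR encoder whose blocks live in `encoder.layers`; the attribute name and layer index are assumptions, not the paper's model.

```python
import torch

def extract_intermediate_features(encoder, waveform, layer_index):
    """Capture the output of an intermediate ASR-encoder layer (the 'bottleneck'
    features) via a forward hook, instead of the final phonetic posteriorgram.
    Assumes `encoder.layers` is an nn.ModuleList of Transformer blocks."""
    captured = {}

    def hook(_module, _inputs, output):
        captured["ibf"] = output

    handle = encoder.layers[layer_index].register_forward_hook(hook)
    with torch.no_grad():
        encoder(waveform)
    handle.remove()
    return captured["ibf"]      # (batch, time, dim) intermediate features
```

Because such mid-level representations are taken before the model has fully discarded speaker-dependent detail, they preserve more prosody and vocalization cues than posteriorgrams, which is the property the IBF-based VC system relies on.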