Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

"speech recognition": models, code, and papers

Train your classifier first: Cascade Neural Networks Training from upper layers to lower layers

Feb 09, 2021
Shucong Zhang, Cong-Thanh Do, Rama Doddipatla, Erfan Loweimi, Peter Bell, Steve Renals

Figure 1 for Train your classifier first: Cascade Neural Networks Training from upper layers to lower layers

Figure 2 for Train your classifier first: Cascade Neural Networks Training from upper layers to lower layers

Figure 3 for Train your classifier first: Cascade Neural Networks Training from upper layers to lower layers

Figure 4 for Train your classifier first: Cascade Neural Networks Training from upper layers to lower layers

Although the lower layers of a deep neural network learn features which are transferable across datasets, these layers are not transferable within the same dataset. That is, in general, freezing the trained feature extractor (the lower layers) and retraining the classifier (the upper layers) on the same dataset leads to worse performance. In this paper, for the first time, we show that the frozen classifier is transferable within the same dataset. We develop a novel top-down training method which can be viewed as an algorithm for searching for high-quality classifiers. We tested this method on automatic speech recognition (ASR) tasks and language modelling tasks. The proposed method consistently improves recurrent neural network ASR models on Wall Street Journal, self-attention ASR models on Switchboard, and AWD-LSTM language models on WikiText-2.

* Accepted by ICASSP 2021

Via

Access Paper or Ask Questions

Sequential End-to-End Intent and Slot Label Classification and Localization

Jun 08, 2021
Yiran Cao, Nihal Potdar, Anderson R. Avila

Figure 1 for Sequential End-to-End Intent and Slot Label Classification and Localization

Figure 2 for Sequential End-to-End Intent and Slot Label Classification and Localization

Figure 3 for Sequential End-to-End Intent and Slot Label Classification and Localization

Figure 4 for Sequential End-to-End Intent and Slot Label Classification and Localization

Human-computer interaction (HCI) is significantly impacted by delayed responses from a spoken dialogue system. Hence, end-to-end (e2e) spoken language understanding (SLU) solutions have recently been proposed to decrease latency. Such approaches allow for the extraction of semantic information directly from the speech signal, thus bypassing the need for a transcript from an automatic speech recognition (ASR) system. In this paper, we propose a compact e2e SLU architecture for streaming scenarios, where chunks of the speech signal are processed continuously to predict intent and slot values. Our model is based on a 3D convolutional neural network (3D-CNN) and a unidirectional long short-term memory (LSTM). We compare the performance of two alignment-free losses: the connectionist temporal classification (CTC) method and its adapted version, namely connectionist temporal localization (CTL). The latter performs not only the classification but also localization of sequential audio events. The proposed solution is evaluated on the Fluent Speech Command dataset and results show our model ability to process incoming speech signal, reaching accuracy as high as 98.97 % for CTC and 98.78 % for CTL on single-label classification, and as high as 95.69 % for CTC and 95.28 % for CTL on two-label prediction.

* Accepted at Interspeech 2021

Via

Access Paper or Ask Questions

Exploring Targeted Universal Adversarial Perturbations to End-to-end ASR Models

Apr 06, 2021
Zhiyun Lu, Wei Han, Yu Zhang, Liangliang Cao

Figure 1 for Exploring Targeted Universal Adversarial Perturbations to End-to-end ASR Models

Figure 2 for Exploring Targeted Universal Adversarial Perturbations to End-to-end ASR Models

Figure 3 for Exploring Targeted Universal Adversarial Perturbations to End-to-end ASR Models

Figure 4 for Exploring Targeted Universal Adversarial Perturbations to End-to-end ASR Models

Although end-to-end automatic speech recognition (e2e ASR) models are widely deployed in many applications, there have been very few studies to understand models' robustness against adversarial perturbations. In this paper, we explore whether a targeted universal perturbation vector exists for e2e ASR models. Our goal is to find perturbations that can mislead the models to predict the given targeted transcript such as "thank you" or empty string on any input utterance. We study two different attacks, namely additive and prepending perturbations, and their performances on the state-of-the-art LAS, CTC and RNN-T models. We find that LAS is the most vulnerable to perturbations among the three models. RNN-T is more robust against additive perturbations, especially on long utterances. And CTC is robust against both additive and prepending perturbations. To attack RNN-T, we find prepending perturbation is more effective than the additive perturbation, and can mislead the models to predict the same short target on utterances of arbitrary length.

* Submitted to INTERSPEECH 2021

Via

Access Paper or Ask Questions

AI4D -- African Language Program

Apr 06, 2021
Kathleen Siminyu, Godson Kalipe, Davor Orlic, Jade Abbott, Vukosi Marivate, Sackey Freshia, Prateek Sibal, Bhanu Neupane, David I. Adelani, Amelia Taylor, Jamiil Toure ALI, Kevin Degila, Momboladji Balogoun, Thierno Ibrahima DIOP, Davis David, Chayma Fourati, Hatem Haddad, Malek Naski

Figure 1 for AI4D -- African Language Program

Figure 2 for AI4D -- African Language Program

Advances in speech and language technologies enable tools such as voice-search, text-to-speech, speech recognition and machine translation. These are however only available for high resource languages like English, French or Chinese. Without foundational digital resources for African languages, which are considered low-resource in the digital context, these advanced tools remain out of reach. This work details the AI4D - African Language Program, a 3-part project that 1) incentivised the crowd-sourcing, collection and curation of language datasets through an online quantitative and qualitative challenge, 2) supported research fellows for a period of 3-4 months to create datasets annotated for NLP tasks, and 3) hosted competitive Machine Learning challenges on the basis of these datasets. Key outcomes of the work so far include 1) the creation of 9+ open source, African language datasets annotated for a variety of ML tasks, and 2) the creation of baseline models for these datasets through hosting of competitive ML challenges.

Via

Access Paper or Ask Questions

A Review of Recent Advances of Binary Neural Networks for Edge Computing

Nov 24, 2020
Wenyu Zhao, Teli Ma, Xuan Gong, Baochang Zhang, David Doermann

Figure 1 for A Review of Recent Advances of Binary Neural Networks for Edge Computing

Figure 2 for A Review of Recent Advances of Binary Neural Networks for Edge Computing

Figure 3 for A Review of Recent Advances of Binary Neural Networks for Edge Computing

Figure 4 for A Review of Recent Advances of Binary Neural Networks for Edge Computing

Edge computing is promising to become one of the next hottest topics in artificial intelligence because it benefits various evolving domains such as real-time unmanned aerial systems, industrial applications, and the demand for privacy protection. This paper reviews recent advances on binary neural network (BNN) and 1-bit CNN technologies that are well suitable for front-end, edge-based computing. We introduce and summarize existing work and classify them based on gradient approximation, quantization, architecture, loss functions, optimization method, and binary neural architecture search. We also introduce applications in the areas of computer vision and speech recognition and discuss future applications for edge computing.

Via

Access Paper or Ask Questions

Semantic Distance: A New Metric for ASR Performance Analysis Towards Spoken Language Understanding

Apr 05, 2021
Suyoun Kim, Abhinav Arora, Duc Le, Ching-Feng Yeh, Christian Fuegen, Ozlem Kalinli, Michael L. Seltzer

Figure 1 for Semantic Distance: A New Metric for ASR Performance Analysis Towards Spoken Language Understanding

Figure 2 for Semantic Distance: A New Metric for ASR Performance Analysis Towards Spoken Language Understanding

Figure 3 for Semantic Distance: A New Metric for ASR Performance Analysis Towards Spoken Language Understanding

Figure 4 for Semantic Distance: A New Metric for ASR Performance Analysis Towards Spoken Language Understanding

Word Error Rate (WER) has been the predominant metric used to evaluate the performance of automatic speech recognition (ASR) systems. However, WER is sometimes not a good indicator for downstream Natural Language Understanding (NLU) tasks, such as intent recognition, slot filling, and semantic parsing in task-oriented dialog systems. This is because WER takes into consideration only literal correctness instead of semantic correctness, the latter of which is typically more important for these downstream tasks. In this study, we propose a novel Semantic Distance (SemDist) measure as an alternative evaluation metric for ASR systems to address this issue. We define SemDist as the distance between a reference and hypothesis pair in a sentence-level embedding space. To represent the reference and hypothesis as a sentence embedding, we exploit RoBERTa, a state-of-the-art pre-trained deep contextualized language model based on the transformer architecture. We demonstrate the effectiveness of our proposed metric on various downstream tasks, including intent recognition, semantic parsing, and named entity recognition.

* submitted to Interspeech 2021

Via

Access Paper or Ask Questions

Leveraging End-to-End ASR for Endangered Language Documentation: An Empirical Study on Yoloxóchitl Mixtec

Feb 26, 2021
Jiatong Shi. Jonathan D. Amith, Rey Castillo García, Esteban Guadalupe Sierra, Kevin Duh, Shinji Watanabe

Figure 1 for Leveraging End-to-End ASR for Endangered Language Documentation: An Empirical Study on Yoloxóchitl Mixtec

Figure 2 for Leveraging End-to-End ASR for Endangered Language Documentation: An Empirical Study on Yoloxóchitl Mixtec

Figure 3 for Leveraging End-to-End ASR for Endangered Language Documentation: An Empirical Study on Yoloxóchitl Mixtec

Figure 4 for Leveraging End-to-End ASR for Endangered Language Documentation: An Empirical Study on Yoloxóchitl Mixtec

"Transcription bottlenecks", created by a shortage of effective human transcribers are one of the main challenges to endangered language (EL) documentation. Automatic speech recognition (ASR) has been suggested as a tool to overcome such bottlenecks. Following this suggestion, we investigated the effectiveness for EL documentation of end-to-end ASR, which unlike Hidden Markov Model ASR systems, eschews linguistic resources but is instead more dependent on large-data settings. We open source a Yolox\'ochitl Mixtec EL corpus. First, we review our method in building an end-to-end ASR system in a way that would be reproducible by the ASR community. We then propose a novice transcription correction task and demonstrate how ASR systems and novice transcribers can work together to improve EL documentation. We believe this combinatory methodology would mitigate the transcription bottleneck and transcriber shortage that hinders EL documentation.

* Accepted by EACL2021

Via

Access Paper or Ask Questions

"I have vxxx bxx connexxxn!": Facing Packet Loss in Deep Speech Emotion Recognition

May 15, 2020
Mostafa M. Mohamed, Björn W. Schuller

Figure 1 for "I have vxxx bxx connexxxn!": Facing Packet Loss in Deep Speech Emotion Recognition

Figure 2 for "I have vxxx bxx connexxxn!": Facing Packet Loss in Deep Speech Emotion Recognition

Figure 3 for "I have vxxx bxx connexxxn!": Facing Packet Loss in Deep Speech Emotion Recognition

Figure 4 for "I have vxxx bxx connexxxn!": Facing Packet Loss in Deep Speech Emotion Recognition

In applications that use emotion recognition via speech, frame-loss can be a severe issue given manifold applications, where the audio stream loses some data frames, for a variety of reasons like low bandwidth. In this contribution, we investigate for the first time the effects of frame-loss on the performance of emotion recognition via speech. Reproducible extensive experiments are reported on the popular RECOLA corpus using a state-of-the-art end-to-end deep neural network, which mainly consists of convolution blocks and recurrent layers. A simple environment based on a Markov Chain model is used to model the loss mechanism based on two main parameters. We explore matched, mismatched, and multi-condition training settings. As one expects, the matched setting yields the best performance, while the mismatched yields the lowest. Furthermore, frame-loss as a data augmentation technique is introduced as a general-purpose strategy to overcome the effects of frame-loss. It can be used during training, and we observed it to produce models that are more robust against frame-loss in run-time environments.

* Submitted to INTERSPEECH 2020. 4 Pages + 1 page for references. 4 Figures and 2 Tables

Via

Access Paper or Ask Questions

Towards Abstractive Grounded Summarization of Podcast Transcripts

Mar 22, 2022
Kaiqiang Song, Chen Li, Xiaoyang Wang, Dong Yu, Fei Liu

Figure 1 for Towards Abstractive Grounded Summarization of Podcast Transcripts

Figure 2 for Towards Abstractive Grounded Summarization of Podcast Transcripts

Figure 3 for Towards Abstractive Grounded Summarization of Podcast Transcripts

Figure 4 for Towards Abstractive Grounded Summarization of Podcast Transcripts

Podcasts have recently shown a rapid rise in popularity. Summarization of podcast transcripts is of practical benefit to both content providers and consumers. It helps consumers to quickly decide whether they will listen to the podcasts and reduces the cognitive load of content providers to write summaries. Nevertheless, podcast summarization faces significant challenges including factual inconsistencies with respect to the inputs. The problem is exacerbated by speech disfluencies and recognition errors in transcripts of spoken language. In this paper, we explore a novel abstractive summarization method to alleviate these challenges. Specifically, our approach learns to produce an abstractive summary while grounding summary segments in specific portions of the transcript to allow for full inspection of summary details. We conduct a series of analyses of the proposed approach on a large podcast dataset and show that the approach can achieve promising results. Grounded summaries bring clear benefits in locating the summary and transcript segments that contain inconsistent information, and hence significantly improve summarization quality in both automatic and human evaluation metrics.

Via

Access Paper or Ask Questions

Sequential Routing Framework: Fully Capsule Network-based Speech Recognition

Jul 23, 2020
Kyungmin Lee, Hyunwhan Joe, Hyeontaek Lim, Kwangyoun Kim, Sungsoo Kim, Chang Woo Han, Hong-Gee Kim

Figure 1 for Sequential Routing Framework: Fully Capsule Network-based Speech Recognition

Figure 2 for Sequential Routing Framework: Fully Capsule Network-based Speech Recognition

Figure 3 for Sequential Routing Framework: Fully Capsule Network-based Speech Recognition

Figure 4 for Sequential Routing Framework: Fully Capsule Network-based Speech Recognition

Capsule networks (CapsNets) have recently gotten attention as alternatives for convolutional neural networks (CNNs) with their greater hierarchical representation capabilities. In this paper, we introduce the sequential routing framework (SRF) which we believe is the first method to adapt a CapsNet-only structure to sequence-to-sequence recognition. In SRF, input sequences are capsulized then sliced by the window size. Each sliced window is classified to a label at the corresponding time through iterative routing mechanisms. Afterwards, training losses are computed using connectionist temporal classification (CTC). During routing, two kinds of information, learnable weights and iteration outputs are shared across the slices. By sharing the information, the required parameter numbers can be controlled by the given window size regardless of the length of sequences. Moreover, the method can minimize decoding speed degradation caused by the routing iterations since it can operate in a non-iterative manner at inference time without dropping accuracy. We empirically proved the validity of our method by performing phoneme sequence recognition tasks on the TIMIT corpus. The proposed method attains an 82.6% phoneme recognition rate. It is 0.8% more accurate than that of CNN-based CTC networks and on par with that of recurrent neural network transducers (RNN-Ts). Even more, the method requires less than half the parameters compared to the two architectures.

* 40 pages, 7 figures (totally 10 figures), submitted to Computer Speech and Language (Only line numbers were removed from the submitted version)

Via

Access Paper or Ask Questions