"speech recognition": models, code, and papers

Algorithms for Speech Recognition and Language Processing

Sep 17, 1996
Mehryar Mohri, Michael Riley, Richard Sproat

Speech processing requires very efficient methods and algorithms. Finite-state transducers have been shown recently both to constitute a very useful abstract model and to lead to highly efficient time and space algorithms in this field. We present these methods and algorithms and illustrate them in the case of speech recognition. In addition to classical techniques, we describe many new algorithms such as minimization, global and local on-the-fly determinization of weighted automata, and efficient composition of transducers. These methods are currently used in large vocabulary speech recognition systems. We then show how the same formalism and algorithms can be used in text-to-speech applications and related areas of language processing such as morphology, syntax, and local grammars, in a very efficient way. The tutorial is self-contained and requires no specific computational or linguistic knowledge other than classical results.

* Postscript file tar-compressed and uuencoded, 189 pages

Enhanced Robot Speech Recognition Using Biomimetic Binaural Sound Source Localization

Feb 13, 2019
Jorge Davila-Chacon, Jindong Liu, Stefan Wermter

Inspired by the behavior of humans talking in noisy environments, we propose an embodied embedded cognition approach to improve automatic speech recognition (ASR) systems for robots in challenging environments, such as with ego noise, using binaural sound source localization (SSL). The approach is verified by measuring the impact of SSL with a humanoid robot head on the performance of an ASR system. More specifically, a robot orients itself toward the angle where the signal-to-noise ratio (SNR) of speech is maximized for one microphone before doing an ASR task. First, a spiking neural network inspired by the midbrain auditory system based on our previous work is applied to calculate the sound signal angle. Then, a feedforward neural network is used to handle high levels of ego noise and reverberation in the signal. Finally, the sound signal is fed into an ASR system. For ASR, we use a system developed by our group and compare its performance with and without the support from SSL. We test our SSL and ASR systems on two humanoid platforms with different structural and material properties. With our approach we halve the sentence error rate with respect to the common downmixing of both channels. Surprisingly, the ASR performance is more than two times better when the angle between the humanoid head and the sound source allows sound waves to be reflected most intensely from the pinna to the ear microphone, rather than when sound waves arrive perpendicularly to the membrane.

* IEEE Transactions on Neural Networks and Learning Systems (Volume: 30, Issue: 1, Jan. 2019)

data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language

Feb 07, 2022
Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli

While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind. To get us closer to general self-supervised learning, we present data2vec, a framework that uses the same learning method for either speech, NLP or computer vision. The core idea is to predict latent representations of the full input data based on a masked view of the input in a self-distillation setup using a standard Transformer architecture. Instead of predicting modality-specific targets such as words, visual tokens or units of human speech which are local in nature, data2vec predicts contextualized latent representations that contain information from the entire input. Experiments on the major benchmarks of speech recognition, image classification, and natural language understanding demonstrate a new state of the art or competitive performance to predominant approaches.
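The core idea in the abstract above (a student predicts, from a masked view, the representations a teacher computes from the full input) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the scalar "encoder", the names `encode`, `masked_regression_loss` and `ema_update`, and the use of an exponential moving average to update the teacher are not taken from the abstract, which specifies only masked prediction of contextualized latent targets in a self-distillation setup with a standard Transformer.

```python
def encode(seq, w):
    """Toy 'contextualized' encoder (hypothetical): each position mixes
    its own value with the mean of the whole sequence, scaled by w."""
    mean = sum(seq) / len(seq)
    return [w * (0.5 * v + 0.5 * mean) for v in seq]


def masked_regression_loss(seq, mask, w_student, w_teacher, mask_value=0.0):
    """Mean squared error between student predictions (masked view) and
    teacher targets (full input), computed on masked positions only.
    At least one position must be masked."""
    targets = encode(seq, w_teacher)  # teacher sees the full input
    masked = [mask_value if m else v for v, m in zip(seq, mask)]
    preds = encode(masked, w_student)  # student sees the masked view
    errs = [(p - t) ** 2 for p, t, m in zip(preds, targets, mask) if m]
    return sum(errs) / len(errs)


def ema_update(w_teacher, w_student, decay=0.99):
    """Assumed teacher update: exponential moving average of the student."""
    return decay * w_teacher + (1.0 - decay) * w_student


seq = [1.0, 2.0, 3.0, 4.0]
mask = [False, True, False, True]
loss = masked_regression_loss(seq, mask, w_student=1.0, w_teacher=1.0)
print(f"loss on masked positions: {loss:.4f}")
```

Because each target mixes in the sequence mean, it carries information from the entire input rather than from the masked position alone, which is the property the abstract contrasts with local, modality-specific targets.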

The Faults in our ASRs: An Overview of Attacks against Automatic Speech Recognition and Speaker Identification Systems

Jul 13, 2020
Hadi Abdullah, Kevin Warren, Vincent Bindschaedler, Nicolas Papernot, Patrick Traynor

Speech and speaker recognition systems are employed in a variety of applications, from personal assistants to telephony surveillance and biometric authentication. The wide deployment of these systems has been made possible by the improved accuracy of neural networks. Recent research has demonstrated that, like other systems based on neural networks, speech and speaker recognition systems are vulnerable to attacks using manipulated inputs. However, as we demonstrate in this paper, the end-to-end architecture of speech and speaker systems and the nature of their inputs make attacks and defenses against them substantially different from those in the image space. We demonstrate this first by systematizing existing research in this space and providing a taxonomy through which the community can evaluate future work. We then demonstrate experimentally that attacks against these models almost universally fail to transfer. In so doing, we argue that substantial additional work is required to provide adequate mitigations in this space.

Key-Sparse Transformer with Cascaded Cross-Attention Block for Multimodal Speech Emotion Recognition

Jun 22, 2021
Weidong Chen, Xiaofeng Xing, Xiangmin Xu, Jichen Yang, Jianxin Pang

Speech emotion recognition (SER) is a challenging and important research topic that plays a critical role in human-computer interaction. Multimodal inputs can improve performance because more emotional information is used for recognition. However, existing studies learn from all the information in a sample, although only a small portion of it relates to emotion. Moreover, under the multimodal framework, the interaction between different modalities is shallow and insufficient. In this paper, a key-sparse Transformer is proposed for efficient SER by focusing only on emotion-related information. Furthermore, a cascaded cross-attention block, which is specially designed for the multimodal framework, is introduced to achieve deep interaction between different modalities. The proposed method is evaluated on the IEMOCAP corpus, and the experimental results show that it gives better performance than the state-of-the-art approaches.

Quantitative phase and absorption contrast imaging

Mar 23, 2022
Miguel Moscoso, Alexei Novikov, George Papanicolaou, Chrysoula Tsogka

Phase retrieval in its most general form is the problem of reconstructing a complex-valued function from phaseless information of some transform of that function. This problem arises in various fields such as X-ray crystallography, electron microscopy, coherent diffractive imaging, astronomy, speech recognition, and quantum mechanics. The mathematical and computational analysis of these problems has a long history, and a variety of different algorithms have been proposed in the literature, the performance of which usually depends on the constraints imposed on the sought function and the number of measurements. In this paper, we present an algorithm for coherent diffractive imaging with phaseless measurements. The algorithm accounts for both coherent and incoherent wave propagation and allows for reconstructing absorption as well as phase images that quantify the attenuation and the refraction of the waves when they go through an object. The algorithm requires coherent or partially coherent illumination, and several detectors to record the intensity of the distorted wave that passes through the object under inspection. To obtain enough information for imaging, a series of masks is introduced between the source and the object to create a diversity of illumination patterns.

Cross Lingual Speech Emotion Recognition: Urdu vs. Western Languages

Dec 15, 2018
Siddique Latif, Adnan Qayyum, Muhammad Usman, Junaid Qadir

Cross-lingual speech emotion recognition is an important task for practical applications. The performance of automatic speech emotion recognition systems degrades in cross-corpus scenarios, particularly in scenarios involving multiple languages or a previously unseen language such as Urdu, for which limited or no data is available. In this study, we investigate the problem of cross-lingual emotion recognition for the Urdu language and contribute URDU, the first ever spontaneous Urdu-language speech emotion database. Evaluations are performed using three different Western languages against Urdu, and experimental results on different possible scenarios suggest various interesting aspects for designing more adaptive emotion recognition systems for such limited languages. The results show that selecting training instances from multiple languages can deliver results comparable to the baseline, and that augmenting the training data with a fraction of the testing-language data can help boost accuracy for speech emotion recognition. The URDU data is publicly available for further research.

* 6

Integrating HMM-Based Speech Recognition With Direct Manipulation In A Multimodal Korean Natural Language Interface

Nov 18, 1996
Geunbae Lee, Jong-Hyeok Lee, Sangeok Kim

This paper presents an HMM-based speech recognition engine and its integration into direct manipulation interfaces for a Korean document editor. Speech recognition can reduce the tedious and repetitive actions that are inevitable in standard GUIs (graphical user interfaces). Our system consists of a general speech recognition engine called ABrain (Auditory Brain) and a speech-commandable document editor called SHE (Simple Hearing Editor). ABrain is a phoneme-based speech recognition engine that achieves a discrete command recognition rate of up to 97%. SHE is a EuroBridge widget-based document editor that supports speech commands as well as direct manipulation interfaces.

* 6 pages, ps file, presented at ICMI '96 (Beijing)

The IBM 2015 English Conversational Telephone Speech Recognition System

May 21, 2015
George Saon, Hong-Kwang J. Kuo, Steven Rennie, Michael Picheny

We describe the latest improvements to the IBM English conversational telephone speech recognition system. Some of the techniques that were found beneficial are: maxout networks with annealed dropout rates; networks with a very large number of outputs trained on 2000 hours of data; joint modeling of partially unfolded recurrent neural networks and convolutional nets by combining the bottleneck and output layers and retraining the resulting model; and lastly, sophisticated language model rescoring with exponential and neural network LMs. These techniques result in an 8.0% word error rate on the Switchboard part of the Hub5-2000 evaluation test set which is 23% relative better than our previous best published result.

* Submitted to Interspeech 2015
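The "23% relative" figure in the abstract above can be unpacked with a short calculation. The sketch below assumes the usual definition of relative improvement, (old - new) / old, which the abstract does not spell out; the function name `implied_previous_wer` is hypothetical.

```python
def implied_previous_wer(new_wer, relative_gain):
    """Invert relative_gain = (old - new) / old to recover the old error rate.

    Assumes the standard definition of relative improvement.
    """
    return new_wer / (1.0 - relative_gain)


prev = implied_previous_wer(8.0, 0.23)
print(f"implied previous WER: {prev:.1f}%")  # prints "implied previous WER: 10.4%"
```

Under that assumption, an 8.0% word error rate that is 23% relative better corresponds to a previous result of roughly 10.4%, an absolute reduction of about 2.4 points.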

Light Gated Recurrent Units for Speech Recognition

Mar 26, 2018
Mirco Ravanelli, Philemon Brakel, Maurizio Omologo, Yoshua Bengio

A field that has directly benefited from the recent advances in deep learning is Automatic Speech Recognition (ASR). Despite the great achievements of the past decades, however, a natural and robust human-machine speech interaction still appears to be out of reach, especially in challenging environments characterized by significant noise and reverberation. To improve robustness, modern speech recognizers often employ acoustic models based on Recurrent Neural Networks (RNNs), that are naturally able to exploit large time contexts and long-term speech modulations. It is thus of great interest to continue the study of proper techniques for improving the effectiveness of RNNs in processing speech signals. In this paper, we revise one of the most popular RNN models, namely Gated Recurrent Units (GRUs), and propose a simplified architecture that turned out to be very effective for ASR. The contribution of this work is two-fold: First, we analyze the role played by the reset gate, showing that a significant redundancy with the update gate occurs. As a result, we propose to remove the former from the GRU design, leading to a more efficient and compact single-gate model. Second, we propose to replace hyperbolic tangent with ReLU activations. This variation couples well with batch normalization and could help the model learn long-term dependencies without numerical issues. Results show that the proposed architecture, called Light GRU (Li-GRU), not only reduces the per-epoch training time by more than 30% over a standard GRU, but also consistently improves the recognition accuracy across different tasks, input features, noisy conditions, as well as across different ASR paradigms, ranging from standard DNN-HMM speech recognizers to end-to-end CTC models.

* IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 2, no. 2, pp. 92-102, April 2018
* Copyright 2018 IEEE
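The two changes described in the abstract above (removing the reset gate and replacing the hyperbolic tangent with ReLU) can be sketched as a single recurrent step. This is a minimal illustration, not the authors' implementation: the function name `ligru_step`, the weight names, the toy dimensions, and the random initialization are all assumptions, and the batch normalization mentioned in the abstract is omitted to keep the sketch short.

```python
import math
import random


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def ligru_step(x, h_prev, Wz, Uz, Wh, Uh):
    """One step of a single-gate recurrent cell (hypothetical sketch).

    Only an update gate z is used (no reset gate), and the candidate
    state uses ReLU instead of tanh. Batch normalization is omitted.
    """
    z_in = [a + b for a, b in zip(matvec(Wz, x), matvec(Uz, h_prev))]
    z = [sigmoid(a) for a in z_in]
    c_in = [a + b for a, b in zip(matvec(Wh, x), matvec(Uh, h_prev))]
    cand = [max(0.0, a) for a in c_in]  # ReLU candidate state
    # Interpolate between the previous state and the candidate.
    return [zi * hp + (1.0 - zi) * ci for zi, hp, ci in zip(z, h_prev, cand)]


def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_in, n_hid = 3, 4  # toy sizes chosen for illustration
Wz, Wh = random_matrix(n_hid, n_in, rng), random_matrix(n_hid, n_in, rng)
Uz, Uh = random_matrix(n_hid, n_hid, rng), random_matrix(n_hid, n_hid, rng)

h = [0.0] * n_hid
for x in [[1.0, 0.5, -0.2], [0.0, 1.0, 0.3]]:
    h = ligru_step(x, h, Wz, Uz, Wh, Uh)
print(h)
```

Compared with a standard GRU step, this version drops one gate and its two weight matrices, which is consistent with the more compact single-gate model the abstract describes.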
