
"speech": models, code, and papers

Catplayinginthesnow: Impact of Prior Segmentation on a Model of Visually Grounded Speech

Jun 15, 2020
William N. Havard, Jean-Pierre Chevrot, Laurent Besacier

We investigate the effect of introducing phone, syllable, or word boundaries on the performance of a Model of Visually Grounded Speech, and compare the results with a model that does not use any boundary information and with a model that uses random boundaries. We present a simple way to inject such information into an RNN-based model and investigate which type of boundary enables a better mapping between an image and its spoken description. We also explore where, that is, at which level of the network's architecture, such information should be introduced. We show that a segmentation that yields syllable-like or word-like segments and respects word boundaries is the most efficient, and that linguistically informed subsampling outperforms random subsampling. Finally, we show that a hierarchical segmentation, which first uses a phone segmentation and then recomposes words from the phone units, yields better results than either a phone or a word segmentation in isolation.
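Below is a minimal PyTorch sketch of the kind of boundary-informed subsampling the abstract describes: an RNN runs over the acoustic frames and only the hidden states at segment boundaries are kept before pooling into an utterance embedding. The module structure, feature dimensions, and mean pooling are illustrative assumptions, not the authors' exact architecture.

import torch
import torch.nn as nn

class BoundarySubsamplingEncoder(nn.Module):
    # Hypothetical module: keeps only RNN states at segment boundaries.
    def __init__(self, feat_dim=39, hidden_dim=256, embed_dim=512):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, embed_dim)

    def forward(self, frames, boundaries):
        # frames: (T, feat_dim) acoustic features for one utterance
        # boundaries: frame indices that end a phone/syllable/word segment
        hidden, _ = self.rnn(frames.unsqueeze(0))   # (1, T, hidden_dim)
        segment_states = hidden[0, boundaries]      # linguistically informed subsampling
        utterance = segment_states.mean(dim=0)      # simple mean pooling (an assumption)
        return self.proj(utterance)                 # utterance embedding

# Example: 120 frames of 39-dim features, word boundaries at frames 30, 64, 119.
encoder = BoundarySubsamplingEncoder()
embedding = encoder(torch.randn(120, 39), torch.tensor([30, 64, 119]))
print(embedding.shape)  # torch.Size([512])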



The Marchex 2018 English Conversational Telephone Speech Recognition System

Nov 05, 2018
Seongjun Hahm, Iroro Orife, Shane Walker, Jason Flaks

In this paper, we describe recent improvements to the production Marchex speech recognition system for our spontaneous customer-to-business telephone conversations. We outline our semi-supervised lattice-free maximum mutual information (LF-MMI) training process which can supervise over full lattices from unlabeled audio. We also elaborate on production-scale text selection techniques for constructing very large conversational language models (LMs). On Marchex English (ME), a modern evaluation set of conversational North American English, for acoustic modeling we report a 3.3% ({agent, caller}:{3.2%, 3.6%}) reduction in absolute word error rate (WER). For language modeling, we observe a separate {1.3%, 1.2%} point reduction on {agent, caller} utterances respectively over the performance of the 2017 production system.

* 5 pages, 1 figure. Submitted to ICASSP 2019 


PyKaldi2: Yet another speech toolkit based on Kaldi and PyTorch

Jul 30, 2019
Liang Lu, Xiong Xiao, Zhuo Chen, Yifan Gong

We introduce PyKaldi2, a speech recognition toolkit implemented on top of Kaldi and PyTorch. While similar toolkits built on the two already exist, a key feature of PyKaldi2 is sequence training with criteria such as MMI, sMBR and MPE. In particular, we implemented the sequence training module with on-the-fly lattice generation during model training in order to simplify the training pipeline. To address the challenging acoustic environments in real applications, PyKaldi2 also supports on-the-fly noise and reverberation simulation to improve model robustness. With this feature, it is possible to backpropagate the gradients from the sequence-level loss to the front-end feature extraction module, which we hope can foster more research on joint front-end and back-end learning. We performed benchmark experiments on LibriSpeech and show that PyKaldi2 achieves reasonable recognition accuracy. The toolkit is released under the MIT license.

* 5 pages, 2 figures 
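The on-the-fly noise and reverberation simulation can be illustrated with a generic PyTorch sketch like the one below; the function names, SNR handling, and example tensors are assumptions for illustration and do not reflect PyKaldi2's actual API.

import torch
import torch.nn.functional as F

def add_noise(speech, noise, snr_db):
    # Scale the noise so the mixture has the requested signal-to-noise ratio.
    speech_power = speech.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp(min=1e-10)
    scale = torch.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise[: speech.numel()]

def add_reverb(speech, rir):
    # Convolve the waveform with a room impulse response, keeping the length.
    x = speech.view(1, 1, -1)
    kernel = rir.flip(-1).view(1, 1, -1)            # conv1d computes correlation
    out = F.conv1d(x, kernel, padding=kernel.shape[-1] - 1)
    return out.view(-1)[: speech.numel()]

# Random tensors stand in for real audio, a noise clip, and a measured RIR.
clean = torch.randn(16000)
noisy = add_noise(clean, torch.randn(16000), snr_db=10.0)
reverberant = add_reverb(noisy, torch.randn(4000))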


Enabling On-Device Training of Speech Recognition Models with Federated Dropout

Oct 07, 2021
Dhruv Guliani, Lillian Zhou, Changwan Ryu, Tien-Ju Yang, Harry Zhang, Yonghui Xiao, Francoise Beaufays, Giovanni Motta

Federated learning can be used to train machine learning models at the edge, on local data that never leaves devices, providing privacy by default. However, it raises a challenge related to the communication and computation costs of clients' devices. These costs are strongly correlated with the size of the model being trained, and are significant for state-of-the-art automatic speech recognition models. We propose using federated dropout to reduce the size of client models while training a full-size model server-side. We provide empirical evidence of the effectiveness of federated dropout, and propose a novel approach to vary the dropout rate applied at each layer. Furthermore, we find that federated dropout enables a set of smaller sub-models within the larger model to independently achieve low word error rates, making it easier to dynamically adjust the size of the model deployed for inference.

* © 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses 
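A toy sketch of federated dropout for a single fully connected layer follows: the server samples a sub-model (a subset of output units) for each client and merges the client's update back into the full matrix. The keep rate and helper names are illustrative assumptions, not the values or interfaces used in the paper.

import torch

def sample_submodel(weight, keep_rate):
    # weight: (out_features, in_features) full server-side matrix.
    n_keep = max(1, int(weight.shape[0] * keep_rate))
    kept = torch.randperm(weight.shape[0])[:n_keep]
    return kept, weight[kept].clone()               # rows sent to the client

def merge_update(weight, kept, client_weight):
    # Write the client's trained rows back into the full matrix.
    merged = weight.clone()
    merged[kept] = client_weight
    return merged

full = torch.randn(512, 256)
keep_rate = 0.75                                    # could be varied per layer
kept, sub = sample_submodel(full, keep_rate)
sub = sub + 0.01 * torch.randn_like(sub)            # stand-in for a local training update
full = merge_update(full, kept, sub)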


Sampling strategies in Siamese Networks for unsupervised speech representation learning

Aug 23, 2018
Rachid Riad, Corentin Dancette, Julien Karadayi, Neil Zeghidour, Thomas Schatz, Emmanuel Dupoux

Recent studies have investigated Siamese network architectures for learning invariant speech representations using same-different side information at the word level. Here we systematically investigate an often-ignored component of Siamese networks: the sampling procedure (how pairs of same vs. different tokens are selected). We show that sampling strategies taking into account Zipf's law, the distribution of speakers, and the proportions of same and different pairs of words significantly impact the performance of the network. In particular, we show that word-frequency compression improves learning across a large range of variations in the number of training pairs. This effect does not apply to the same extent in the fully unsupervised setting, where the pairs of same-different words are obtained by spoken term discovery. We apply these results to pairs of words discovered by an unsupervised algorithm and show an improvement over the state of the art in unsupervised representation learning using Siamese networks.

* Conference paper at Interspeech 2018 
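The frequency-compression idea can be sketched as follows: word types are drawn with probability proportional to freq ** alpha (alpha < 1 flattens the Zipfian distribution), and same/different token pairs are formed with a fixed proportion. All parameter values and the toy data below are illustrative assumptions, not the paper's settings.

import random
from collections import defaultdict

def sample_pairs(tokens, n_pairs, alpha=0.5, p_same=0.5):
    # tokens: list of (word_type, token_id) occurrences.
    by_type = defaultdict(list)
    for word, tok in tokens:
        by_type[word].append(tok)
    same_types = [w for w, toks in by_type.items() if len(toks) >= 2]
    weights = [len(by_type[w]) ** alpha for w in same_types]   # frequency compression
    pairs = []
    for _ in range(n_pairs):
        if random.random() < p_same:
            w = random.choices(same_types, weights=weights)[0]
            a, b = random.sample(by_type[w], 2)
            pairs.append((a, b, 1))                            # same-word pair
        else:
            w1, w2 = random.sample(list(by_type), 2)
            pairs.append((random.choice(by_type[w1]),
                          random.choice(by_type[w2]), 0))      # different-word pair
    return pairs

toy = [("cat", 0), ("cat", 1), ("cat", 2), ("dog", 3), ("dog", 4), ("sun", 5)]
print(sample_pairs(toy, 4))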


A Conformer Based Acoustic Model for Robust Automatic Speech Recognition

Mar 01, 2022
Yufeng Yang, Peidong Wang, DeLiang Wang

This study addresses robust automatic speech recognition (ASR) by introducing a Conformer-based acoustic model. The proposed model builds on a state-of-the-art recognition system using a bi-directional long short-term memory (BLSTM) model with utterance-wise dropout and iterative speaker adaptation, but employs a Conformer encoder instead of the BLSTM network. The Conformer encoder uses a convolution-augmented attention mechanism for acoustic modeling. The proposed system is evaluated on the monaural ASR task of the CHiME-4 corpus. Coupled with utterance-wise normalization and speaker adaptation, our model achieves $6.25\%$ word error rate, which outperforms the previous best system by $8.4\%$ relatively. In addition, the proposed Conformer-based model is $18.3\%$ smaller in model size and reduces training time by $88.5\%$.

* 5 pages, 1 figure 
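For readers unfamiliar with the architecture, a compact PyTorch sketch of one Conformer block (half-step feed-forward, self-attention, depthwise-convolution module, half-step feed-forward) is given below; the dimensions and kernel size are assumptions, not the paper's configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConformerBlock(nn.Module):
    def __init__(self, dim=256, heads=4, kernel=15, ff_mult=4):
        super().__init__()
        def ffn():
            return nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, ff_mult * dim),
                                 nn.SiLU(), nn.Linear(ff_mult * dim, dim))
        self.ff1, self.ff2 = ffn(), ffn()
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(dim)
        self.pointwise1 = nn.Conv1d(dim, 2 * dim, 1)
        self.depthwise = nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim)
        self.batch_norm = nn.BatchNorm1d(dim)
        self.pointwise2 = nn.Conv1d(dim, dim, 1)
        self.final_norm = nn.LayerNorm(dim)

    def forward(self, x):                           # x: (batch, time, dim)
        x = x + 0.5 * self.ff1(x)                   # first half-step feed-forward
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        c = self.conv_norm(x).transpose(1, 2)       # (batch, dim, time) for Conv1d
        c = F.glu(self.pointwise1(c), dim=1)
        c = self.pointwise2(F.silu(self.batch_norm(self.depthwise(c))))
        x = x + c.transpose(1, 2)                   # convolution module
        x = x + 0.5 * self.ff2(x)                   # second half-step feed-forward
        return self.final_norm(x)

block = ConformerBlock()
print(block(torch.randn(2, 100, 256)).shape)        # torch.Size([2, 100, 256])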


Class-Conditional Defense GAN Against End-to-End Speech Attacks

Oct 22, 2020
Mohammad Esmaeilpour, Patrick Cardinal, Alessandro Lameiras Koerich

In this paper we propose a novel defense approach against end-to-end adversarial attacks developed to fool advanced speech-to-text systems such as DeepSpeech and Lingvo. Unlike conventional defense approaches, the proposed approach does not directly employ low-level transformations, such as autoencoding a given input signal, to remove potential adversarial perturbations. Instead, we find an optimal input vector for a class-conditional generative adversarial network by minimizing the relative chordal distance adjustment between a given test input and the generator network. We then reconstruct the 1D signal from the synthesized spectrogram and the original phase information derived from the given input signal. Hence, this reconstruction does not add any extra noise to the signal and, according to our experimental results, our defense GAN considerably outperforms conventional defense algorithms both in terms of word error rate and sentence-level recognition accuracy.

* 5 pages 
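The reconstruction step can be sketched schematically: optimize the latent code of a class-conditional generator so its synthesized spectrogram matches the input, then rebuild the waveform with the input's original phase. In the sketch below the generator is a stand-in and a plain L2 spectrogram objective replaces the paper's chordal-distance criterion; both are illustrative assumptions, not the authors' implementation.

import torch

n_fft, hop = 512, 128
window = torch.hann_window(n_fft)
wave = torch.randn(16000)                                # stands in for a (possibly adversarial) input
spec = torch.stft(wave, n_fft, hop, window=window, return_complex=True)
magnitude, phase = spec.abs(), spec.angle()              # keep the original phase

class ToyGenerator(torch.nn.Module):                     # placeholder for a trained class-conditional GAN
    def __init__(self, latent_dim, n_classes, out_shape):
        super().__init__()
        self.out_shape = out_shape
        self.net = torch.nn.Linear(latent_dim + n_classes,
                                   out_shape[0] * out_shape[1])
    def forward(self, z, onehot):
        return self.net(torch.cat([z, onehot], dim=-1)).view(self.out_shape).abs()

generator = ToyGenerator(100, 10, magnitude.shape)
z = torch.zeros(100, requires_grad=True)
label = torch.nn.functional.one_hot(torch.tensor(3), 10).float()
optimizer = torch.optim.Adam([z], lr=0.05)
for _ in range(200):                                     # search the latent space
    optimizer.zero_grad()
    loss = (generator(z, label) - magnitude).pow(2).mean()   # L2 stand-in for the chordal distance
    loss.backward()
    optimizer.step()

# Reconstruct the 1-D signal from the synthesized magnitude and the original phase.
clean_spec = generator(z, label).detach() * torch.exp(1j * phase)
clean_wave = torch.istft(clean_spec, n_fft, hop, window=window, length=wave.numel())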


Neural Simultaneous Speech Translation Using Alignment-Based Chunking

May 29, 2020
Patrick Wilken, Tamer Alkhouli, Evgeny Matusov, Pavel Golik

In simultaneous machine translation, the objective is to determine when to produce a partial translation given a continuous stream of source words, with a trade-off between latency and quality. We propose a neural machine translation (NMT) model that dynamically decides when to continue consuming input and when to generate output words. The model is composed of two main components: one that dynamically decides on ending a source chunk, and another that translates the consumed chunk. We train the components jointly and in a manner consistent with the inference conditions. To generate chunked training data, we propose a method that utilizes word alignment while also preserving enough context. We compare models with bidirectional and unidirectional encoders of different depths, both on real speech and on text input. Our results on the IWSLT 2020 English-to-German task outperform a wait-k baseline by 2.6 to 3.7% BLEU absolute.

* IWSLT 2020 
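The role of word alignment in chunking can be illustrated with a small sketch: a target word may only be emitted once every source word it is aligned to has been read, so the required source position grows monotonically over the target sequence. The helper name and the toy alignment below are illustrative assumptions, not the authors' chunking procedure.

def reading_boundaries(alignment):
    # alignment: list over target positions, each a set of aligned source indices.
    boundaries = []
    needed = 0
    for tgt_links in alignment:
        if tgt_links:
            needed = max(needed, max(tgt_links))
        boundaries.append(needed)          # last source index that must be read first
    return boundaries

# Toy example with 5 source words: target word 0 aligns to source 0,
# target word 1 to source 2, target word 2 to sources 1 and 4.
print(reading_boundaries([{0}, {2}, {1, 4}]))   # [0, 2, 4]

Consecutive target words that share the same boundary value would be emitted together as one chunk.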


Transformer-Transducer: End-to-End Speech Recognition with Self-Attention

Oct 28, 2019
Ching-Feng Yeh, Jay Mahadeokar, Kaustubh Kalgaonkar, Yongqiang Wang, Duc Le, Mahaveer Jain, Kjell Schubert, Christian Fuegen, Michael L. Seltzer

We explore options for using Transformer networks in a neural transducer for end-to-end speech recognition. Transformer networks use self-attention for sequence modeling and come with advantages in parallel computation and in capturing contexts. We propose 1) using VGGNet with causal convolution to incorporate positional information and reduce the frame rate for efficient inference, and 2) using truncated self-attention to enable streaming for the Transformer and reduce computational complexity. All experiments are conducted on the public LibriSpeech corpus. The proposed Transformer-Transducer outperforms neural transducers with LSTM/BLSTM networks, achieving word error rates of 6.37% on the test-clean set and 15.30% on the test-other set, while remaining streamable, compact with 45.7M parameters for the entire system, and computationally efficient with a complexity of O(T), where T is the input sequence length.
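Truncated self-attention can be sketched as a banded attention mask in PyTorch: each frame attends only to a limited left and right context, which bounds the per-frame cost and allows streaming. The window sizes below are illustrative, not the paper's configuration.

import torch
import torch.nn as nn

def truncated_mask(length, left=10, right=2):
    idx = torch.arange(length)
    dist = idx.unsqueeze(0) - idx.unsqueeze(1)      # dist[q, k] = k - q
    allowed = (dist >= -left) & (dist <= right)
    return ~allowed                                 # True means "may not attend"

frames, dim = 50, 256
attention = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
x = torch.randn(1, frames, dim)
out, _ = attention(x, x, x, attn_mask=truncated_mask(frames))
print(out.shape)  # torch.Size([1, 50, 256])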


