
"speech": models, code, and papers

Speech Analysis for Automatic Mania Assessment in Bipolar Disorder

Feb 05, 2022
Pınar Baki, Heysem Kaya, Elvan Çiftçi, Hüseyin Güleç, Albert Ali Salah

Bipolar disorder is a mental disorder that causes periods of manic and depressive episodes. In this work, we classify recordings from the Bipolar Disorder corpus, which contains 7 different tasks, into hypomania, mania, and remission classes using only speech features. We perform our experiments on the individual tasks split from the interviews. The best result, achieved by a model trained on the 6th and 7th tasks together, is 0.53 UAR (unweighted average recall), which is higher than the baseline results of the corpus.
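
The UAR metric used here is the mean of per-class recalls, so minority classes such as mania weigh as much as remission regardless of sample counts. A minimal sketch (the class labels below are illustrative, not the corpus's actual annotations):

```python
from collections import defaultdict

def unweighted_average_recall(y_true, y_pred):
    """UAR: mean of per-class recalls, so every class counts
    equally regardless of how many samples it has."""
    totals = defaultdict(int)  # reference samples per class
    hits = defaultdict(int)    # correct predictions per class
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)

# illustrative labels for the three target classes
y_true = ["rem", "rem", "hypo", "mania", "mania", "mania"]
y_pred = ["rem", "hypo", "hypo", "mania", "rem", "mania"]
uar = unweighted_average_recall(y_true, y_pred)
```

Note that plain accuracy on the same example would reward a model that ignores rare classes; UAR does not, which is why it is the standard metric for imbalanced clinical corpora.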

* Conference, 5 pages, in Turkish language 


Does Simultaneous Speech Translation need Simultaneous Models?

Apr 20, 2022
Sara Papi, Marco Gaido, Matteo Negri, Marco Turchi

In simultaneous speech translation (SimulST), finding the best trade-off between high translation quality and low latency is a challenging task. To meet the latency constraints posed by the different application scenarios, multiple dedicated SimulST models are usually trained and maintained, generating high computational costs. In this paper, motivated by the increased social and environmental impact caused by these costs, we investigate whether a single model trained offline can serve not only the offline but also the simultaneous task without the need for any additional training or adaptation. Experiments on en->{de, es} indicate that, aside from facilitating the adoption of well-established offline techniques and architectures without affecting latency, the offline solution achieves similar or better translation quality compared to the same model trained in simultaneous settings, as well as being competitive with the SimulST state of the art.


Transformer-based language modeling and decoding for conversational speech recognition

Jan 04, 2020
Kareem Nassar

We propose a way to use a transformer-based language model in conversational speech recognition. Specifically, we focus on decoding efficiently in a weighted finite-state transducer framework. We showcase an approach to lattice re-scoring that allows for longer-range history to be captured by a transformer-based language model and takes advantage of the transformer's ability to avoid sequential computation.
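
The paper rescores WFST lattices; as a rough n-best analogue (the function names and interpolation weight here are hypothetical, not the paper's), rescoring recognizer hypotheses with an external language model can be sketched as:

```python
def rescore_nbest(hypotheses, lm_score, lm_weight=0.5):
    """Re-rank (text, decoder_score) pairs by adding a weighted
    external language-model log-score; higher is better."""
    rescored = [(text, score + lm_weight * lm_score(text))
                for text, score in hypotheses]
    return sorted(rescored, key=lambda h: h[1], reverse=True)

# toy stand-in LM: slightly penalises each word (illustration only;
# a real system would query a transformer LM here)
def toy_lm(text):
    return -0.1 * len(text.split())

nbest = [("i scream", -3.0), ("ice cream", -2.9)]
best_text, best_score = rescore_nbest(nbest, toy_lm)[0]
```

A transformer LM is attractive in this slot because, unlike an RNN, it can score all positions of a candidate sentence in one parallel forward pass rather than token by token.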


Language learning using Speech to Image retrieval

Sep 09, 2019
Danny Merkx, Stefan L. Frank, Mirjam Ernestus

Humans learn language by interaction with their environment and listening to other humans. It should also be possible for computational models to learn language directly from speech but so far most approaches require text. We improve on existing neural network approaches to create visually grounded embeddings for spoken utterances. Using a combination of a multi-layer GRU, importance sampling, cyclic learning rates, ensembling and vectorial self-attention our results show a remarkable increase in image-caption retrieval performance over previous work. Furthermore, we investigate which layers in the model learn to recognise words in the input. We find that deeper network layers are better at encoding word presence, although the final layer has slightly lower performance. This shows that our visually grounded sentence encoder learns to recognise words from the input even though it is not explicitly trained for word recognition.
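
Image-caption retrieval of this kind is typically scored with recall@k; a minimal version follows (in the paper's setting, the rankings themselves would come from similarity between the grounded caption and image embeddings):

```python
def recall_at_k(ranked_candidates, correct_ids, k):
    """Fraction of queries whose correct item appears in the
    top k of its ranked candidate list."""
    hits = sum(correct in ranked[:k]
               for ranked, correct in zip(ranked_candidates, correct_ids))
    return hits / len(correct_ids)

# two queries: the first retrieves its image at rank 1,
# the second only at rank 3
rankings = [[7, 2, 9], [4, 1, 8]]
targets = [7, 8]
r1 = recall_at_k(rankings, targets, k=1)
r3 = recall_at_k(rankings, targets, k=3)
```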

* Submitted to InterSpeech 2019 


Combining Knowledge Sources to Reorder N-Best Speech Hypothesis Lists

Jul 12, 1994
Manny Rayner, David Carter, Vassilios Digalakis, Patti Price

A simple and general method is described that can combine different knowledge sources to reorder N-best lists of hypotheses produced by a speech recognizer. The method is automatically trainable, acquiring information from both positive and negative examples. Experiments are described in which it was tested on a 1000-utterance sample of unseen ATIS data.
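
In this style of reranking, each hypothesis carries one score per knowledge source and a trained weight vector combines them linearly; the source names and weights below are made up for illustration:

```python
def combine_scores(source_scores, weights):
    """Linear combination of knowledge-source scores for one
    hypothesis. `source_scores` maps source name -> score."""
    return sum(weights[s] * v for s, v in source_scores.items())

def reorder_nbest(nbest, weights):
    """Reorder an N-best list by combined score, best first."""
    return sorted(nbest,
                  key=lambda h: combine_scores(h["scores"], weights),
                  reverse=True)

nbest = [
    {"text": "show me flights to boston",
     "scores": {"acoustic": -12.1, "ngram": -7.4, "parser": 1.0}},
    {"text": "show me flights to austin",
     "scores": {"acoustic": -11.8, "ngram": -8.9, "parser": 1.0}},
]
weights = {"acoustic": 1.0, "ngram": 0.8, "parser": 2.0}
best = reorder_nbest(nbest, weights)[0]["text"]
```

Training then amounts to fitting `weights` on positive (correct) and negative (incorrect) hypotheses, which is what makes the method automatically trainable.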

* 13 pages, Latex source. To appear in Proc. HLT '94 


Efficient Training of Neural Transducer for Speech Recognition

Apr 22, 2022
Wei Zhou, Wilfried Michel, Ralf Schlüter, Hermann Ney

As one of the most popular sequence-to-sequence modeling approaches for speech recognition, the RNN-Transducer has achieved steadily improving performance with more and more sophisticated neural network models of growing size and increasing numbers of training epochs. While strong computational resources seem to be a prerequisite for training superior models, we try to overcome this by carefully designing a more efficient training pipeline. In this work, we propose an efficient 3-stage progressive training pipeline to build high-performing neural transducer models from scratch with very limited computational resources in a reasonably short time period. The effectiveness of each stage is experimentally verified on both Librispeech and Switchboard corpora. The proposed pipeline is able to train transducer models approaching state-of-the-art performance with a single GPU in just 2-3 weeks. Our best conformer transducer achieves 4.1% WER on Librispeech test-other with only 35 epochs of training.

* submitted to Interspeech 2022 


Spell my name: keyword boosted speech recognition

Oct 06, 2021
Namkyu Jung, Geonmin Kim, Joon Son Chung

Recognition of uncommon words such as names and technical terminology is important to understanding conversations in context. However, the ability to recognise such words remains a challenge in modern automatic speech recognition (ASR) systems. In this paper, we propose a simple but powerful ASR decoding method that can better recognise these uncommon keywords, which in turn enables better readability of the results. The method boosts the probabilities of given keywords in a beam search based on acoustic model predictions. The method does not require any training in advance. We demonstrate the effectiveness of our method on the LibriSpeech test sets and also on internal data of real-world conversations. Our method significantly boosts keyword accuracy on the test sets, while maintaining the accuracy of the other words, as well as providing significant qualitative improvements. This method is applicable to other tasks such as machine translation, or wherever unseen and difficult keywords need to be recognised in beam search.
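
The core idea, adding a fixed bonus to keyword tokens' log-probabilities at each beam-search step, can be sketched as follows (word-level tokens and the boost value are illustrative simplifications; the paper operates on the decoder's actual token vocabulary):

```python
import math

def boost_step(token_logprobs, keywords, boost=2.0):
    """Add a fixed bonus to the log-probability of any token in
    the keyword list, before expanding the beam at this step."""
    return {tok: lp + (boost if tok in keywords else 0.0)
            for tok, lp in token_logprobs.items()}

# the acoustic model favours the common spelling "jon" over the
# rare name "john"; boosting the keyword flips the ranking
step = {"john": math.log(0.2), "jon": math.log(0.5), "gone": math.log(0.3)}
boosted = boost_step(step, keywords={"john"})
best = max(boosted, key=boosted.get)
```

Because the bonus is applied at decode time only, no retraining is needed and the keyword list can change per conversation.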


The Microsoft 2016 Conversational Speech Recognition System

Jan 25, 2017
W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, G. Zweig

We describe Microsoft's conversational speech recognition system, in which we combine recent developments in neural-network-based acoustic and language modeling to advance the state of the art on the Switchboard recognition task. Inspired by machine learning ensemble techniques, the system uses a range of convolutional and recurrent neural networks. I-vector modeling and lattice-free MMI training provide significant gains for all acoustic model architectures. Language model rescoring with multiple forward and backward running RNNLMs, and word posterior-based system combination provide a 20% boost. The best single system uses a ResNet architecture acoustic model with RNNLM rescoring, and achieves a word error rate of 6.9% on the NIST 2000 Switchboard task. The combined system has an error rate of 6.2%, representing an improvement over previously reported results on this benchmark task.
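
Word posterior-based combination aligns the systems' outputs and votes per word position; assuming outputs that are already aligned word-for-word (real systems first build confusion networks, e.g. ROVER-style), a toy version looks like this:

```python
from collections import Counter

def posterior_vote(aligned_hypotheses):
    """Majority vote per word slot over pre-aligned system
    outputs; a heavy simplification of word posterior-based
    combination, which would weight votes by posterior."""
    combined = []
    for slot in zip(*aligned_hypotheses):
        combined.append(Counter(slot).most_common(1)[0][0])
    return combined

systems = [
    ["the", "cat", "sat"],
    ["the", "bat", "sat"],
    ["the", "cat", "sad"],
]
consensus = posterior_vote(systems)
```

The intuition carries over to the full system: independent acoustic models make partly uncorrelated errors, so combining them lowers WER below any single system's.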

* Proc. IEEE ICASSP, 2017 


Speech Recognition for Endangered and Extinct Samoyedic languages

Dec 09, 2020
Niko Partanen, Mika Hämäläinen, Tiina Klooster

Our study presents a series of experiments on speech recognition with endangered and extinct Samoyedic languages, spoken in Northern and Southern Siberia. To the best of our knowledge, this is the first time a functional ASR system has been built for an extinct language. With the Kamas language we achieve a Label Error Rate of 15%, and conclude through careful error analysis that this quality is already very useful as a starting point for refined human transcriptions. Our results with the related Nganasan language are more modest, with the best model having an error rate of 33%. We show, however, through experiments in which the Kamas training data is enlarged incrementally, that the Nganasan results are in line with what is expected under the low-resource circumstances of the language. Based on this, we provide recommendations for scenarios in which further language documentation or archive processing activities could benefit from modern ASR technology. All training data and processing scripts have been published on Zenodo with clear licences to ensure further work on this important topic.
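
The reported Label Error Rate is the edit distance between predicted and reference label sequences, normalised by the reference length; a standard dynamic-programming implementation (the example sequences are illustrative):

```python
def label_error_rate(ref, hyp):
    """LER: Levenshtein distance between label sequences,
    normalised by the reference length."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n] / m

# one substitution over five reference labels -> LER 0.2
ler = label_error_rate(list("kamas"), list("kamaz"))
```

The same routine computes character or word error rate depending on whether the sequences are characters or words, which is how the 15% and 33% figures above can be reproduced from transcripts.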

* the 34th Pacific Asia Conference on Language, Information and Computation 
