"speech": models, code, and papers

Adobe-MIT submission to the DSTC 4 Spoken Language Understanding pilot task

May 07, 2016
Franck Dernoncourt, Ji Young Lee, Trung H. Bui, Hung H. Bui

The Dialog State Tracking Challenge 4 (DSTC 4) proposes several pilot tasks. In this paper, we focus on the spoken language understanding pilot task, which consists of tagging a given utterance with speech acts and semantic slots. We compare different classifiers: the best system obtains 0.52 and 0.67 F1-scores on the test set for speech act recognition for the tourist and the guide respectively, and 0.52 F1-score for semantic tagging for both the guide and the tourist.

* Paper accepted at IWSDS 2016 


Model based neuro-fuzzy ASR on Texas processor

Sep 24, 2012
Hesam Ekhtiyar, Mehdi Sheida, Somaye Sobati Moghadam

This paper proposes an algorithm for recognizing speech. The recognized speech is used to execute related commands. The recognizer uses MFCC features and two kinds of classifiers: the first uses an MLP and the second uses a fuzzy inference system. The experimental results demonstrate the high gain and efficiency of the proposed algorithm. We implemented this system based on graphical design and tested it on a 600 MHz fixed-point digital signal processor (DSP), the Texas Instruments DM6437-EVM.


Recovering From Parser Failures: A Hybrid Statistical/Symbolic Approach

Jul 28, 1994
Carolyn Penstein Rosé, Alex Waibel

We describe an implementation of a hybrid statistical/symbolic approach to repairing parser failures in a speech-to-speech translation system. We describe a module which takes as input a fragmented parse and returns a repaired meaning representation. It negotiates with the speaker about what the complete meaning of the utterance is by generating hypotheses about how to fit the fragments of the partial parse together into a coherent meaning representation. By drawing upon both statistical and symbolic information, it constrains its repair hypotheses to those which are both likely and meaningful. Because it updates its statistical model during use, it improves its performance over time.


AI4D -- African Language Dataset Challenge

Jul 23, 2020
Kathleen Siminyu, Sackey Freshia, Jade Abbott, Vukosi Marivate

As language and speech technologies become more advanced, the lack of fundamental digital resources for African languages, such as data, spell checkers and Part of Speech taggers, means that the digital divide between these languages and others keeps growing. This work details the organisation of the AI4D - African Language Dataset Challenge, an effort to incentivize the creation, organization and discovery of African language datasets through a competitive challenge. We particularly encouraged the submission of annotated datasets which can be used for training task-specific supervised machine learning models.


On the Mutual Information between Source and Filter Contributions for Voice Pathology Detection

Jan 02, 2020
Thomas Drugman, Thomas Dubuisson, Thierry Dutoit

This paper addresses the problem of automatic detection of voice pathologies directly from the speech signal. For this, we investigate the use of the glottal source estimation as a means to detect voice disorders. Three sets of features are proposed, depending on whether they are related to the speech or the glottal signal, or to prosody. The relevance of these features is assessed through mutual information-based measures. This allows an intuitive interpretation in terms of discrimination power and redundancy between the features, independently of any subsequent classifier. We discuss which characteristics are particularly informative or complementary for detecting voice pathologies.
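
As a rough illustration of the kind of measure the abstract mentions, the sketch below estimates the mutual information (in bits) between a discrete feature and a class label from paired samples. It assumes features have already been quantized into discrete bins; the function name `mutual_information` and the toy data are hypothetical and are not taken from the paper, whose exact estimator is not described here.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    count_x = Counter(xs)         # marginal counts of x
    count_y = Counter(ys)         # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        p_x = count_x[x] / n
        p_y = count_y[y] / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi


# Toy data (hypothetical): a binary feature against a binary label.
labels = [0, 0, 1, 1]
informative_feature = [0, 0, 1, 1]    # determines the label exactly
uninformative_feature = [0, 1, 0, 1]  # independent of the label
```

A feature that fully determines a balanced binary label carries 1 bit of information about it, while an independent feature carries 0 bits. The same function applied to two features, rather than a feature and a label, gives a simple view of their redundancy.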


TristouNet: Triplet Loss for Speaker Turn Embedding

Apr 11, 2017
Hervé Bredin

TristouNet is a neural network architecture based on Long Short-Term Memory recurrent networks, meant to project speech sequences into a fixed-dimensional euclidean space. Thanks to the triplet loss paradigm used for training, the resulting sequence embeddings can be compared directly with the euclidean distance, for speaker comparison purposes. Experiments on short (between 500ms and 5s) speech turn comparison and speaker change detection show that TristouNet brings significant improvements over the current state-of-the-art techniques for both tasks.

* ICASSP 2017 (42nd IEEE International Conference on Acoustics, Speech and Signal Processing). Code available at 
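
The idea of comparing embeddings directly by Euclidean distance can be sketched with a common hinge-style triplet loss. This is a minimal sketch under stated assumptions: the margin value, the function names, and the two-dimensional toy embeddings are hypothetical, and the exact loss formulation used for TristouNet may differ from this generic form.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Generic hinge triplet loss (margin is an assumed hyperparameter).

    The loss is zero once the anchor is closer to the positive
    (same speaker) than to the negative (different speaker) by at
    least `margin`.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy embeddings (hypothetical): the first triplet is already well
# separated, the second violates the margin.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

Training pushes the loss on violating triplets toward zero, which is what lets a plain distance threshold serve for speaker comparison afterwards.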


Unsupervised Learning of Word-Category Guessing Rules

Apr 30, 1996
Andrei Mikheev

Words unknown to the lexicon present a substantial problem to part-of-speech tagging. In this paper we present a technique for fully unsupervised statistical acquisition of rules which guess possible parts-of-speech for unknown words. Three complementary sets of word-guessing rules are induced from the lexicon and a raw corpus: prefix morphological rules, suffix morphological rules and ending-guessing rules. Learning was performed on the Brown Corpus data; the resulting rule-sets achieved highly competitive performance and were compared with the state-of-the-art.

* 8 pages, LaTeX (aclap.sty for ACL-96); Proceedings of ACL-96 Santa Cruz, USA; also see cmp-lg/9604025 


Recognition Performance of a Structured Language Model

Jan 24, 2000
Ciprian Chelba, Frederick Jelinek

A new language model for speech recognition inspired by linguistic analysis is presented. The model develops hidden hierarchical structure incrementally and uses it to extract meaningful information from the word history - thus enabling the use of extended distance dependencies - in an attempt to complement the locality of currently used trigram models. The structured language model, its probabilistic parameterization and performance in a two-pass speech recognizer are presented. Experiments on the SWITCHBOARD corpus show an improvement in both perplexity and word error rate over conventional trigram models.

* Proceedings of Eurospeech, 1999, pp. 1567-1570, Budapest, Hungary 
* 4 pages 


Better Language Models with Model Merging

Apr 17, 1996
Thorsten Brants

This paper investigates model merging, a technique for deriving Markov models from text or speech corpora. Models are derived by starting with a large and specific model and by successively combining states to build smaller and more general models. We present methods to reduce the time complexity of the algorithm and report on experiments on deriving language models for a speech recognition task. The experiments show the advantage of model merging over the standard bigram approach. The merged model assigns a lower perplexity to the test set and uses considerably fewer states.

* LaTeX, 9 pages. In Proceedings of EMNLP-96, Philadelphia, PA 
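
For readers unfamiliar with the metric, the sketch below shows how test-set perplexity can be computed for a standard bigram model, the baseline the abstract compares against. It does not implement model merging. Add-one smoothing, the function names, and the toy corpus are assumptions made for illustration only.

```python
import math
from collections import Counter


def train_bigram(tokens):
    """Collect the counts needed for an add-one-smoothed bigram model."""
    history_counts = Counter(tokens[:-1])           # how often each word starts a bigram
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    vocab = set(tokens)
    return history_counts, bigram_counts, vocab


def perplexity(model, tokens):
    """Perplexity of the token sequence under the smoothed bigram model."""
    history_counts, bigram_counts, vocab = model
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to score a bigram")
    v = len(vocab)
    log_prob = 0.0
    n = 0
    for prev, cur in zip(tokens, tokens[1:]):
        # Add-one (Laplace) smoothing, an assumption of this sketch.
        p = (bigram_counts[(prev, cur)] + 1) / (history_counts[prev] + v)
        log_prob += math.log2(p)
        n += 1
    return 2 ** (-log_prob / n)


# Toy corpus (hypothetical).
model = train_bigram("a b a b a b".split())
pp_seen = perplexity(model, "a b a".split())    # bigrams observed in training
pp_unseen = perplexity(model, "a a a".split())  # bigram never observed
```

Lower perplexity means the model assigns higher probability to the test sequence, so the sequence made of observed bigrams scores lower than the one made of an unobserved bigram.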


Task splitting for DNN-based acoustic echo and noise removal

May 13, 2022
Sebastian Braun, Maria Luis Valero

Neural networks have led to tremendous performance gains for single-task speech enhancement, such as noise suppression and acoustic echo cancellation (AEC). In this work, we evaluate whether it is more useful to use a single joint module or separate modules to tackle these problems. We describe different possible implementations and give insights into their performance and efficiency. We show that using a separate echo cancellation module and a module for noise and residual echo removal results in less near-end speech distortion and better echo suppression, especially for double-talk.

* submitted to IWAENC 2022 
