The Dialog State Tracking Challenge 4 (DSTC 4) proposes several pilot tasks. In this paper, we focus on the spoken language understanding pilot task, which consists of tagging a given utterance with speech acts and semantic slots. We compare different classifiers: the best system obtains 0.52 and 0.67 F1-scores on the test set for speech act recognition for the tourist and the guide respectively, and 0.52 F1-score for semantic tagging for both the guide and the tourist.
In this paper an algorithm for recognizing speech has been proposed. The recognized speech is used to execute related commands which use the MFCC and two kind of classifiers, first one uses MLP and second one uses fuzzy inference system as a classifier. The experimental results demonstrate the high gain and efficiency of the proposed algorithm. We have implemented this system based on graphical design and tested on a fix point digital signal processor (DSP) of 600 MHz, with reference DM6437-EVM of Texas instrument.
We describe an implementation of a hybrid statistical/symbolic approach to repairing parser failures in a speech-to-speech translation system. We describe a module which takes as input a fragmented parse and returns a repaired meaning representation. It negotiates with the speaker about what the complete meaning of the utterance is by generating hypotheses about how to fit the fragments of the partial parse together into a coherent meaning representation. By drawing upon both statistical and symbolic information, it constrains its repair hypotheses to those which are both likely and meaningful. Because it updates its statistical model during use, it improves its performance over time.
As language and speech technologies become more advanced, the lack of fundamental digital resources for African languages, such as data, spell checkers and Part of Speech taggers, means that the digital divide between these languages and others keeps growing. This work details the organisation of the AI4D - African Language Dataset Challenge, an effort to incentivize the creation, organization and discovery of African language datasets through a competitive challenge. We particularly encouraged the submission of annotated datasets which can be used for training task-specific supervised machine learning models.
This paper addresses the problem of automatic detection of voice pathologies directly from the speech signal. For this, we investigate the use of the glottal source estimation as a means to detect voice disorders. Three sets of features are proposed, depending on whether they are related to the speech or the glottal signal, or to prosody. The relevancy of these features is assessed through mutual information-based measures. This allows an intuitive interpretation in terms of discrimation power and redundancy between the features, independently of any subsequent classifier. It is discussed which characteristics are interestingly informative or complementary for detecting voice pathologies.
TristouNet is a neural network architecture based on Long Short-Term Memory recurrent networks, meant to project speech sequences into a fixed-dimensional euclidean space. Thanks to the triplet loss paradigm used for training, the resulting sequence embeddings can be compared directly with the euclidean distance, for speaker comparison purposes. Experiments on short (between 500ms and 5s) speech turn comparison and speaker change detection show that TristouNet brings significant improvements over the current state-of-the-art techniques for both tasks.
Words unknown to the lexicon present a substantial problem to part-of-speech tagging. In this paper we present a technique for fully unsupervised statistical acquisition of rules which guess possible parts-of-speech for unknown words. Three complementary sets of word-guessing rules are induced from the lexicon and a raw corpus: prefix morphological rules, suffix morphological rules and ending-guessing rules. The learning was performed on the Brown Corpus data and rule-sets, with a highly competitive performance, were produced and compared with the state-of-the-art.
A new language model for speech recognition inspired by linguistic analysis is presented. The model develops hidden hierarchical structure incrementally and uses it to extract meaningful information from the word history - thus enabling the use of extended distance dependencies - in an attempt to complement the locality of currently used trigram models. The structured language model, its probabilistic parameterization and performance in a two-pass speech recognizer are presented. Experiments on the SWITCHBOARD corpus show an improvement in both perplexity and word error rate over conventional trigram models.
This paper investigates model merging, a technique for deriving Markov models from text or speech corpora. Models are derived by starting with a large and specific model and by successively combining states to build smaller and more general models. We present methods to reduce the time complexity of the algorithm and report on experiments on deriving language models for a speech recognition task. The experiments show the advantage of model merging over the standard bigram approach. The merged model assigns a lower perplexity to the test set and uses considerably fewer states.
Neural networks have led to tremendous performance gains for single-task speech enhancement, such as noise suppression and acoustic echo cancellation (AEC). In this work, we evaluate whether it is more useful to use a single joint or separate modules to tackle these problems. We describe different possible implementations and give insights into their performance and efficiency. We show that using a separate echo cancellation module and a module for noise and residual echo removal results in less near-end speech distortion and better echo suppression, especially for double-talk.