
"speech": models, code, and papers

A Generative Model of a Pronunciation Lexicon for Hindi

May 06, 2017
Pramod Pandey, Somnath Roy

Voice browser applications in Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) systems crucially depend on a pronunciation lexicon. The present paper describes a model of the pronunciation lexicon of Hindi, developed to automatically generate the output forms of Hindi at two levels: the phonemic level and the PS level (PS, in short for Prosodic Structure). The latter level involves both syllable division and stress placement. The paper describes the tool developed for generating the two-level outputs of lexica in Hindi.
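The syllable-division step of the PS level can be illustrated with a toy maximal-onset syllabifier over romanized phoneme sequences. The vowel inventory, legal-onset list, and example word below are illustrative assumptions, not the paper's actual rules or tool:

```python
# Toy syllabifier using the maximal-onset principle on romanized
# phoneme sequences. The vowel inventory and legal onsets below are
# illustrative assumptions, not the lexicon model from the paper.
VOWELS = {"a", "aa", "i", "ii", "u", "uu", "e", "o"}
LEGAL_ONSETS = {(), ("k",), ("m",), ("l",), ("t",), ("p",), ("p", "r")}

def syllabify(phonemes):
    """Split a phoneme list into syllables, giving each vowel the
    longest legal consonant onset available before it."""
    vowel_idx = [i for i, p in enumerate(phonemes) if p in VOWELS]
    sylls, start = [], 0
    for n, v in enumerate(vowel_idx):
        if n + 1 < len(vowel_idx):
            nxt = vowel_idx[n + 1]
            cut = nxt  # default: all intervening consonants go to the coda
            # try the longest onset first, shrinking until one is legal
            for onset_len in range(nxt - v - 1, -1, -1):
                if tuple(phonemes[nxt - onset_len:nxt]) in LEGAL_ONSETS:
                    cut = nxt - onset_len
                    break
            sylls.append(phonemes[start:cut])
            start = cut
        else:
            sylls.append(phonemes[start:])
    return sylls
```

For example, `syllabify(["k", "a", "m", "a", "l"])` splits the word as ka.mal, since "m" forms a legal onset for the second syllable.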



📞👤⛵🐳👌; or "Call me Ishmael" - How do you translate emoji?

Nov 07, 2016
Will Radford, Andrew Chisholm, Ben Hachey, Bo Han

We report on an exploratory analysis of Emoji Dick, a project that leverages crowdsourcing to translate Melville's Moby Dick into emoji. This distinctive use of emoji removes textual context and leads to varying translation quality. In this paper, we use statistical word alignment and part-of-speech tagging to explore how people use emoji. Despite these simple methods, we observed differences in token and part-of-speech distributions. Experiments also suggest that semantics are preserved in the translation, and that repetition is more common in emoji.
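Statistical word alignment of the kind used here can be sketched with IBM Model 1 expectation-maximization on a toy word-emoji corpus. The sentence pairs below are invented for illustration, not drawn from Emoji Dick:

```python
from collections import defaultdict

# Minimal IBM Model 1 EM for word-emoji alignment on an invented toy corpus.
pairs = [
    (["call", "me"], ["📞", "👤"]),
    (["call", "the", "whale"], ["📞", "🐳"]),
    (["me", "and", "the", "whale"], ["👤", "🐳"]),
]

# t[word][emoji]: translation probability, initialized uniformly
emojis = {f for _, fs in pairs for f in fs}
t = defaultdict(lambda: defaultdict(lambda: 1.0 / len(emojis)))

for _ in range(10):  # EM iterations
    count = defaultdict(lambda: defaultdict(float))
    total = defaultdict(float)
    for words, ems in pairs:  # E-step: expected alignment counts
        for f in ems:
            norm = sum(t[e][f] for e in words)
            for e in words:
                c = t[e][f] / norm
                count[e][f] += c
                total[e] += c
    for e in count:  # M-step: renormalize
        for f in count[e]:
            t[e][f] = count[e][f] / total[e]

best = max(t["whale"], key=t["whale"].get)
```

After a few iterations the model concentrates probability mass on consistently co-occurring pairs, so `best` comes out as 🐳.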



LemmaTag: Jointly Tagging and Lemmatizing for Morphologically-Rich Languages with BRNNs

Aug 27, 2018
Daniel Kondratyuk, Tomáš Gavenčiak, Milan Straka, Jan Hajič

We present LemmaTag, a featureless neural network architecture that jointly generates part-of-speech tags and lemmas for sentences by using bidirectional RNNs with character-level and word-level embeddings. We demonstrate that both tasks benefit from sharing the encoding part of the network, predicting tag subcategories, and using the tagger output as an input to the lemmatizer. We evaluate our model across several languages with complex morphology; it surpasses state-of-the-art accuracy in both part-of-speech tagging and lemmatization in Czech, German, and Arabic.
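The paper's lemmatizer is a neural decoder conditioned on the tagger's output; purely as an illustration of why the predicted tag helps lemmatization, here is a toy (suffix, tag) rule lookup. The rules and tags are invented, not the paper's architecture:

```python
def lemmatize(word, tag, rules):
    """Apply the longest matching (suffix, tag) -> replacement rule.
    The predicted POS tag disambiguates otherwise identical forms."""
    for suf, rule_tag, repl in sorted(rules, key=lambda r: -len(r[0])):
        if tag == rule_tag and word.endswith(suf):
            return word[:len(word) - len(suf)] + repl
    return word

# Toy rules: the tag decides whether "leaves" is the noun or the verb.
rules = [
    ("ves", "NOUN", "f"),   # leaves -> leaf
    ("es", "VERB", "e"),    # leaves -> leave
    ("s", "NOUN", ""),      # cats -> cat
]
```

With the same surface form, `lemmatize("leaves", "NOUN", rules)` yields "leaf" while `lemmatize("leaves", "VERB", rules)` yields "leave", which is the intuition behind feeding tagger output into the lemmatizer.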

* 8 pages, 3 figures. Submitted to EMNLP 2018 


Adobe-MIT submission to the DSTC 4 Spoken Language Understanding pilot task

May 07, 2016
Franck Dernoncourt, Ji Young Lee, Trung H. Bui, Hung H. Bui

The Dialog State Tracking Challenge 4 (DSTC 4) proposes several pilot tasks. In this paper, we focus on the spoken language understanding pilot task, which consists of tagging a given utterance with speech acts and semantic slots. We compare different classifiers: the best system obtains 0.52 and 0.67 F1-scores on the test set for speech act recognition for the tourist and the guide respectively, and 0.52 F1-score for semantic tagging for both the guide and the tourist.
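The F1-scores quoted above combine precision and recall over predicted labels; a minimal sketch of set-based F1 for this kind of tagging task, with invented gold and predicted labels:

```python
def f1_score(gold, pred):
    """F1 over sets of predicted items (e.g. (slot, value) pairs)."""
    tp = len(gold & pred)  # true positives
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Invented example labels, not from the DSTC 4 data.
gold = {("speech_act", "INI"), ("slot", "area")}
pred = {("speech_act", "INI"), ("slot", "price")}
```

Here one of two predictions is correct and one of two gold items is found, so precision and recall are both 0.5 and the F1-score is 0.5.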

* Paper accepted at IWSDS 2016 


Model based neuro-fuzzy ASR on Texas processor

Sep 24, 2012
Hesam Ekhtiyar, Mehdi Sheida, Somaye Sobati Moghadam

This paper proposes an algorithm for speech recognition; the recognized speech is used to execute related commands. The system uses MFCC features and two kinds of classifiers: the first uses an MLP and the second a fuzzy inference system. The experimental results demonstrate the high gain and efficiency of the proposed algorithm. The system was implemented with a graphical design flow and tested on a 600 MHz fixed-point digital signal processor (DSP), the Texas Instruments DM6437-EVM reference board.
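An MFCC front end of the kind mentioned here begins with pre-emphasis and Hamming-windowed framing; a sketch of those first stages in pure Python. The frame and hop sizes assume 16 kHz audio (25 ms / 10 ms), and the FFT, mel filterbank, and DCT stages are omitted:

```python
import math

def frames_with_preemphasis(signal, frame_len=400, hop=160, alpha=0.97):
    """First stages of an MFCC front end: pre-emphasis, then
    overlapping Hamming-windowed frames (FFT/mel/DCT omitted)."""
    # pre-emphasis boosts high frequencies: y[n] = x[n] - alpha * x[n-1]
    emph = [signal[0]] + [signal[i] - alpha * signal[i - 1]
                          for i in range(1, len(signal))]
    hamming = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
               for n in range(frame_len)]
    out = []
    for start in range(0, len(emph) - frame_len + 1, hop):
        frame = emph[start:start + frame_len]
        out.append([x * w for x, w in zip(frame, hamming)])
    return out
```

On a 100 ms signal (1600 samples) this yields 8 overlapping frames of 400 samples each, which would then feed the spectral stages of the pipeline.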



Recovering From Parser Failures: A Hybrid Statistical/Symbolic Approach

Jul 28, 1994
Carolyn Penstein Rose', Alex Waibel

We describe an implementation of a hybrid statistical/symbolic approach to repairing parser failures in a speech-to-speech translation system. We describe a module which takes as input a fragmented parse and returns a repaired meaning representation. It negotiates with the speaker about what the complete meaning of the utterance is by generating hypotheses about how to fit the fragments of the partial parse together into a coherent meaning representation. By drawing upon both statistical and symbolic information, it constrains its repair hypotheses to those which are both likely and meaningful. Because it updates its statistical model during use, it improves its performance over time.



AI4D -- African Language Dataset Challenge

Jul 23, 2020
Kathleen Siminyu, Sackey Freshia, Jade Abbott, Vukosi Marivate

As language and speech technologies become more advanced, the lack of fundamental digital resources for African languages, such as data, spell checkers and Part of Speech taggers, means that the digital divide between these languages and others keeps growing. This work details the organisation of the AI4D - African Language Dataset Challenge, an effort to incentivize the creation, organization and discovery of African language datasets through a competitive challenge. We particularly encouraged the submission of annotated datasets which can be used for training task-specific supervised machine learning models.



On the Mutual Information between Source and Filter Contributions for Voice Pathology Detection

Jan 02, 2020
Thomas Drugman, Thomas Dubuisson, Thierry Dutoit

This paper addresses the problem of automatic detection of voice pathologies directly from the speech signal. For this, we investigate the use of glottal source estimation as a means to detect voice disorders. Three sets of features are proposed, depending on whether they are related to the speech signal, the glottal signal, or prosody. The relevance of these features is assessed through mutual information-based measures. This allows an intuitive interpretation in terms of discrimination power and redundancy between the features, independently of any subsequent classifier. We discuss which characteristics are informative or complementary for detecting voice pathologies.
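Mutual information-based relevance of this kind can be sketched for discrete variables; a pure-Python MI in bits on toy data. The paper's acoustic features are continuous and would first need discretization or density estimation:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length
    discrete sequences, from their empirical joint distribution."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())
```

A feature perfectly predictive of a binary pathology label gives 1 bit of mutual information, while an independent feature gives 0, which is the sense in which MI ranks discriminative power without committing to a classifier.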



TristouNet: Triplet Loss for Speaker Turn Embedding

Apr 11, 2017
Hervé Bredin

TristouNet is a neural network architecture based on Long Short-Term Memory recurrent networks, meant to project speech sequences into a fixed-dimensional Euclidean space. Thanks to the triplet loss paradigm used for training, the resulting sequence embeddings can be compared directly with the Euclidean distance, for speaker comparison purposes. Experiments on short (between 500ms and 5s) speech turn comparison and speaker change detection show that TristouNet brings significant improvements over the current state-of-the-art techniques for both tasks.
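The triplet loss paradigm mentioned above can be sketched directly on embedding vectors; the margin value below is an illustrative assumption, not the paper's setting:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on the distance gap: pull same-speaker embeddings
    together, push different-speaker embeddings at least
    `margin` further apart than the positive pair."""
    return max(0.0, euclidean(anchor, positive)
                    - euclidean(anchor, negative) + margin)
```

The loss is zero once the negative is further from the anchor than the positive by at least the margin, which is exactly the geometry that makes a plain Euclidean distance usable for speaker comparison afterwards.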

* ICASSP 2017 (42nd IEEE International Conference on Acoustics, Speech and Signal Processing). Code available at http://github.com/hbredin/TristouNet 


Unsupervised Learning of Word-Category Guessing Rules

Apr 30, 1996
Andrei Mikheev

Words unknown to the lexicon present a substantial problem to part-of-speech tagging. In this paper we present a technique for fully unsupervised statistical acquisition of rules which guess possible parts-of-speech for unknown words. Three complementary sets of word-guessing rules are induced from the lexicon and a raw corpus: prefix morphological rules, suffix morphological rules and ending-guessing rules. The learning was performed on the Brown Corpus data and rule-sets, with a highly competitive performance, were produced and compared with the state-of-the-art.
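Ending-guessing rule induction can be sketched in spirit: collect word endings whose lexicon entries agree on a POS set. The acceptance criterion and the tiny lexicon below are toy simplifications of the paper's statistical scoring:

```python
from collections import defaultdict

def induce_ending_rules(lexicon, max_len=3, min_count=2):
    """Keep endings (up to max_len chars) that occur at least
    min_count times and always predict the same POS set."""
    by_ending = defaultdict(list)
    for word, tags in lexicon:
        for k in range(1, max_len + 1):
            if len(word) > k:
                by_ending[word[-k:]].append(frozenset(tags))
    rules = {}
    for ending, tagsets in by_ending.items():
        if len(tagsets) >= min_count and len(set(tagsets)) == 1:
            rules[ending] = tagsets[0]
    return rules

# Invented mini-lexicon with Penn-style tags, for illustration only.
lexicon = [
    ("quickly", {"RB"}), ("slowly", {"RB"}),
    ("walked", {"VBD", "VBN"}), ("talked", {"VBD", "VBN"}),
]
rules = induce_ending_rules(lexicon)
```

An unknown word like "jumped" would then be guessed as {VBD, VBN} via its "-ed" ending, which is the basic mechanism behind ending-guessing rules.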

* 8 pages, LaTeX (aclap.sty for ACL-96); Proceedings of ACL-96 Santa Cruz, USA; also see cmp-lg/9604025 

