Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"speech": models, code, and papers

From Gameplay to Symbolic Reasoning: Learning SAT Solver Heuristics in the Style of Alpha(Go) Zero

Feb 14, 2018
Fei Wang, Tiark Rompf

Despite the recent successes of deep neural networks in various fields such as image and speech recognition, natural language processing, and reinforcement learning, we still face big challenges in bringing the power of numeric optimization to symbolic reasoning. Researchers have proposed different avenues such as neural machine translation for proof synthesis, vectorization of symbols and expressions for representing symbolic patterns, and coupling of neural back-ends for dimensionality reduction with symbolic front-ends for decision making. However, these initial explorations are still only point solutions, and bear other shortcomings such as lack of correctness guarantees. In this paper, we present our approach of casting symbolic reasoning as games, and directly harnessing the power of deep reinforcement learning in the style of Alpha(Go) Zero on symbolic problems. Using the Boolean Satisfiability (SAT) problem as showcase, we demonstrate the feasibility of our method, and the advantages of modularity, efficiency, and correctness guarantees.

  Access Paper or Ask Questions

Automatic Measurement of Pre-aspiration

Jun 15, 2017
Yaniv Sheena, Míša Hejná, Yossi Adi, Joseph Keshet

Pre-aspiration is defined as the period of glottal friction occurring in sequences of vocalic/consonantal sonorants and phonetically voiceless obstruents. We propose two machine learning methods for automatic measurement of pre-aspiration duration: a feedforward neural network, which works at the frame level; and a structured prediction model, which relies on manually designed feature functions, and works at the segment level. The input for both algorithms is a speech signal of an arbitrary length containing a single obstruent, and the output is a pair of times which constitutes the pre-aspiration boundaries. We train both models on a set of manually annotated examples. Results suggest that the structured model is superior to the frame-based model as it yields higher accuracy in predicting the boundaries and generalizes to new speakers and new languages. Finally, we demonstrate the applicability of our structured prediction algorithm by replicating linguistic analysis of pre-aspiration in Aberystwyth English with high correlation.

  Access Paper or Ask Questions

Ask Me Anything: Dynamic Memory Networks for Natural Language Processing

Mar 05, 2016
Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, Richard Socher

Most tasks in natural language processing can be cast into question answering (QA) problems over language input. We introduce the dynamic memory network (DMN), a neural network architecture which processes input sequences and questions, forms episodic memories, and generates relevant answers. Questions trigger an iterative attention process which allows the model to condition its attention on the inputs and the result of previous iterations. These results are then reasoned over in a hierarchical recurrent sequence model to generate answers. The DMN can be trained end-to-end and obtains state-of-the-art results on several types of tasks and datasets: question answering (Facebook's bAbI dataset), text classification for sentiment analysis (Stanford Sentiment Treebank) and sequence modeling for part-of-speech tagging (WSJ-PTB). The training for these different tasks relies exclusively on trained word vector representations and input-question-answer triplets.

  Access Paper or Ask Questions

Corpus analysis without prior linguistic knowledge - unsupervised mining of phrases and subphrase structure

Feb 18, 2016
Stefan Gerdjikov, Klaus U. Schulz

When looking at the structure of natural language, "phrases" and "words" are central notions. We consider the problem of identifying such "meaningful subparts" of language of any length and underlying composition principles in a completely corpus-based and language-independent way without using any kind of prior linguistic knowledge. Unsupervised methods for identifying "phrases", mining subphrase structure and finding words in a fully automated way are described. This can be considered as a step towards automatically computing a "general dictionary and grammar of the corpus". We hope that in the long run variants of our approach turn out to be useful for other kind of sequence data as well, such as, e.g., speech, genom sequences, or music annotation. Even if we are not primarily interested in immediate applications, results obtained for a variety of languages show that our methods are interesting for many practical tasks in text mining, terminology extraction and lexicography, search engine technology, and related fields.

  Access Paper or Ask Questions

FST Based Morphological Analyzer for Hindi Language

Jul 23, 2012
Deepak Kumar, Manjeet Singh, Seema Shukla

Hindi being a highly inflectional language, FST (Finite State Transducer) based approach is most efficient for developing a morphological analyzer for this language. The work presented in this paper uses the SFST (Stuttgart Finite State Transducer) tool for generating the FST. A lexicon of root words is created. Rules are then added for generating inflectional and derivational words from these root words. The Morph Analyzer developed was used in a Part Of Speech (POS) Tagger based on Stanford POS Tagger. The system was first trained using a manually tagged corpus and MAXENT (Maximum Entropy) approach of Stanford POS tagger was then used for tagging input sentences. The morphological analyzer gives approximately 97% correct results. POS tagger gives an accuracy of approximately 87% for the sentences that have the words known to the trained model file, and 80% accuracy for the sentences that have the words unknown to the trained model file.

  Access Paper or Ask Questions

Probabilistic top-down parsing and language modeling

May 08, 2001
Brian Roark

This paper describes the functioning of a broad-coverage probabilistic top-down parser, and its application to the problem of language modeling for speech recognition. The paper first introduces key notions in language modeling and probabilistic parsing, and briefly reviews some previous approaches to using syntactic structure for language modeling. A lexicalized probabilistic top-down parser is then presented, which performs very well, in terms of both the accuracy of returned parses and the efficiency with which they are found, relative to the best broad-coverage statistical parsers. A new language model which utilizes probabilistic top-down parsing is then outlined, and empirical results show that it improves upon previous work in test corpus perplexity. Interpolation with a trigram model yields an exceptional improvement relative to the improvement observed by other models, demonstrating the degree to which the information captured by our parsing model is orthogonal to that captured by a trigram model. A small recognition experiment also demonstrates the utility of the model.

* 28 pages, 6 tables, 8 figures. To appear in Computational Linguistics 27(2), June 2001 

  Access Paper or Ask Questions

Automatic Depression Detection: An Emotional Audio-Textual Corpus and a GRU/BiLSTM-based Model

Feb 15, 2022
Ying Shen, Huiyu Yang, Lin Lin

Depression is a global mental health problem, the worst case of which can lead to suicide. An automatic depression detection system provides great help in facilitating depression self-assessment and improving diagnostic accuracy. In this work, we propose a novel depression detection approach utilizing speech characteristics and linguistic contents from participants' interviews. In addition, we establish an Emotional Audio-Textual Depression Corpus (EATD-Corpus) which contains audios and extracted transcripts of responses from depressed and non-depressed volunteers. To the best of our knowledge, EATD-Corpus is the first and only public depression dataset that contains audio and text data in Chinese. Evaluated on two depression datasets, the proposed method achieves the state-of-the-art performances. The outperforming results demonstrate the effectiveness and generalization ability of the proposed method. The source code and EATD-Corpus are available at

  Access Paper or Ask Questions

Towards a Real-time Measure of the Perception of Anthropomorphism in Human-robot Interaction

Jan 24, 2022
Maria Tsfasman, Avinash Saravanan, Dekel Viner, Daan Goslinga, Sarah de Wolf, Chirag Raman, Catholijn M. Jonker, Catharine Oertel

How human-like do conversational robots need to look to enable long-term human-robot conversation? One essential aspect of long-term interaction is a human's ability to adapt to the varying degrees of a conversational partner's engagement and emotions. Prosodically, this can be achieved through (dis)entrainment. While speech-synthesis has been a limiting factor for many years, restrictions in this regard are increasingly mitigated. These advancements now emphasise the importance of studying the effect of robot embodiment on human entrainment. In this study, we conducted a between-subjects online human-robot interaction experiment in an educational use-case scenario where a tutor was either embodied through a human or a robot face. 43 English-speaking participants took part in the study for whom we analysed the degree of acoustic-prosodic entrainment to the human or robot face, respectively. We found that the degree of subjective and objective perception of anthropomorphism positively correlates with acoustic-prosodic entrainment.

* MuCAI'21: Proceedings of the 2nd ACM Multimedia Workshop on Multimodal Conversational AI, 2021 

  Access Paper or Ask Questions

MultiSV: Dataset for Far-Field Multi-Channel Speaker Verification

Nov 11, 2021
Ladislav Mošner, Oldřich Plchot, Lukáš Burget, Jan Černocký

Motivated by unconsolidated data situation and the lack of a standard benchmark in the field, we complement our previous efforts and present a comprehensive corpus designed for training and evaluating text-independent multi-channel speaker verification systems. It can be readily used also for experiments with dereverberation, denoising, and speech enhancement. We tackled the ever-present problem of the lack of multi-channel training data by utilizing data simulation on top of clean parts of the Voxceleb dataset. The development and evaluation trials are based on a retransmitted Voices Obscured in Complex Environmental Settings (VOiCES) corpus, which we modified to provide multi-channel trials. We publish full recipes that create the dataset from public sources as the MultiSV corpus, and we provide results with two of our multi-channel speaker verification systems with neural network-based beamforming based either on predicting ideal binary masks or the more recent Conv-TasNet.

* Submitted to ICASSP 2022 

  Access Paper or Ask Questions

Semantic sentence similarity: size does not always matter

Jun 16, 2021
Danny Merkx, Stefan L. Frank, Mirjam Ernestus

This study addresses the question whether visually grounded speech recognition (VGS) models learn to capture sentence semantics without access to any prior linguistic knowledge. We produce synthetic and natural spoken versions of a well known semantic textual similarity database and show that our VGS model produces embeddings that correlate well with human semantic similarity judgements. Our results show that a model trained on a small image-caption database outperforms two models trained on much larger databases, indicating that database size is not all that matters. We also investigate the importance of having multiple captions per image and find that this is indeed helpful even if the total number of images is lower, suggesting that paraphrasing is a valuable learning signal. While the general trend in the field is to create ever larger datasets to train models on, our findings indicate other characteristics of the database can just as important important.

* This paper has been accepted at Interspeech 2021 where it will be presented and appear in the conference proceedings in September 2021 

  Access Paper or Ask Questions