
"speech": models, code, and papers

Singing Synthesis: with a little help from my attention

Dec 12, 2019
Orazio Angelini, Alexis Moinet, Kayoko Yanagisawa, Thomas Drugman

We present a novel attention-based system for singing synthesis. Starting from a musical score with notes and lyrics, we build a phoneme-level multi-stream note embedding. The embedding contains the information encoded in the score regarding pitch, duration, and the phonemes to be pronounced on each note. This note representation is used to condition an attention-based sequence-to-sequence architecture that generates mel-spectrograms. Our model demonstrates that attention can be successfully applied to singing synthesis. The system requires considerably less explicit modelling of voice features such as F0 patterns, vibratos, and note and phoneme durations than most models in the literature. However, we observe that dispensing with duration modelling entirely introduces occasional instabilities in the generated spectrograms. We train an autoregressive WaveNet on a combination of speech and singing data to serve as a neural vocoder for the mel-spectrograms produced by the sequence-to-sequence architecture.
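
The phoneme-level multi-stream note embedding described above can be pictured as a simple concatenation of per-stream lookups. The sketch below is an illustration only: the vocabulary sizes, embedding widths, and random lookup tables are invented, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary sizes and embedding widths (not from the paper).
N_PITCHES, N_DURATIONS, N_PHONEMES = 128, 32, 50
D_PITCH, D_DUR, D_PHON = 16, 8, 24

# Randomly initialised lookup tables standing in for learned embeddings.
pitch_table = rng.normal(size=(N_PITCHES, D_PITCH))
dur_table = rng.normal(size=(N_DURATIONS, D_DUR))
phon_table = rng.normal(size=(N_PHONEMES, D_PHON))

def note_embedding(pitch_id, dur_id, phoneme_id):
    """Concatenate the pitch, duration, and phoneme streams for one
    phoneme-level position in the score."""
    return np.concatenate([pitch_table[pitch_id],
                           dur_table[dur_id],
                           phon_table[phoneme_id]])

# One score position: MIDI-style pitch 60, duration bucket 4, phoneme index 12.
e = note_embedding(60, 4, 12)
print(e.shape)  # (48,)
```

A sequence of such vectors, one per phoneme, would then condition the attention-based decoder.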

* Submitted to ICASSP 2020 


Efficient Dynamic WFST Decoding for Personalized Language Models

Oct 23, 2019
Jun Liu, Jiedan Zhu, Vishal Kathuria, Fuchun Peng

We propose a two-layer cache mechanism to speed up dynamic WFST decoding with personalized language models. The first layer is a public cache that stores most of the static part of the graph and is shared globally among all users. The second layer is a private cache that stores the graph representing the personalized language model, shared only by the utterances from a particular user. We also propose two simple yet effective pre-initialization methods, one based on breadth-first search and another based on a data-driven exploration of decoder states using previous utterances. Experiments on a calling speech recognition task using a personalized contact list demonstrate that the proposed public cache reduces decoding time by a factor of three compared to decoding without pre-initialization. Using the private cache provides additional efficiency gains, reducing decoding time by a factor of five.
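
The two-layer lookup order (private cache, then public cache, then on-demand graph expansion) can be sketched with plain dictionaries. Everything here is hypothetical, including the function names, the `contact:` state convention, and the `build_fn` signature:

```python
# Minimal sketch of a two-layer cache for dynamic graph expansion: a shared
# public cache for the static part of the graph, and a per-user private cache
# for the personalized part.

public_cache = {}     # shared across all users
private_caches = {}   # user_id -> {state: arcs}

def expand_state(state, user_id, build_fn):
    """Return the expanded arcs for a decoder state, consulting the private
    cache first, then the public cache, then building on demand."""
    private = private_caches.setdefault(user_id, {})
    if state in private:
        return private[state]
    if state in public_cache:
        return public_cache[state]
    arcs, personalized = build_fn(state)
    # Personalized states go to the private layer; static ones are shared.
    (private if personalized else public_cache)[state] = arcs
    return arcs

def toy_build(state):
    # Pretend states prefixed with "contact:" come from the personal LM.
    return [f"{state}->next"], state.startswith("contact:")

expand_state("hmm:0", user_id="u1", build_fn=toy_build)
expand_state("contact:alice", user_id="u1", build_fn=toy_build)
print("hmm:0" in public_cache, "contact:alice" in private_caches["u1"])  # True True
```

A cache hit in either layer skips the (expensive) expansion, which is where the reported speedups come from.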

* 5 pages, 4 figures 


Robot Sound Interpretation: Combining Sight and Sound in Learning-Based Control

Sep 19, 2019
Peixin Chang, Shuijing Liu, Haonan Chen, Katherine Driggs-Campbell

We explore the interpretation of sound for robot decision-making, inspired by human speech comprehension. While previous methods use natural language processing to translate sound to text, we propose an end-to-end deep neural network which directly learns control policies from images and sound signals. The network is trained using reinforcement learning with auxiliary losses on the sight and sound network branches. We demonstrate our approach on two robots, a TurtleBot3 and a Kuka-IIWA arm, which hear a command word, identify the associated target object, and perform precise control to reach the target. For both systems, we perform ablation studies in simulation to show the effectiveness of our network empirically. We also successfully transfer the policy learned in simulation to a real-world TurtleBot3, which effectively understands word commands, searches for the object, and moves toward that location with more intuitive motion than a traditional motion planner with perfect information.
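
Auxiliary losses on the sight and sound branches amount to adding supervised terms to the reinforcement-learning objective. A toy version, with invented weights and mean-squared-error auxiliaries (the paper's exact losses and weights may differ):

```python
import numpy as np

def total_loss(rl_loss, sight_pred, sight_target, sound_pred, sound_target,
               w_sight=0.5, w_sound=0.5):
    """Illustrative combined objective: the RL loss plus weighted auxiliary
    regression losses on the sight and sound branches."""
    aux_sight = np.mean((sight_pred - sight_target) ** 2)
    aux_sound = np.mean((sound_pred - sound_target) ** 2)
    return rl_loss + w_sight * aux_sight + w_sound * aux_sound

loss = total_loss(rl_loss=1.0,
                  sight_pred=np.array([0.0, 1.0]), sight_target=np.array([0.0, 1.0]),
                  sound_pred=np.array([1.0]), sound_target=np.array([0.0]))
print(loss)  # 1.5
```

The auxiliary terms give the perception branches a dense training signal even when the RL reward is sparse.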


Code-Switched Language Models Using Neural Based Synthetic Data from Parallel Sentences

Sep 18, 2019
Genta Indra Winata, Andrea Madotto, Chien-Sheng Wu, Pascale Fung

Training code-switched language models is difficult due to the lack of data and the complexity of the grammatical structure. Linguistic constraint theories have been used for decades to generate artificial code-switching sentences to cope with this issue. However, these require external word alignments or constituency parsers, which produce erroneous results on distant language pairs. We propose a sequence-to-sequence model with a copy mechanism to generate code-switching data by leveraging parallel monolingual translations from a limited source of code-switching data. The model learns how to combine words from parallel sentences and to identify when to switch from one language to the other. Moreover, it captures code-switching constraints by attending to and aligning the words in the inputs, without requiring any external knowledge. Based on experimental results, the language model trained with the generated sentences achieves state-of-the-art performance and improves end-to-end automatic speech recognition.
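
The copy mechanism resembles a pointer-generator: the output distribution mixes the decoder's vocabulary distribution with a copy distribution given by attention over the source words. The numeric sketch below uses the standard pointer-generator mixture, which may differ from the paper's exact parameterization:

```python
import numpy as np

def copy_mixture(p_gen, vocab_dist, attn_weights, src_token_ids):
    """Mix the generator's vocabulary distribution with a copy distribution
    induced by attention over the source tokens: p_gen * P_vocab plus
    (1 - p_gen) * attention mass routed to each source token's vocab index."""
    out = p_gen * vocab_dist
    for w, tok in zip(attn_weights, src_token_ids):
        out[tok] += (1.0 - p_gen) * w
    return out

vocab_dist = np.array([0.7, 0.1, 0.1, 0.1])  # P(w) from the generator
attn = np.array([0.9, 0.1])                  # attention over two source tokens
src_ids = [2, 3]                             # those tokens' vocabulary indices
dist = copy_mixture(p_gen=0.5, vocab_dist=vocab_dist,
                    attn_weights=attn, src_token_ids=src_ids)
print(dist, dist.sum())  # still a valid distribution: sums to 1
```

Token 2 receives most of the copy mass here, modelling a switch point where the model copies a source-language word directly.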

* Accepted in CoNLL 2019 


Avaya Conversational Intelligence: A Real-Time System for Spoken Language Understanding in Human-Human Call Center Conversations

Sep 02, 2019
Jan Mizgajski, Adrian Szymczak, Robert Głowski, Piotr Szymański, Piotr Żelasko, Łukasz Augustyniak, Mikołaj Morzy, Yishay Carmiel, Jeff Hodson, Łukasz Wójciak, Daniel Smoczyk, Adam Wróbel, Bartosz Borowik, Adam Artajew, Marcin Baran, Cezary Kwiatkowski, Marzena Żyła-Hoppe

Avaya Conversational Intelligence (ACI) is an end-to-end, cloud-based solution for real-time Spoken Language Understanding for call centers. It combines large-vocabulary, real-time speech recognition, transcript refinement, and entity and intent recognition in order to convert live audio into a rich, actionable stream of structured events. These events can be further leveraged with a business rules engine, thus serving as a foundation for real-time supervision and assistance applications. After ingestion, calls are enriched with unsupervised keyword extraction, abstractive summarization, and business-defined attributes, enabling offline use cases such as business intelligence, topic mining, full-text search, quality assurance, and agent training. ACI comes with a pretrained, configurable library of hundreds of intents and a robust intent training environment that allows for efficient, cost-effective creation and customization of customer-specific intents.

* Accepted for Interspeech 2019 


A Fully Differentiable Beam Search Decoder

Feb 16, 2019
Ronan Collobert, Awni Hannun, Gabriel Synnaeve

We introduce a new beam search decoder that is fully differentiable, making it possible to optimize at training time through the inference procedure. Our decoder allows us to combine models which operate at different granularities (e.g. acoustic and language models). It can be used when target sequences are not aligned to input sequences by considering all possible alignments between the two. We demonstrate our approach scales by applying it to speech recognition, jointly training acoustic and word-level language models. The system is end-to-end, with gradients flowing through the whole architecture from the word-level transcriptions. Recent research efforts have shown that deep neural networks with attention-based mechanisms are powerful enough to successfully train an acoustic model from the final transcription, while implicitly learning a language model. Instead, we show that it is possible to discriminatively train an acoustic model jointly with an explicit and possibly pre-trained language model.
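
The key to keeping the decoder differentiable is aggregating over alignments softly rather than taking a hard max, as in CTC-style marginalization. The toy sketch below illustrates only that contrast; the scores and LM weight are invented, and the paper's decoder operates over full lattices rather than a flat list:

```python
import numpy as np

def logsumexp(x):
    """Numerically stable log-sum-exp: a smooth, differentiable stand-in
    for max over log-scores."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

# Log-scores for three candidate alignments of a target sequence,
# combining acoustic and language-model terms.
acoustic_scores = np.array([-1.2, -0.5, -2.0])
lm_scores = np.array([-0.3, -0.9, -0.1])
alpha = 1.0  # LM weight (illustrative)

combined = acoustic_scores + alpha * lm_scores
hard = np.max(combined)       # Viterbi-style: non-differentiable argmax
soft = logsumexp(combined)    # marginalizes all alignments, stays differentiable
print(soft >= hard)  # True: logsumexp upper-bounds max
```

Because `logsumexp` is smooth in every score, gradients can flow from the word-level transcription back through both the acoustic and language models.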


Quality Measures for Speaker Verification with Short Utterances

Jan 29, 2019
Arnab Poddar, Md Sahidullah, Goutam Saha

The performance of automatic speaker verification (ASV) systems degrades as the amount of speech used for enrollment and verification is reduced. Combining multiple systems based on different features and classifiers considerably reduces the speaker verification error rate with short utterances. This work attempts to incorporate supplementary information during the system combination process. We use the quality of the estimated model parameters as supplementary information. We introduce a class of novel quality measures formulated using the zero-order sufficient statistics computed during the i-vector extraction process. We use the proposed quality measures as side information for combining ASV systems based on the Gaussian mixture model-universal background model (GMM-UBM) and i-vectors. The proposed system yields considerable improvements in performance metrics on NIST SRE corpora in short-duration conditions, including improvement over a state-of-the-art i-vector system.
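
The zero-order sufficient statistics in question are the per-component sums of frame posteriors accumulated during i-vector extraction. A minimal sketch with toy responsibilities (how the paper maps these statistics to a quality score is not reproduced here):

```python
import numpy as np

def zero_order_stats(posteriors):
    """Zero-order Baum-Welch statistics: for each UBM component, the sum of
    frame-level posterior probabilities, as accumulated in i-vector extraction."""
    return posteriors.sum(axis=0)

# 4 frames, 3 UBM components; each row of responsibilities sums to 1.
post = np.array([[0.7, 0.2, 0.1],
                 [0.6, 0.3, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.2, 0.2, 0.6]])
N = zero_order_stats(post)
print(N, N.sum())  # component occupancies; total equals the number of frames
```

Intuitively, a short utterance spreads little occupancy mass across components, so statistics like these signal how well the model parameters are estimated.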

* Accepted for publication in Digital Signal Processing: A Review Journal 


Confidence Estimation and Deletion Prediction Using Bidirectional Recurrent Neural Networks

Oct 30, 2018
Anton Ragni, Qiujia Li, Mark Gales, Yu Wang

The standard approach to assessing the reliability of automatic speech transcriptions is through the use of confidence scores. If accurate, these scores provide a flexible mechanism to flag transcription errors for upstream and downstream applications. One challenging type of error that recognisers make is deletions. These errors are not accounted for by the standard confidence estimation schemes and are hard to rectify in upstream and downstream processing. High deletion rates are prominent in the limited-resource and highly mismatched training/testing conditions studied under the IARPA Babel and Material programs. This paper looks at the use of bidirectional recurrent neural networks to yield confidence estimates for predicted as well as deleted words. Several simple schemes are examined for combination. To assess the usefulness of this approach, the combined confidence score is examined for untranscribed data selection that favours transcriptions with lower deletion errors. Experiments are conducted using IARPA Babel/Material program languages.
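
Combining a word-level confidence with a deletion prediction can be done with very simple schemes, such as a product or a minimum. The schemes below are purely illustrative, not necessarily the ones examined in the paper:

```python
def combined_confidence(word_conf, p_del_after, scheme="product"):
    """Toy combination of a word's confidence with the probability that no
    word was deleted immediately after it. Schemes are illustrative only."""
    no_deletion = 1.0 - p_del_after
    if scheme == "product":
        return word_conf * no_deletion
    if scheme == "min":
        return min(word_conf, no_deletion)
    raise ValueError(f"unknown scheme: {scheme}")

# Word recognised with confidence 0.9, but a 20% chance a word was
# deleted right after it.
print(combined_confidence(0.9, 0.2))          # ~0.72
print(combined_confidence(0.9, 0.2, "min"))   # 0.8
```

Either scheme penalises transcriptions with likely deletions, which is what data selection favouring low-deletion transcriptions requires.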

* Accepted as a conference paper at 2018 IEEE Workshop on Spoken Language Technology (SLT 2018) 


PhoneMD: Learning to Diagnose Parkinson's Disease from Smartphone Data

Oct 01, 2018
Patrick Schwab, Walter Karlen

Parkinson's disease is a neurodegenerative disease that can affect a person's movement, speech, dexterity, and cognition. Physicians primarily diagnose Parkinson's disease by performing a clinical assessment of symptoms. However, misdiagnoses are common. One factor that contributes to misdiagnoses is that the symptoms of Parkinson's disease may not be prominent at the time the clinical assessment is performed. Here, we present a machine-learning approach to distinguishing between healthy people and people with Parkinson's disease using long-term data collected from smartphone-based tests, including walking, voice, tapping, and memory tests. We demonstrate that the presented approach leads to significant performance improvements over existing methods (area under the receiver operating characteristic curve = 0.85) in data from a cohort of 1853 participants. Our results confirm that smartphone data collected over extended periods of time could in the future potentially be used as additional evidence for the diagnosis of Parkinson's disease.


Deep Unsupervised Multi-View Detection of Video Game Stream Highlights

Jul 25, 2018
Charles Ringer, Mihalis A. Nicolaou

We consider the problem of automatic highlight detection in video game streams. Currently, the vast majority of highlight-detection systems for games are triggered by the occurrence of hard-coded game events (e.g., score change, end-game), while most advanced tools and techniques are based on detection of highlights via visual analysis of game footage. We argue that in the context of game streaming, events that may constitute highlights are not only dependent on game footage, but also on social signals that are conveyed by the streamer during the play session (e.g., when interacting with viewers, or when commenting and reacting to the game). In this light, we present a multi-view unsupervised deep learning methodology for novelty-based highlight detection. The method jointly analyses both game footage and social signals such as the player's facial expressions and speech, and shows promising results for generating highlights on streams of popular games such as PlayerUnknown's Battlegrounds.
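
Novelty-based highlight detection scores a moment by how poorly a model of "typical" stream content reconstructs it. The sketch below substitutes a PCA projection for the paper's deep multi-view model; the data, dimensions, and rank are all invented:

```python
import numpy as np

rng = np.random.default_rng(1)

# Fit a low-rank model of "typical" multimodal feature vectors (8-dim toys
# standing in for fused game-footage and social-signal features).
typical = rng.normal(size=(200, 8))
mean = typical.mean(axis=0)
_, _, Vt = np.linalg.svd(typical - mean, full_matrices=False)
basis = Vt[:3]  # keep 3 principal directions

def novelty(x):
    """Reconstruction error of x under the low-rank model: large error means
    the moment is unlike typical stream content, i.e. a highlight candidate."""
    z = (x - mean) @ basis.T
    recon = z @ basis + mean
    return float(np.linalg.norm(x - recon))

ordinary = typical[0]
spike = typical[0] + 10.0  # an out-of-distribution moment
print(novelty(spike) > novelty(ordinary))  # True
```

Thresholding this score over time would yield candidate highlight segments without any labelled highlight data, matching the unsupervised setting.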

* Foundation of Digital Games 2018, 6 pages 
