"speech": models, code, and papers

Variable-rate discrete representation learning

Mar 10, 2021
Sander Dieleman, Charlie Nash, Jesse Engel, Karen Simonyan

Semantically meaningful information content in perceptual signals is usually unevenly distributed. In speech signals, for example, there are often many silences, and the speed of pronunciation can vary considerably. In this work, we propose slow autoencoders (SlowAEs) for unsupervised learning of high-level variable-rate discrete representations of sequences, and apply them to speech. We show that the resulting event-based representations automatically grow or shrink depending on the density of salient information in the input signals, while still allowing for faithful signal reconstruction. We develop run-length Transformers (RLTs) for event-based representation modelling and use them to construct language models in the speech domain, which are able to generate grammatical and semantically coherent utterances and continuations.

* 26 pages, 15 figures, samples can be found at 
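
The abstract above refers to event-based representations and run-length Transformers. As a rough illustration of the general idea of run-length encoding only (not the paper's SlowAE or RLT implementation), the sketch below collapses a sequence of discrete codes into (value, length) events. The function names and the example codes are assumptions made for this illustration.

```python
from itertools import groupby

def run_length_encode(codes):
    """Collapse consecutive repeats into (value, run_length) events."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(codes)]

def run_length_decode(events):
    """Expand (value, run_length) events back into the original sequence."""
    return [value for value, length in events for _ in range(length)]

# Hypothetical discrete codes: long runs (e.g. silence) become single events,
# so the event sequence is shorter wherever the input changes slowly.
codes = [3, 3, 3, 3, 7, 7, 1]
events = run_length_encode(codes)
restored = run_length_decode(events)
```

The number of events depends on how often the code changes rather than on the input length, which is one way to read the claim that the representation grows or shrinks with information density.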

Modeling Prosodic Phrasing with Multi-Task Learning in Tacotron-based TTS

Aug 11, 2020
Rui Liu, Berrak Sisman, Feilong Bao, Guanglai Gao, Haizhou Li

Tacotron-based end-to-end speech synthesis has shown remarkable voice quality. However, the rendering of prosody in the synthesized speech remains to be improved, especially for long sentences, where prosodic phrasing errors can occur frequently. In this paper, we extend the Tacotron-based speech synthesis framework to explicitly model the prosodic phrase breaks. We propose a multi-task learning scheme for Tacotron training that optimizes the system to predict both Mel spectrum and phrase breaks. To the best of our knowledge, this is the first implementation of multi-task learning for Tacotron-based TTS with a prosodic phrasing model. Experiments show that our proposed training scheme consistently improves the voice quality for both Chinese and Mongolian systems.

* To appear in IEEE Signal Processing Letters (SPL) 
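
A multi-task objective of the kind described above is often written as a weighted sum of per-task losses. The sketch below is a minimal illustration under that assumption: the choice of mean squared error for the Mel spectrum, binary cross-entropy for phrase breaks, and the `break_weight` value are all hypothetical and are not taken from the paper.

```python
import math

def mse(pred, target):
    """Mean squared error over flat lists (stand-in for a Mel-spectrum loss)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross-entropy for per-token phrase-break labels (1 = break)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)

def multi_task_loss(mel_pred, mel_target, break_probs, break_labels, break_weight=0.5):
    """Weighted sum of the two task losses; break_weight is an assumed hyperparameter."""
    return mse(mel_pred, mel_target) + break_weight * binary_cross_entropy(break_probs, break_labels)
```

Setting `break_weight` to zero recovers single-task training on the Mel spectrum alone, which makes the auxiliary task's contribution easy to ablate.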

Mutli-task Learning with Alignment Loss for Far-field Small-Footprint Keyword Spotting

May 07, 2020
Haiwei Wu, Yan Jia, Yuanfei Nie, Ming Li

In this paper, we focus on the task of small-footprint keyword spotting under the far-field scenario. Far-field environments are commonly encountered in real-life speech applications, and they cause severe degradation of performance due to room reverberation and various kinds of noise. Our baseline system is built on a convolutional neural network trained with pooled data of both far-field and close-talking speech. To cope with the distortions, we adopt a multi-task learning scheme with alignment loss to reduce the mismatch between the embedding features learned from different domains of data. Experimental results show that our proposed method maintains the performance on close-talking speech and achieves significant improvement on the far-field test set.

* Submitted to INTERSPEECH 2020 

Cross-stitched Multi-modal Encoders

Apr 20, 2022
Karan Singla, Daniel Pressel, Ryan Price, Bhargav Srinivas Chinnari, Yeon-Jun Kim, Srinivas Bangalore

In this paper, we propose a novel architecture for multi-modal speech and text input. We combine pretrained speech and text encoders using multi-headed cross-modal attention and jointly fine-tune on the target problem. The resultant architecture can be used for continuous token-level classification or utterance-level prediction acting on simultaneous text and speech. The resultant encoder efficiently captures both acoustic-prosodic and lexical information. We compare the benefits of multi-headed attention-based fusion for multi-modal utterance-level classification against a simple concatenation of pre-pooled, modality-specific representations. Our model architecture is compact, resource efficient, and can be trained on a single consumer GPU card.
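
To make the fusion mechanism mentioned above more concrete, here is a minimal sketch of scaled dot-product cross-attention in which text-token vectors act as queries over speech-frame vectors. It is a single-head simplification with no learned projections, so it is not the paper's architecture; the function names and toy vectors are assumptions for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """For each query (e.g. a text token), return a weighted sum of values
    (e.g. speech frames), weighted by softmax(q . k / sqrt(d))."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Toy example: one text-token query attending over two speech frames.
text_queries = [[0.0, 0.0]]
speech_keys = [[1.0, 0.0], [0.0, 1.0]]
speech_values = [[2.0, 0.0], [0.0, 4.0]]
fused = cross_attention(text_queries, speech_keys, speech_values)
```

Unlike concatenating pooled per-modality vectors, this produces one fused vector per query token, which is what makes token-level classification possible.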

textless-lib: a Library for Textless Spoken Language Processing

Feb 15, 2022
Eugene Kharitonov, Jade Copet, Kushal Lakhotia, Tu Anh Nguyen, Paden Tomasello, Ann Lee, Ali Elkahky, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux, Yossi Adi

Textless spoken language processing research aims to extend the applicability of the standard NLP toolset to spoken language and to languages with few or no textual resources. In this paper, we introduce textless-lib, a PyTorch-based library aimed at facilitating research in this area. We describe the building blocks that the library provides and demonstrate its usability by discussing three different use-case examples: (i) speaker probing, (ii) speech resynthesis and compression, and (iii) speech continuation. We believe that textless-lib substantially simplifies research in the textless setting and will be handy not only for speech researchers but also for the NLP community at large. The code, documentation, and pre-trained models are available at .

* The library is available here 

Mobile Keyboard Input Decoding with Finite-State Transducers

Apr 13, 2017
Tom Ouyang, David Rybach, Françoise Beaufays, Michael Riley

We propose a finite-state transducer (FST) representation for the models used to decode keyboard inputs on mobile devices. Drawing from learnings from the field of speech recognition, we describe a decoding framework that can satisfy the strict memory and latency constraints of keyboard input. We extend this framework to support functionalities typically not present in speech recognition, such as literal decoding, autocorrections, word completions, and next word predictions. We describe the general framework of what we call for short the keyboard "FST decoder" as well as the implementation details that are new compared to a speech FST decoder. We demonstrate that the FST decoder enables new UX features such as post-corrections. Finally, we sketch how this decoder can support advanced features such as personalization and contextualization.

Talking Face Generation with Multilingual TTS

May 13, 2022
Hyoung-Kyu Song, Sang Hoon Woo, Junhyeok Lee, Seungmin Yang, Hyunjae Cho, Youseong Lee, Dongho Choi, Kang-wook Kim

In this work, we propose a joint system combining a talking face generation system with a text-to-speech system that can generate multilingual talking face videos from text input alone. Our system can synthesize natural multilingual speech while maintaining the vocal identity of the speaker, as well as lip movements synchronized to the synthesized speech. We demonstrate the generalization capabilities of our system by selecting four languages (Korean, English, Japanese, and Chinese), each from a different language family. We also compare the outputs of our talking face generation model to outputs of a prior work that claims multilingual support. For our demo, we add a translation API to the preprocessing stage and present it in the form of a neural dubber so that users can utilize the multilingual property of our system more easily.

* Accepted to CVPR Demo Track (2022) 

Summary On The ICASSP 2022 Multi-Channel Multi-Party Meeting Transcription Grand Challenge

Feb 08, 2022
Fan Yu, Shiliang Zhang, Pengcheng Guo, Yihui Fu, Zhihao Du, Siqi Zheng, Weilong Huang, Lei Xie, Zheng-Hua Tan, DeLiang Wang, Yanmin Qian, Kong Aik Lee, Zhijie Yan, Bin Ma, Xin Xu, Hui Bu

The ICASSP 2022 Multi-channel Multi-party Meeting Transcription Grand Challenge (M2MeT) focuses on one of the most valuable and most challenging scenarios of speech technologies. The M2MeT challenge has two tracks: speaker diarization (track 1) and multi-speaker automatic speech recognition (ASR) (track 2). Along with the challenge, we released 120 hours of real-recorded Mandarin meeting speech data with manual annotation, including far-field data collected by an 8-channel microphone array as well as near-field data collected by each participant's headset microphone. We briefly describe the released dataset, track setups, and baselines, and summarize the challenge results and major techniques used in the submissions.

* 5 pages, 4 tables 

Self-supervised reinforcement learning for speaker localisation with the iCub humanoid robot

Nov 12, 2020
Jonas Gonzalez-Billandon, Lukas Grasse, Matthew Tata, Alessandra Sciutti, Francesco Rea

In the future, robots will interact more and more with humans and will have to communicate naturally and efficiently. Automatic speech recognition (ASR) systems will play an important role in creating natural interactions and making robots better companions. Humans excel in speech recognition in noisy environments and are able to filter out noise. Looking at a person's face is one of the mechanisms that humans rely on when it comes to filtering speech in such noisy environments. Having a robot that can look toward a speaker could benefit ASR performance in challenging environments. To this end, we propose a self-supervised reinforcement learning-based framework, inspired by the early development of humans, that allows the robot to autonomously create a dataset that is later used to learn to localize speakers with a deep learning network.

Stutter Diagnosis and Therapy System Based on Deep Learning

Jul 13, 2020
Gresha Bhatia, Binoy Saha, Mansi Khamkar, Ashish Chandwani, Reshma Khot

Stuttering, also called stammering, is a communication disorder that breaks the continuity of speech. This program of work is an attempt to develop automatic recognition procedures to assess stuttered dysfluencies and use these assessments to filter out speech therapies for an individual. Stuttering may be in the form of repetitions, prolongations or abnormal stoppages of sounds and syllables. Our system aims to help stutterers by diagnosing the severity and type of stutter and also by suggesting appropriate therapies for practice by learning the correlation between stutter descriptors and the effectiveness of speech therapies on them. This paper focuses on the implementation of a stutter diagnosis agent using a Gated Recurrent CNN on MFCC audio features and a therapy recommendation agent using an SVM. It also presents the results obtained and various key findings of the system developed.

* About stutter classification, severity diagnosis and therapy recommendation 
