
"speech": models, code, and papers

Distilling the Knowledge from Normalizing Flows

Jun 25, 2021
Dmitry Baranchuk, Vladimir Aliev, Artem Babenko

Normalizing flows are a powerful class of generative models demonstrating strong performance in several speech and vision problems. In contrast to other generative models, normalizing flows are latent variable models with tractable likelihoods and allow for stable training. However, they have to be carefully designed to represent invertible functions with efficient Jacobian determinant calculation. In practice, these requirements lead to overparameterized and sophisticated architectures that are inferior to alternative feed-forward models in terms of inference time and memory consumption. In this work, we investigate whether one can distill flow-based models into more efficient alternatives. We provide a positive answer to this question by proposing a simple distillation approach and demonstrating its effectiveness on state-of-the-art conditional flow-based models for image super-resolution and speech synthesis.
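
As a drastically simplified sketch of the distillation idea (not the paper's actual models), one can train a plain feed-forward student to regress onto a teacher's outputs. Here the `teacher` function is a hypothetical stand-in for a trained conditional flow, and the student is a minimal linear model fit in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "teacher": a toy deterministic function standing in for a
# trained conditional flow mapping a conditioning input x and latent z
# to an output sample.
def teacher(x, z):
    return 2.0 * x + 0.5 * z

# Distillation data: sample inputs, record the teacher's outputs.
X = rng.normal(size=(1024, 1))
Z = rng.normal(size=(1024, 1))
Y = teacher(X, Z)

# Student: a minimal feed-forward model y = w1*x + w2*z + b, fit by
# least squares to imitate the teacher (the distillation objective).
A = np.hstack([X, Z, np.ones_like(X)])        # design matrix [x, z, 1]
w, *_ = np.linalg.lstsq(A, Y, rcond=None)

student_out = A @ w
mse = float(np.mean((student_out - Y) ** 2))  # distillation error
```

In the paper the student is a deep network and the teacher a flow; the point of the sketch is only the training signal, i.e. regressing the efficient model onto the expensive model's outputs.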

* ICML Workshop: INNF+2021 (Spotlight) 


A New Dataset and Proposed Convolutional Neural Network Architecture for Classification of American Sign Language Digits

Nov 16, 2020
Arda Mavi

In interviews with people who work with speech-impaired persons, we learned that speech-impaired people have difficulty communicating with those around them who do not know sign language, which may cause them to isolate themselves from society and lose their sense of independence. To improve the quality of life of these individuals by facilitating communication between people who use sign language and those who do not, this paper makes three contributions. First, we created a new American Sign Language (ASL) digits dataset to support machine learning algorithms, which need large and varied data to succeed, and published it as the Sign Language Digits Dataset on the Kaggle Datasets web page. Second, we propose a Convolutional Neural Network (CNN) architecture that achieves 98% test accuracy on our dataset. Third, we compare it with popular CNN models.

* 4 pages, 1 figure 


Xiaomingbot: A Multilingual Robot News Reporter

Jul 12, 2020
Runxin Xu, Jun Cao, Mingxuan Wang, Jiaze Chen, Hao Zhou, Ying Zeng, Yuping Wang, Li Chen, Xiang Yin, Xijin Zhang, Songcheng Jiang, Yuxuan Wang, Lei Li

This paper presents Xiaomingbot, an intelligent, multilingual, and multimodal software robot equipped with four integral capabilities: news generation, news translation, news reading, and avatar animation. The system summarizes Chinese news that it automatically generates from data tables. Next, it translates the summary or the full article into multiple languages and reads the multilingual rendition through synthesized speech. Notably, Xiaomingbot uses voice cloning technology to synthesize speech trained on a real person's voice data in one input language. The proposed system enjoys several merits: it has an animated avatar and is able to generate and read multilingual news. Since it was put into practice, Xiaomingbot has written over 600,000 articles and gained over 150,000 followers on social media platforms.

* Accepted to ACL 2020 - system demonstration 


Broadband DOA estimation using Convolutional neural networks trained with noise signals

Dec 12, 2017
Soumitro Chakrabarty, Emanuël. A. P. Habets

A convolutional neural network (CNN) based classification method for broadband DOA estimation is proposed, where the phase component of the short-time Fourier transform coefficients of the received microphone signals is directly fed into the CNN, and the features required for DOA estimation are learned during training. Since only the phase component of the input is used, the CNN can be trained with synthesized noise signals, making the preparation of the training data set easier than with speech signals. Through experimental evaluation, the ability of the proposed noise-trained CNN framework to generalize to speech sources is demonstrated. In addition, the robustness of the system to noise and to small perturbations in microphone positions, as well as its ability to adapt to different acoustic conditions, is investigated through experiments with simulated and real data.
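
A minimal sketch of the input-feature step described above, assuming nothing about the paper's exact CNN: compute the STFT of each microphone signal, keep only the phase (`np.angle`), and stack one phase map per microphone as the CNN's input channels. The frame length, hop size, and 4-microphone array are illustrative choices, not the paper's configuration:

```python
import numpy as np

def phase_map(signal, frame_len=256, hop=128):
    """STFT phase of one microphone signal.
    Returns a (frames, frame_len // 2 + 1) array of phase angles."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames * np.hanning(frame_len), axis=1)
    return np.angle(spec)  # discard magnitude, keep phase only

# Synthesized noise in place of speech: using phase-only inputs is what
# lets the CNN be trained on noise yet generalize to speech sources.
rng = np.random.default_rng(0)
mics = [rng.normal(size=4096) for _ in range(4)]    # 4-channel array
features = np.stack([phase_map(m) for m in mics])   # (mics, frames, bins)
```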

* Published in Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) 2017 


What Drives the International Development Agenda? An NLP Analysis of the United Nations General Debate 1970-2016

Aug 19, 2017
Alexander Baturo, Niheer Dasandi, Slava J. Mikhaylov

Surprisingly little is known about agenda setting for international development in the United Nations (UN), despite its significant influence on the process and outcomes of development efforts. This paper addresses this shortcoming using a novel approach that applies natural language processing techniques to countries' annual statements in the UN General Debate. Every year, UN member states deliver statements during the General Debate presenting their governments' perspectives on major issues in world politics. These speeches provide invaluable information on state preferences across a wide range of issues, including international development, but have largely been overlooked in the study of global politics. This paper identifies the main international development topics that states raised in these speeches between 1970 and 2016, and examines the country-specific drivers of international development rhetoric.


Tongue contour extraction from ultrasound images based on deep neural network

May 19, 2016
Aurore Jaumard-Hakoun, Kele Xu, Pierre Roussel-Ragot, Gérard Dreyfus, Bruce Denby

Studying tongue motion during speech using ultrasound is a standard procedure, but automatic ultrasound image labelling remains a challenge, as standard tongue shape extraction methods typically require human intervention. This article presents a method based on deep neural networks to automatically extract tongue contours from ultrasound images of a speech dataset. We use a deep autoencoder trained to learn the relationship between an image and its related contour, so that the model can automatically reconstruct contours from the ultrasound image alone. In this paper, we use an automatic labelling algorithm instead of time-consuming hand-labelling during the training process, and estimate the performance of both automatic labelling and contour extraction relative to hand-labelling. The observed results show quality scores comparable to the state of the art.
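
As a heavily simplified, hypothetical illustration of learning an image-to-contour mapping (a single linear layer fit in closed form, rather than the paper's deep autoencoder), with toy synthetic data standing in for ultrasound frames and their labelled contours:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in data: flattened "ultrasound images" and contour targets.
# Real inputs would be image pixels; targets, sampled contour coordinates.
n, img_dim, contour_pts = 200, 64, 16
images = rng.normal(size=(n, img_dim))
true_map = rng.normal(size=(img_dim, contour_pts))
contours = images @ true_map                  # ground-truth relation

# One linear encoder/decoder stage: learn W so that images @ W
# reconstructs the contour from the image alone.
W, *_ = np.linalg.lstsq(images, contours, rcond=None)
pred = images @ W
err = float(np.mean((pred - contours) ** 2))  # reconstruction error
```

The deep autoencoder in the paper replaces this single linear map with stacked nonlinear layers, but the training target, reconstructing the contour from the image alone, is the same.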

* 5 pages, 3 figures, published in The International Congress of Phonetic Sciences, 2015 


VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer

Mar 08, 2022
Juan F. Montesinos, Venkatesh S. Kadandale, Gloria Haro

This paper presents an audio-visual approach for voice separation which outperforms state-of-the-art methods at low latency in two scenarios: speech and singing voice. The model is based on a two-stage network. Motion cues are obtained with a lightweight graph convolutional network that processes face landmarks. Then, both audio and motion features are fed to an audio-visual transformer, which produces a fairly good estimation of the isolated target source. In a second stage, the predominant voice is enhanced with an audio-only network. We present several ablation studies and a comparison with state-of-the-art methods. Finally, we explore the transferability of models trained for speech separation to the task of singing voice separation. The demos, code, and weights will be made publicly available at


Real-Time Neural Voice Camouflage

Dec 14, 2021
Mia Chiquier, Chengzhi Mao, Carl Vondrick

Automatic speech recognition systems have created exciting possibilities for applications; however, they also enable opportunities for systematic eavesdropping. We propose a method to camouflage a person's voice over-the-air from these systems without inconveniencing the conversation between people in the room. Standard adversarial attacks are not effective in real-time streaming situations because the characteristics of the signal will have changed by the time the attack is executed. We introduce predictive attacks, which achieve real-time performance by forecasting the attack that will be the most effective in the future. Under real-time constraints, our method jams the established speech recognition system DeepSpeech 4.17x more than baselines as measured by word error rate, and 7.27x more as measured by character error rate. We further demonstrate that our approach is practically effective in realistic environments over physical distances.

* 14 pages 


Compressing 1D Time-Channel Separable Convolutions using Sparse Random Ternary Matrices

Apr 02, 2021
Gonçalo Mordido, Matthijs Van Keirsbilck, Alexander Keller

We demonstrate that 1x1-convolutions in 1D time-channel separable convolutions may be replaced by constant, sparse random ternary matrices with weights in $\{-1,0,+1\}$. Such layers do not perform any multiplications and do not require training. Moreover, the matrices may be generated on the chip during computation and therefore do not require any memory access. With the same parameter budget, we can afford deeper and more expressive models, improving the Pareto frontiers of existing models on several tasks. For command recognition on Google Speech Commands v1, we improve the state-of-the-art accuracy from $97.21\%$ to $97.41\%$ at the same network size. Alternatively, we can lower the cost of existing models. For speech recognition on Librispeech, we halve the number of weights to be trained while sacrificing only about $1\%$ of the floating-point baseline's word error rate.
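
A small sketch of the replacement described above; the `density` value and the layer sizes are illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_ternary(in_ch, out_ch, density=0.1):
    """Constant random matrix with entries in {-1, 0, +1}.
    `density` is the fraction of nonzero entries. The matrix is fixed
    (no training) and, being seed-generated, could be regenerated
    on-chip instead of being stored in and fetched from memory."""
    mask = rng.random((in_ch, out_ch)) < density
    signs = rng.choice([-1.0, 1.0], size=(in_ch, out_ch))
    return mask * signs

# A 1x1 convolution over a (time, channels) activation is just a matrix
# multiply across the channel axis, so the ternary matrix drops in
# directly; in hardware it needs only additions/subtractions of the
# selected input channels, no multiplications.
W = sparse_ternary(64, 128)
x = rng.normal(size=(100, 64))   # (time steps, input channels)
y = x @ W                        # (time steps, output channels)
```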
