"speech": models, code, and papers

FlowSense: Monitoring Airflow in Building Ventilation Systems Using Audio Sensing

Feb 22, 2022
Bhawana Chhaglani, Camellia Zakaria, Adam Lechowicz, Prashant Shenoy, Jeremy Gummeson

Proper indoor ventilation through buildings' heating, ventilation, and air conditioning (HVAC) systems has become an increasing public health concern that significantly impacts individuals' health and safety at home, work, and school. While much work has progressed in providing energy efficiency and user comfort for HVAC systems through IoT devices and mobile-sensing approaches, ventilation has received less attention despite its importance. Motivated by the need to monitor airflow from building ventilation systems with commodity sensing devices, we present FlowSense, a machine learning-based algorithm that predicts airflow rate from sensed audio data in indoor spaces. Our ML technique can predict the state of an air vent (whether it is on or off) as well as the rate of air flowing through active vents. By exploiting a low-pass filter to obtain low-frequency audio signals, we put together a privacy-preserving pipeline that leverages a silence detection algorithm to sense the sound of air from an HVAC air vent only when no human speech is detected. We also propose Minimum Persistent Sensing (MPS), a post-processing algorithm that reduces interference from ambient noise, including ongoing human conversation, office machines, and traffic noise. Together, these techniques ensure user privacy and improve the robustness of FlowSense. We validate our approach, which yields over 90% accuracy in predicting vent status and 0.96 MSE in predicting airflow rate when the device is placed within 2.25 meters of an air vent. Additionally, we demonstrate that our approach, as a mobile audio-sensing platform, is robust to smartphone model, distance, and orientation. Finally, we evaluate FlowSense's privacy-preserving pipeline through a user study and a Google Speech Recognition service, confirming that the audio signals we used as input data are inaudible and cannot be reconstructed.
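
The low-pass filtering step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the filter design, cutoff frequency, or sample rate, so the single-pole filter, the 400 Hz cutoff, the 16 kHz sample rate, and all function names below are assumptions chosen only to show the idea: low-frequency content passes nearly unchanged, while content in the band where much speech energy lies is strongly attenuated.

```python
import math


def low_pass(samples, sample_rate_hz, cutoff_hz):
    """Single-pole IIR low-pass filter (exponential smoothing).

    Illustrative stand-in only: the abstract does not specify a filter design.
    """
    dt = 1.0 / sample_rate_hz
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = dt / (rc + dt)  # smoothing factor in (0, 1)
    filtered, prev = [], 0.0
    for x in samples:
        prev += alpha * (x - prev)
        filtered.append(prev)
    return filtered


def rms(samples):
    """Root-mean-square amplitude of a list of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def tone(freq_hz, sample_rate_hz, seconds=1.0):
    """Generate a unit-amplitude sine tone."""
    n = int(sample_rate_hz * seconds)
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate_hz) for i in range(n)]


SAMPLE_RATE_HZ = 16_000  # hypothetical recording rate
CUTOFF_HZ = 400.0        # hypothetical cutoff, not taken from the paper

rumble = tone(50.0, SAMPLE_RATE_HZ)         # low-frequency test tone
speech_band = tone(3000.0, SAMPLE_RATE_HZ)  # tone in a typical speech band

# Ratio of output RMS to input RMS for each tone.
rumble_gain = rms(low_pass(rumble, SAMPLE_RATE_HZ, CUTOFF_HZ)) / rms(rumble)
speech_gain = rms(low_pass(speech_band, SAMPLE_RATE_HZ, CUTOFF_HZ)) / rms(speech_band)

print(f"50 Hz gain: {rumble_gain:.2f}, 3 kHz gain: {speech_gain:.2f}")
```

A single-pole filter rolls off gently; a real pipeline would likely use a steeper design, but the gentle version keeps the sketch dependency-free.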

* 26 pages, 12 figures; will appear in the March issue of the IMWUT 2022 journal


One-shot Talking Face Generation from Single-speaker Audio-Visual Correlation Learning

Dec 06, 2021
Suzhen Wang, Lincheng Li, Yu Ding, Xin Yu

Audio-driven one-shot talking face generation methods are usually trained on video resources of various persons. However, their created videos often suffer from unnatural mouth shapes and asynchronous lips because those methods struggle to learn a consistent speech style from different speakers. We observe that it would be much easier to learn a consistent speech style from a specific speaker, which leads to authentic mouth movements. Hence, we propose a novel one-shot talking face generation framework by exploring consistent correlations between audio and visual motions from a specific speaker and then transferring audio-driven motion fields to a reference image. Specifically, we develop an Audio-Visual Correlation Transformer (AVCT) that aims to infer talking motions, represented by keypoint-based dense motion fields, from an input audio. In particular, considering that audio may come from different identities in deployment, we incorporate phonemes to represent audio signals. In this manner, our AVCT can inherently generalize to audio spoken by other identities. Moreover, as face keypoints are used to represent speakers, AVCT is agnostic to the appearance of the training speaker, and thus allows us to manipulate face images of different identities readily. Considering that different face shapes lead to different motions, a motion field transfer module is exploited to reduce the audio-driven dense motion field gap between the training identity and the one-shot reference. Once we obtain the dense motion field of the reference image, we employ an image renderer to generate its talking face videos from an audio clip. Thanks to our learned consistent speaking style, our method generates authentic mouth shapes and vivid movements. Extensive experiments demonstrate that our synthesized videos outperform the state-of-the-art in terms of visual quality and lip-sync.

* AAAI 2022 


Advancing the State of the Art in Open Domain Dialog Systems through the Alexa Prize

Dec 27, 2018
Chandra Khatri, Behnam Hedayatnia, Anu Venkatesh, Jeff Nunn, Yi Pan, Qing Liu, Han Song, Anna Gottardi, Sanjeev Kwatra, Sanju Pancholi, Ming Cheng, Qinglang Chen, Lauren Stubel, Karthik Gopalakrishnan, Kate Bland, Raefer Gabriel, Arindam Mandal, Dilek Hakkani-Tur, Gene Hwang, Nate Michel, Eric King, Rohit Prasad

Building open domain conversational systems that allow users to have engaging conversations on topics of their choice is a challenging task. Alexa Prize was launched in 2016 to tackle the problem of achieving natural, sustained, coherent and engaging open-domain dialogs. In the second iteration of the competition in 2018, university teams advanced the state of the art by using context in dialog models, leveraging knowledge graphs for language understanding, handling complex utterances, building statistical and hierarchical dialog managers, and leveraging model-driven signals from user responses. The 2018 competition also provided competitors with a suite of tools and models, including the CoBot (conversational bot) toolkit, topic and dialog act detection models, conversation evaluators, and a sensitive content detection model, so that the competing teams could focus on building knowledge-rich, coherent and engaging multi-turn dialog systems. This paper outlines the advances developed by the university teams as well as the Alexa Prize team to achieve the common goal of advancing the science of Conversational AI. We address several key open-ended problems such as conversational speech recognition, open domain natural language understanding, commonsense reasoning, statistical dialog management, and dialog evaluation. These collaborative efforts have improved the experience of Alexa users to an average rating of 3.61, a median duration of 2 mins 18 seconds, and an average of 14.6 turns, increases of 14%, 92%, and 54% respectively since the launch of the 2018 competition. For conversational speech recognition, we have improved our relative Word Error Rate by 55% and our relative Entity Error Rate by 34% since the launch of the Alexa Prize. Socialbots improved in quality significantly more rapidly in 2018, in part due to the release of the CoBot toolkit.

* 2018 Alexa Prize Proceedings 


Multi Modal Adaptive Normalization for Audio to Video Generation

Dec 14, 2020
Neeraj Kumar, Srishti Goel, Ankur Narang, Brejesh Lall

Speech-driven facial video generation has been a complex problem due to its multi-modal aspects, namely the audio and video domains. The audio comprises many underlying features such as expression, pitch, loudness, and prosody (speaking style), and facial video has a lot of variability in terms of head movement, eye blinks, lip synchronization and movements of various facial action units, along with temporal smoothness. Synthesizing highly expressive facial videos from an audio input and a static image is still a challenging task for generative adversarial networks. In this paper, we propose a multi-modal adaptive normalization (MAN) based architecture to synthesize a talking person video of arbitrary length using as input an audio signal and a single image of a person. The architecture uses multi-modal adaptive normalization, a keypoint heatmap predictor, an optical flow predictor and class activation map [58] based layers to learn movements of expressive facial components, and hence generates a highly expressive talking-head video of the given person. The multi-modal adaptive normalization uses various features of audio and video, such as the Mel spectrogram, pitch, and energy from audio signals, together with the predicted keypoint heatmap/optical flow and a single image, to learn the respective affine parameters to generate highly expressive video. Experimental evaluation demonstrates superior performance of the proposed method as compared to Realistic Speech-Driven Facial Animation with GANs (RSDGAN) [53], Speech2Vid [10], and other approaches, on multiple quantitative metrics including SSIM (structural similarity index), PSNR (peak signal to noise ratio), CPBD (image sharpness), WER (word error rate), blinks/sec and LMD (landmark distance). Further, qualitative evaluation and online Turing tests demonstrate the efficacy of our approach.
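
Of the metrics listed above, PSNR has the simplest closed form: 10 · log10(MAX² / MSE), where MAX is the largest possible pixel value and MSE is the mean squared error between the reference and generated frames. The sketch below is a generic illustration of that standard formula, not the evaluation code used in the paper; representing a frame as a flat list of 8-bit intensities and the function name `psnr` are assumptions made for brevity.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized frames.

    Frames are flat sequences of pixel intensities (an assumption for this sketch).
    """
    if len(reference) != len(generated) or len(reference) == 0:
        raise ValueError("frames must be non-empty and the same size")
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames: no noise at all
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 2x2 frames flattened to lists; every pixel differs by 1, so MSE = 1.
reference_frame = [10, 20, 30, 40]
generated_frame = [11, 21, 31, 41]
print(f"PSNR: {psnr(reference_frame, generated_frame):.2f} dB")
```

Higher values indicate that the generated frame is closer to the reference; identical frames give infinity.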


Racism is a Virus: Anti-Asian Hate and Counterhate in Social Media during the COVID-19 Crisis

May 25, 2020
Caleb Ziems, Bing He, Sandeep Soni, Srijan Kumar

The spread of COVID-19 has sparked racism, hate, and xenophobia in social media targeted at Chinese and broader Asian communities. However, little is known about how racial hate spreads during a pandemic and the role of counterhate speech in mitigating the spread. Here we study the evolution and spread of anti-Asian hate speech through the lens of Twitter. We create COVID-HATE, the largest dataset of anti-Asian hate and counterhate spanning three months, containing over 30 million tweets, and a social network with over 87 million nodes. By creating a novel hand-labeled dataset of 2,400 tweets, we train a text classifier to identify hate and counterhate tweets that achieves an average AUROC of 0.852. We identify 891,204 hate and 200,198 counterhate tweets in COVID-HATE. Using this data to conduct longitudinal analysis, we find that while hateful users are less engaged in the COVID-19 discussions prior to their first anti-Asian tweet, they become more vocal and engaged afterwards compared to counterhate users. We find that bots comprise 10.4% of hateful users and are more vocal and hateful compared to non-bot users. Comparing bot accounts, we show that hateful bots are more successful in attracting followers compared to counterhate bots. Analysis of the social network reveals that hateful and counterhate users interact and engage extensively with one another, instead of living in isolated polarized communities. Furthermore, we find that hate is contagious and nodes are highly likely to become hateful after being exposed to hateful content. Importantly, our analysis reveals that counterhate messages can discourage users from turning hateful in the first place. Overall, this work presents a comprehensive overview of anti-Asian hate and counterhate content during a pandemic. The COVID-HATE dataset is available at

* The COVID-HATE dataset, classifier, and demo are available at 


Neural Architecture Search For LF-MMI Trained Time Delay Neural Networks

Jan 08, 2022
Shoukang Hu, Xurong Xie, Mingyu Cui, Jiajun Deng, Shansong Liu, Jianwei Yu, Mengzhe Geng, Xunying Liu, Helen Meng

State-of-the-art automatic speech recognition (ASR) system development is data and computation intensive. The optimal design of deep neural networks (DNNs) for these systems often requires expert knowledge and empirical evaluation. In this paper, a range of neural architecture search (NAS) techniques are used to automatically learn two types of hyper-parameters of factored time delay neural networks (TDNN-Fs): i) the left and right splicing context offsets; and ii) the dimensionality of the bottleneck linear projection at each hidden layer. These techniques include the differentiable neural architecture search (DARTS) method integrating architecture learning with lattice-free MMI training; Gumbel-Softmax and pipelined DARTS methods reducing the confusion over candidate architectures and improving the generalization of architecture selection; and Penalized DARTS incorporating resource constraints to balance the trade-off between performance and system complexity. Parameter sharing among TDNN-F architectures allows an efficient search over up to 7^28 different systems. Statistically significant word error rate (WER) reductions of up to 1.2% absolute and a relative model size reduction of 31% were obtained over a state-of-the-art 300-hour Switchboard corpus trained baseline LF-MMI TDNN-F system featuring speed perturbation, i-Vector and learning hidden unit contribution (LHUC) based speaker adaptation as well as RNNLM rescoring. Performance contrasts on the same task against recent end-to-end systems reported in the literature suggest the best NAS auto-configured system achieves state-of-the-art WERs of 9.9% and 11.1% on the NIST Hub5'00 and Rt03s test sets respectively, with up to 96% model size reduction. Further analysis using Bayesian learning shows that the proposed NAS approaches can effectively minimize the structural redundancy in the TDNN-F systems and reduce their model parameter uncertainty. Consistent performance improvements were also obtained on a UASpeech dysarthric speech recognition task.
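
The Gumbel-Softmax relaxation named in the abstract can be sketched generically: Gumbel noise is added to the architecture logits, and a temperature-scaled softmax turns the result into a near-one-hot weighting over candidates. The code below is a plain illustration of that standard technique, not the paper's training code; the seven candidate offsets, the uniform logits, the temperatures, and the function name are all hypothetical.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Draw one relaxed categorical sample over candidate choices."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)      # guard against log(0)
        gumbel = -math.log(-math.log(u))  # standard Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / temperature)
    peak = max(noisy)                     # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidate set for one choice point (e.g. a context offset).
CANDIDATE_OFFSETS = [0, 1, 2, 3, 4, 5, 6]
logits = [0.0] * len(CANDIDATE_OFFSETS)   # untrained, uniform preferences

# Same seed for both calls, so only the temperature differs.
soft = gumbel_softmax(logits, temperature=5.0, rng=random.Random(0))
sharp = gumbel_softmax(logits, temperature=0.1, rng=random.Random(0))

chosen = CANDIDATE_OFFSETS[sharp.index(max(sharp))]
print(f"max weight at T=5.0: {max(soft):.3f}, at T=0.1: {max(sharp):.3f}, pick: {chosen}")
```

Lowering the temperature pushes the weights toward a single candidate, which is the sense in which this kind of sampling reduces confusion between competing architectures.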

* Submitted to TASLP. arXiv admin note: text overlap with arXiv:2007.08818 


Pagsusuri ng RNN-based Transfer Learning Technique sa Low-Resource Language

Oct 13, 2020
Dan John Velasco

Low-resource languages such as Filipino suffer from data scarcity, which makes it challenging to develop NLP applications for the Filipino language. Transfer Learning (TL) techniques alleviate this problem in low-resource settings. In recent years, transformer-based models have proven effective in low-resource tasks but face accessibility challenges due to their high compute and memory requirements. There is a need for a cheaper but effective alternative. This paper has three contributions. First, it releases a pre-trained AWD-LSTM language model for the Filipino language to serve as a foundation for building NLP applications in Filipino. Second, it benchmarks AWD-LSTM on the Hate Speech classification task and shows that it performs on par with transformer-based models. Third, it analyzes the degradation rate of AWD-LSTM on smaller data using a degradation test and compares it with transformer-based models.

* 5 pages, 3 tables, 1 figure. In Filipino


Multiple topic identification in human/human conversations

Dec 29, 2018
X. Bost, G. Senay, M. El-Bèze, R. De Mori

The paper deals with the automatic analysis of real-life telephone conversations between an agent and a customer of a customer care service (CCS). The application domain is the public transportation system in Paris, and the purpose is to collect statistics about customer problems in order to monitor the service and set priorities for interventions to improve user satisfaction. Of primary importance for the analysis is the detection of themes that are the object of customer problems. Themes are defined in the application requirements and are part of the application ontology that is implicit in the CCS documentation. Due to the variety of the customer population, the structure of conversations with an agent is unpredictable. A conversation may be about one or more themes. Theme mentions can be interleaved with mentions of facts that are irrelevant for the application purpose. Furthermore, in certain conversations theme mentions are localized in specific conversation segments, while in other conversations mentions cannot be localized. As a consequence, approaches to feature extraction with and without mention localization are considered. Application domain relevant themes identified by an automatic procedure are expressed by specific sentences whose words are hypothesized by an automatic speech recognition (ASR) system. The ASR system is error prone. The word error rates can be very high for many reasons. Among them it is worth mentioning unpredictable background noise, speaker accent, and various types of speech disfluencies. As the application task requires the composition of proportions of theme mentions, a sequential decision strategy is introduced in this paper for performing a survey of the large number of conversations made available in a given time period. The strategy has to sample the conversations to form a survey containing enough data analyzed with high accuracy so that proportions can be estimated with sufficient accuracy.
Due to the unpredictable type of theme mentions, it is appropriate to consider methods for theme hypothesization based on global as well as local feature extraction. Two systems based on each type of feature extraction will be considered by the strategy. One of the four methods is novel. It is based on a new definition of density of theme mentions and on the localization of high density zones whose boundaries do not need to be precisely detected. The sequential decision strategy starts by grouping theme hypotheses into sets of different expected accuracy and coverage levels. For those sets for which accuracy can be improved with a consequent increase of coverage, a new system with new features is introduced. Its execution is triggered only when specific preconditions are met on the hypotheses generated by the basic four systems. Experimental results are provided on a corpus collected in the call center of the Paris transportation system known as RATP. The results show that surveys with high accuracy and coverage can be composed with the proposed strategy and systems. This makes it possible to apply a previously published proportion estimation approach that takes into account hypothesization errors.

* Computer Speech & Language, 2015, 34 (1), pp.18-42


Multilayered Model of Speech

Jan 08, 2018
Andrey Chistyakov

Human speech is the most important part of General Artificial Intelligence and the subject of much research. The hypothesis proposed in this article explains the difficulties that modern science faces in the field of human brain simulation. The hypothesis is based on the author's conviction that the brain of any given person has a different ability to process and store information. Therefore, the approaches that are currently used to create General Artificial Intelligence have to be altered.
