"speech": models, code, and papers

Lattice-based Improvements for Voice Triggering Using Graph Neural Networks

Jan 25, 2020
Pranay Dighe, Saurabh Adya, Nuoyu Li, Srikanth Vishnubhotla, Devang Naik, Adithya Sagar, Ying Ma, Stephen Pulman, Jason Williams

Voice-triggered smart assistants often rely on detection of a trigger-phrase before they start listening for the user request. Mitigation of false triggers is an important aspect of building a privacy-centric non-intrusive smart assistant. In this paper, we address the task of false trigger mitigation (FTM) using a novel approach based on analyzing automatic speech recognition (ASR) lattices using graph neural networks (GNN). The proposed approach uses the fact that the decoding lattice of falsely triggered audio exhibits uncertainties, in the form of many alternative paths and unexpected words on the lattice arcs, compared to the lattice of correctly triggered audio. A pure trigger-phrase detector model doesn't fully utilize the intent of the user speech, whereas by using the complete decoding lattice of user audio, we can effectively mitigate speech not intended for the smart assistant. We deploy two variants of GNNs in this paper, based on 1) graph convolution layers and 2) a self-attention mechanism, respectively. Our experiments demonstrate that GNNs are highly accurate in the FTM task, mitigating ~87% of false triggers at 99% true positive rate (TPR). Furthermore, the proposed models are fast to train and efficient in parameter requirements.

Sequence Discriminative Training for Deep Learning based Acoustic Keyword Spotting

Aug 02, 2018
Zhehuai Chen, Yanmin Qian, Kai Yu

Speech recognition is a sequence prediction problem. Besides employing various deep learning approaches for frame-level classification, sequence-level discriminative training has proved indispensable for achieving state-of-the-art performance in large vocabulary continuous speech recognition (LVCSR). However, keyword spotting (KWS), as one of the most common speech recognition tasks, almost only benefits from frame-level deep learning due to the difficulty of getting competing sequence hypotheses. The few studies on sequence discriminative training for KWS are limited to fixed vocabulary or LVCSR based methods and have not been compared to the state-of-the-art deep learning based KWS approaches. In this paper, a sequence discriminative training framework is proposed for both fixed vocabulary and unrestricted acoustic KWS. Sequence discriminative training for both sequence-level generative and discriminative models is systematically investigated. By introducing word-independent phone lattices or non-keyword blank symbols to construct competing hypotheses, feasible and efficient sequence discriminative training approaches are proposed for acoustic KWS. Experiments showed that the proposed approaches obtained consistent and significant improvement in both fixed vocabulary and unrestricted KWS tasks, compared to previous frame-level deep learning based acoustic KWS methods.

* Speech Communication, vol. 102, 100-111, 2018 
* accepted by Speech Communication, 08/02/2018 

The VoicePrivacy 2022 Challenge Evaluation Plan

Mar 27, 2022
Natalia Tomashenko, Xin Wang, Xiaoxiao Miao, Hubert Nourtel, Pierre Champion, Massimiliano Todisco, Emmanuel Vincent, Nicholas Evans, Junichi Yamagishi, Jean-François Bonastre

For new participants - Executive summary: (1) The task is to develop a voice anonymization system for speech data which conceals the speaker's voice identity while protecting linguistic content, paralinguistic attributes, intelligibility and naturalness. (2) Training, development and evaluation datasets are provided in addition to 3 different baseline anonymization systems, evaluation scripts, and metrics. Participants apply their developed anonymization systems, run evaluation scripts and submit objective evaluation results and anonymized speech data to the organizers. (3) Results will be presented at a workshop held in conjunction with INTERSPEECH 2022 to which all participants are invited to present their challenge systems and to submit additional workshop papers. For readers familiar with the VoicePrivacy Challenge - Changes w.r.t. 2020: (1) A stronger, semi-informed attack model in the form of an automatic speaker verification (ASV) system trained on anonymized (per-utterance) speech data. (2) Complementary metrics comprising the equal error rate (EER) as a privacy metric, the word error rate (WER) as a primary utility metric, and the pitch correlation and gain of voice distinctiveness as secondary utility metrics. (3) A new ranking policy based upon a set of minimum target privacy requirements.
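As a rough illustration of the equal error rate (EER) named above as the privacy metric, the sketch below sweeps a decision threshold over ASV scores and reports the point where the false acceptance and false rejection rates are closest. The function name, the score lists, and the simple sweep are assumptions made for this example; this is not the challenge's official evaluation script.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Approximate the EER with a threshold sweep over all observed scores.

    target_scores: ASV scores for same-speaker trials (hypothetical input).
    impostor_scores: ASV scores for different-speaker trials (hypothetical input).
    A trial is accepted when its score is >= the threshold.
    """
    best_gap, eer = None, None
    for threshold in sorted(set(target_scores) | set(impostor_scores)):
        # False acceptance rate: impostor trials that pass the threshold.
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        # False rejection rate: target trials that fall below the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores, invented for illustration only.
print(equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.5, 0.3, 0.2]))
```

When EER is used as a privacy metric, a higher value indicates that the attacker's verification system has more trouble telling speakers apart after anonymization.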

Closing the Gap between Single-User and Multi-User VoiceFilter-Lite

Feb 24, 2022
Rajeev Rikhye, Quan Wang, Qiao Liang, Yanzhang He, Ian McGraw

VoiceFilter-Lite is a speaker-conditioned voice separation model that plays a crucial role in improving speech recognition and speaker verification by suppressing overlapping speech from non-target speakers. However, one limitation of VoiceFilter-Lite, and other speaker-conditioned speech models in general, is that these models are usually limited to a single target speaker. This is undesirable as most smart home devices now support multiple enrolled users. In order to extend the benefits of personalization to multiple users, we previously developed an attention-based speaker selection mechanism and applied it to VoiceFilter-Lite. However, the original multi-user VoiceFilter-Lite model suffers from significant performance degradation compared with single-user models. In this paper, we devised a series of experiments to improve the multi-user VoiceFilter-Lite model. By incorporating a dual learning rate schedule and by using feature-wise linear modulation (FiLM) to condition the model with the attended speaker embedding, we successfully closed the performance gap between multi-user and single-user VoiceFilter-Lite models on single-speaker evaluations. At the same time, the new model can also be easily extended to support any number of users, and significantly outperforms our previously published model on multi-speaker evaluations.
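Feature-wise linear modulation (FiLM), mentioned above, conditions a feature vector by scaling and shifting each feature with values computed from a conditioning vector. The sketch below shows that general idea in plain Python; the function names, the weight matrices, and the single affine projections are assumptions for illustration and do not reproduce the VoiceFilter-Lite architecture.

```python
def linear(weights, bias, vec):
    """Affine map: multiply the weight matrix (list of rows) by vec and add bias."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def film(features, embedding, w_gamma, b_gamma, w_beta, b_beta):
    """Apply FiLM: output[i] = gamma[i] * features[i] + beta[i].

    gamma and beta are both predicted from the conditioning embedding
    (here a hypothetical attended speaker embedding).
    """
    gamma = linear(w_gamma, b_gamma, embedding)  # per-feature scale
    beta = linear(w_beta, b_beta, embedding)     # per-feature shift
    return [g * x + b for g, x, b in zip(gamma, features, beta)]


# Toy values, invented for illustration only.
features = [1.0, 2.0]
embedding = [1.0, 0.0]
out = film(features, embedding,
           w_gamma=[[2.0, 0.0], [0.0, 1.0]], b_gamma=[0.0, 1.0],
           w_beta=[[0.0, 0.0], [1.0, 0.0]], b_beta=[0.5, 0.0])
print(out)  # [2.5, 3.0]
```

Because the embedding only enters through the scale and shift, the same feature network can in principle be reused for different enrolled speakers by swapping the conditioning vector.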

Teaching keyword spotters to spot new keywords with limited examples

Jun 04, 2021
Abhijeet Awasthi, Kevin Kilgour, Hassan Rom

Learning to recognize new keywords with just a few examples is essential for personalizing keyword spotting (KWS) models to a user's choice of keywords. However, modern KWS models are typically trained on large datasets and restricted to a small vocabulary of keywords, limiting their transferability to a broad range of unseen keywords. Towards easily customizable KWS models, we present KeySEM (Keyword Speech EMbedding), a speech embedding model pre-trained on the task of recognizing a large number of keywords. Speech representations offered by KeySEM are highly effective for learning new keywords from a limited number of examples. Comparisons with a diverse range of related work across several datasets show that our method achieves consistently superior performance with fewer training examples. Although KeySEM was pre-trained only on English utterances, the performance gains also extend to datasets from four other languages, indicating that KeySEM learns useful representations well aligned with the task of keyword spotting. Finally, we demonstrate KeySEM's ability to learn new keywords sequentially without needing to re-train on previously learned keywords. Our experimental observations suggest that KeySEM is well suited to on-device environments where post-deployment learning and ease of customization are often desirable.

Multi-Scale Attention Neural Network for Acoustic Echo Cancellation

May 31, 2021
Lu Ma, Song Yang, Yaguang Gong, Zhongqin Wu

Acoustic Echo Cancellation (AEC) plays a key role in speech interaction by suppressing the echo received at the microphone, introduced by acoustic reverberations from loudspeakers. Since the performance of a linear adaptive filter (AF) degrades severely due to nonlinear distortions, background noise, and microphone clipping in real scenarios, deep learning has been employed for AEC for its good nonlinear modelling ability. In this paper, we constructed an end-to-end multi-scale attention neural network for AEC. Temporal convolution is first used to transform the waveform into a spectrogram. The spectrograms of the far-end reference and the near-end mixture are concatenated and fed to a temporal convolution network (TCN) with stacked dilated convolution layers. An attention mechanism is applied among the representations from different layers to adaptively extract relevant features by referring to the previous hidden state in the encoder long short-term memory (LSTM) unit. The representations are combined by weighted averaging and fed to the encoder LSTM for the near-end speech estimation. Experiments show the superiority of our method in terms of the echo return loss enhancement (ERLE) for single-talk periods and the perceptual evaluation of speech quality (PESQ) score for double-talk periods in background noise and nonlinear distortion scenarios.
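The echo return loss enhancement (ERLE) used above is commonly defined as the ratio, in decibels, of the microphone signal power to the residual power after echo cancellation, measured while only the far end is talking. The sketch below computes it over whole signals; the function name and the toy samples are assumptions for this example, and the paper's exact evaluation setup (framing, averaging) is not reproduced here.

```python
import math


def erle_db(mic_signal, residual_signal):
    """ERLE in dB: 10 * log10(mean(mic^2) / mean(residual^2)).

    mic_signal: microphone samples containing echo (single-talk period).
    residual_signal: samples after echo cancellation.
    Raises ZeroDivisionError if the residual is exactly silent.
    """
    mic_power = sum(x * x for x in mic_signal) / len(mic_signal)
    residual_power = sum(x * x for x in residual_signal) / len(residual_signal)
    return 10.0 * math.log10(mic_power / residual_power)


# Toy signals, invented for illustration: the residual has 1/10 the amplitude,
# so the power drops by a factor of 100, which is about 20 dB.
mic = [1.0, -1.0, 1.0, -1.0]
residual = [0.1, -0.1, 0.1, -0.1]
print(round(erle_db(mic, residual), 6))
```

A larger ERLE means more of the echo was removed; it says nothing about near-end speech quality, which is why a separate measure such as PESQ is reported for double-talk periods.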

* 5 pages, 3 figures, 4 tables. arXiv admin note: substantial text overlap with arXiv:2105.14666 

Best-first Model Merging for Hidden Markov Model Induction

May 10, 1994
Andreas Stolcke, Stephen M. Omohundro

This report describes a new technique for inducing the structure of Hidden Markov Models from data which is based on the general 'model merging' strategy (Omohundro 1992). The process begins with a maximum likelihood HMM that directly encodes the training data. Successively more general models are produced by merging HMM states. A Bayesian posterior probability criterion is used to determine which states to merge and when to stop generalizing. The procedure may be considered a heuristic search for the HMM structure with the highest posterior probability. We discuss a variety of possible priors for HMMs, as well as a number of approximations which improve the computational efficiency of the algorithm. We studied three applications to evaluate the procedure. The first compares the merging algorithm with the standard Baum-Welch approach in inducing simple finite-state languages from small, positive-only training samples. We found that the merging procedure is more robust and accurate, particularly with a small amount of training data. The second application uses labelled speech data from the TIMIT database to build compact, multiple-pronunciation word models that can be used in speech recognition. Finally, we describe how the algorithm was incorporated in an operational speech understanding system, where it is combined with neural network acoustic likelihood estimators to improve performance over single-pronunciation word models.

* 63 pages 

Hypothesis Stitcher for End-to-End Speaker-attributed ASR on Long-form Multi-talker Recordings

Jan 06, 2021
Xuankai Chang, Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Takuya Yoshioka

An end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR) model was proposed recently to jointly perform speaker counting, speech recognition and speaker identification. The model achieved a low speaker-attributed word error rate (SA-WER) for monaural overlapped speech comprising an unknown number of speakers. However, the E2E modeling approach is susceptible to the mismatch between the training and testing conditions. It has yet to be investigated whether the E2E SA-ASR model works well for recordings that are much longer than samples seen during training. In this work, we first apply a known decoding technique that was developed to perform single-speaker ASR for long-form audio to our E2E SA-ASR task. Then, we propose a novel method using a sequence-to-sequence model, called hypothesis stitcher. The model takes multiple hypotheses obtained from short audio segments that are extracted from the original long-form input, and it then outputs a fused single hypothesis. We propose several architectural variations of the hypothesis stitcher model and compare them with the conventional decoding methods. Experiments using LibriSpeech and LibriCSS corpora show that the proposed method significantly improves SA-WER especially for long-form multi-talker recordings.

* Submitted to ICASSP 2021 

Whither the Priors for (Vocal) Interactivity?

Mar 16, 2022
Roger K. Moore

Voice-based communication is often cited as one of the most 'natural' ways in which humans and robots might interact, and the recent availability of accurate automatic speech recognition and intelligible speech synthesis has enabled researchers to integrate advanced off-the-shelf spoken language technology components into their robot platforms. Despite this, the resulting interactions are anything but 'natural'. It transpires that simply giving a robot a voice doesn't mean that a user will know how (or when) to talk to it, and the resulting 'conversations' tend to be stilted, one-sided and short. On the surface, these difficulties might appear to be fairly trivial consequences of users' unfamiliarity with robots (and vice versa), and that any problems would be mitigated by long-term use by the human, coupled with 'deep learning' by the robot. However, it is argued here that such communication failures are indicative of a deeper malaise: a fundamental lack of basic principles (priors) underpinning not only speech-based interaction in particular, but (vocal) interactivity in general. This is evidenced not only by the fact that contemporary spoken language systems already require training data sets that are orders-of-magnitude greater than that experienced by a young child, but also by the lack of design principles for creating effective communicative human-robot interaction. This short position paper identifies some of the key areas where theoretical insights might help overcome these shortfalls.

* Accepted for the THEORIA Workshop "Towards a Common Understanding and Vision for Theory-Grounded Human-Robot Interaction" at HRI-2022, 7 March 2022 

Estimating Presentation Competence using Multimodal Nonverbal Behavioral Cues

May 06, 2021
Ömer Sümer, Cigdem Beyan, Fabian Ruth, Olaf Kramer, Ulrich Trautwein, Enkelejda Kasneci

Public speaking and presentation competence play an essential role in many areas of social interaction in our educational, professional, and everyday life. Since our intention during a speech can differ from what is actually understood by the audience, the ability to appropriately convey our message requires a complex set of skills. Presentation competence is cultivated in the early school years and continuously developed over time. One approach that can promote efficient development of presentation competence is the automated analysis of human behavior during a speech based on visual and audio features and machine learning. Furthermore, this analysis can be used to suggest improvements and the development of skills related to presentation competence. In this work, we investigate the contribution of different nonverbal behavioral cues, namely, facial, body pose-based, and audio-related features, to estimate presentation competence. The analyses were performed on videos of 251 students while the automated assessment is based on manual ratings according to the Tübingen Instrument for Presentation Competence (TIP). Our classification results reached the best performance with early fusion in the same dataset evaluation (accuracy of 71.25%) and late fusion of speech, face, and body pose features in the cross dataset evaluation (accuracy of 78.11%). Similarly, regression results performed the best with fusion strategies.
