Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Pai Zhu

LLM-Synth4KWS: Scalable Automatic Generation and Synthesis of Confusable Data for Custom Keyword Spotting

May 29, 2025

Pai Zhu, Quan Wang, Dhruuv Agarwal, Kurt Partridge

Figure 1 for LLM-Synth4KWS: Scalable Automatic Generation and Synthesis of Confusable Data for Custom Keyword Spotting

Figure 2 for LLM-Synth4KWS: Scalable Automatic Generation and Synthesis of Confusable Data for Custom Keyword Spotting

Figure 3 for LLM-Synth4KWS: Scalable Automatic Generation and Synthesis of Confusable Data for Custom Keyword Spotting

Figure 4 for LLM-Synth4KWS: Scalable Automatic Generation and Synthesis of Confusable Data for Custom Keyword Spotting

Abstract:Custom keyword spotting (KWS) allows detecting user-defined spoken keywords from streaming audio. This is achieved by comparing the embeddings from voice enrollments and input audio. State-of-the-art custom KWS models are typically trained contrastively using utterances whose keywords are randomly sampled from training dataset. These KWS models often struggle with confusing keywords, such as "blue" versus "glue". This paper introduces an effective way to augment the training with confusable utterances where keywords are generated and grouped from large language models (LLMs), and speech signals are synthesized with diverse speaking styles from text-to-speech (TTS) engines. To better measure user experience on confusable KWS, we define a new northstar metric using the average area under DET curve from confusable groups (c-AUC). Featuring high scalability and zero labor cost, the proposed method improves AUC by 3.7% and c-AUC by 11.3% on the Speech Commands testing set.

Via

Access Paper or Ask Questions

GraphemeAug: A Systematic Approach to Synthesized Hard Negative Keyword Spotting Examples

May 20, 2025

Harry Zhang, Kurt Partridge, Pai Zhu, Neng Chen, Hyun Jin Park, Dhruuv Agarwal, Quan Wang

Figure 1 for GraphemeAug: A Systematic Approach to Synthesized Hard Negative Keyword Spotting Examples

Figure 2 for GraphemeAug: A Systematic Approach to Synthesized Hard Negative Keyword Spotting Examples

Figure 3 for GraphemeAug: A Systematic Approach to Synthesized Hard Negative Keyword Spotting Examples

Figure 4 for GraphemeAug: A Systematic Approach to Synthesized Hard Negative Keyword Spotting Examples

Abstract:Spoken Keyword Spotting (KWS) is the task of distinguishing between the presence and absence of a keyword in audio. The accuracy of a KWS model hinges on its ability to correctly classify examples close to the keyword and non-keyword boundary. These boundary examples are often scarce in training data, limiting model performance. In this paper, we propose a method to systematically generate adversarial examples close to the decision boundary by making insertion/deletion/substitution edits on the keyword's graphemes. We evaluate this technique on held-out data for a popular keyword and show that the technique improves AUC on a dataset of synthetic hard negatives by 61% while maintaining quality on positives and ambient negative audio data.

* Accepted at Interspeech 2025

Via

Access Paper or Ask Questions

GE2E-KWS: Generalized End-to-End Training and Evaluation for Zero-shot Keyword Spotting

Oct 22, 2024

Pai Zhu, Jacob W. Bartel, Dhruuv Agarwal, Kurt Partridge, Hyun Jin Park, Quan Wang

Figure 1 for GE2E-KWS: Generalized End-to-End Training and Evaluation for Zero-shot Keyword Spotting

Figure 2 for GE2E-KWS: Generalized End-to-End Training and Evaluation for Zero-shot Keyword Spotting

Figure 3 for GE2E-KWS: Generalized End-to-End Training and Evaluation for Zero-shot Keyword Spotting

Figure 4 for GE2E-KWS: Generalized End-to-End Training and Evaluation for Zero-shot Keyword Spotting

Abstract:We propose GE2E-KWS -- a generalized end-to-end training and evaluation framework for customized keyword spotting. Specifically, enrollment utterances are separated and grouped by keywords from the training batch and their embedding centroids are compared to all other test utterance embeddings to compute the loss. This simulates runtime enrollment and verification stages, and improves convergence stability and training speed by optimizing matrix operations compared to SOTA triplet loss approaches. To benchmark different models reliably, we propose an evaluation process that mimics the production environment and compute metrics that directly measure keyword matching accuracy. Trained with GE2E loss, our 419KB quantized conformer model beats a 7.5GB ASR encoder by 23.6% relative AUC, and beats a same size triplet loss model by 60.7% AUC. Our KWS models are natively streamable with low memory footprints, and designed to continuously run on-device with no retraining needed for new keywords (zero-shot).

* 8 pages, 6 figures, 2 tables The paper is accepted in IEEE Spoken Language Technology (SLT) 2024

Via

Access Paper or Ask Questions

Adversarial training of Keyword Spotting to Minimize TTS Data Overfitting

Aug 20, 2024

Hyun Jin Park, Dhruuv Agarwal, Neng Chen, Rentao Sun, Kurt Partridge, Justin Chen, Harry Zhang, Pai Zhu, Jacob Bartel, Kyle Kastner(+3 more)

Figure 1 for Adversarial training of Keyword Spotting to Minimize TTS Data Overfitting

Figure 2 for Adversarial training of Keyword Spotting to Minimize TTS Data Overfitting

Figure 3 for Adversarial training of Keyword Spotting to Minimize TTS Data Overfitting

Figure 4 for Adversarial training of Keyword Spotting to Minimize TTS Data Overfitting

Abstract:The keyword spotting (KWS) problem requires large amounts of real speech training data to achieve high accuracy across diverse populations. Utilizing large amounts of text-to-speech (TTS) synthesized data can reduce the cost and time associated with KWS development. However, TTS data may contain artifacts not present in real speech, which the KWS model can exploit (overfit), leading to degraded accuracy on real speech. To address this issue, we propose applying an adversarial training method to prevent the KWS model from learning TTS-specific features when trained on large amounts of TTS data. Experimental results demonstrate that KWS model accuracy on real speech data can be improved by up to 12% when adversarial loss is used in addition to the original KWS loss. Surprisingly, we also observed that the adversarial setup improves accuracy by up to 8%, even when trained solely on TTS and real negative speech data, without any real positive examples.

* to be published in a Workshop at Interspeech 2024, Synthetic Data's Transformative Role in Foundational Speech Models

Via

Access Paper or Ask Questions

Utilizing TTS Synthesized Data for Efficient Development of Keyword Spotting Model

Jul 26, 2024

Hyun Jin Park, Dhruuv Agarwal, Neng Chen, Rentao Sun, Kurt Partridge, Justin Chen, Harry Zhang, Pai Zhu, Jacob Bartel, Kyle Kastner(+3 more)

Figure 1 for Utilizing TTS Synthesized Data for Efficient Development of Keyword Spotting Model

Figure 2 for Utilizing TTS Synthesized Data for Efficient Development of Keyword Spotting Model

Figure 3 for Utilizing TTS Synthesized Data for Efficient Development of Keyword Spotting Model

Figure 4 for Utilizing TTS Synthesized Data for Efficient Development of Keyword Spotting Model

Abstract:This paper explores the use of TTS synthesized training data for KWS (keyword spotting) task while minimizing development cost and time. Keyword spotting models require a huge amount of training data to be accurate, and obtaining such training data can be costly. In the current state of the art, TTS models can generate large amounts of natural-sounding data, which can help reducing cost and time for KWS model development. Still, TTS generated data can be lacking diversity compared to real data. To pursue maximizing KWS model accuracy under the constraint of limited resources and current TTS capability, we explored various strategies to mix TTS data and real human speech data, with a focus on minimizing real data use and maximizing diversity of TTS output. Our experimental results indicate that relatively small amounts of real audio data with speaker diversity (100 speakers, 2k utterances) and large amounts of TTS synthesized data can achieve reasonably high accuracy (within 3x error rate of baseline), compared to the baseline (trained with 3.8M real positive utterances).

* to be published in a Workshop at Interspeech 2024, Synthetic Data's Transformative Role in Foundational Speech Models

Via

Access Paper or Ask Questions

Synth4Kws: Synthesized Speech for User Defined Keyword Spotting in Low Resource Environments

Jul 23, 2024

Pai Zhu, Dhruuv Agarwal, Jacob W. Bartel, Kurt Partridge, Hyun Jin Park, Quan Wang

Figure 1 for Synth4Kws: Synthesized Speech for User Defined Keyword Spotting in Low Resource Environments

Figure 2 for Synth4Kws: Synthesized Speech for User Defined Keyword Spotting in Low Resource Environments

Figure 3 for Synth4Kws: Synthesized Speech for User Defined Keyword Spotting in Low Resource Environments

Figure 4 for Synth4Kws: Synthesized Speech for User Defined Keyword Spotting in Low Resource Environments

Abstract:One of the challenges in developing a high quality custom keyword spotting (KWS) model is the lengthy and expensive process of collecting training data covering a wide range of languages, phrases and speaking styles. We introduce Synth4Kws - a framework to leverage Text to Speech (TTS) synthesized data for custom KWS in different resource settings. With no real data, we found increasing TTS phrase diversity and utterance sampling monotonically improves model performance, as evaluated by EER and AUC metrics over 11k utterances of the speech command dataset. In low resource settings, with 50k real utterances as a baseline, we found using optimal amounts of TTS data can improve EER by 30.1% and AUC by 46.7%. Furthermore, we mix TTS data with varying amounts of real data and interpolate the real data needed to achieve various quality targets. Our experiments are based on English and single word utterances but the findings generalize to i18n languages and other keyword types.

* 5 pages, 5 figures, 2 tables The paper is accepted in Interspeech SynData4GenAI 2024 Workshop - https://syndata4genai.org/#call-for-papers

Via

Access Paper or Ask Questions

Personalizing Keyword Spotting with Speaker Information

Nov 06, 2023

Beltrán Labrador, Pai Zhu, Guanlong Zhao, Angelo Scorza Scarpati, Quan Wang, Alicia Lozano-Diez, Alex Park, Ignacio López Moreno

Figure 1 for Personalizing Keyword Spotting with Speaker Information

Figure 2 for Personalizing Keyword Spotting with Speaker Information

Figure 3 for Personalizing Keyword Spotting with Speaker Information

Figure 4 for Personalizing Keyword Spotting with Speaker Information

Abstract:Keyword spotting systems often struggle to generalize to a diverse population with various accents and age groups. To address this challenge, we propose a novel approach that integrates speaker information into keyword spotting using Feature-wise Linear Modulation (FiLM), a recent method for learning from multiple sources of information. We explore both Text-Dependent and Text-Independent speaker recognition systems to extract speaker information, and we experiment on extracting this information from both the input audio and pre-enrolled user audio. We evaluate our systems on a diverse dataset and achieve a substantial improvement in keyword detection accuracy, particularly among underrepresented speaker groups. Moreover, our proposed approach only requires a small 1% increase in the number of parameters, with a minimum impact on latency and computational cost, which makes it a practical solution for real-world applications.

Via

Access Paper or Ask Questions

Locale Encoding For Scalable Multilingual Keyword Spotting Models

Feb 25, 2023

Pai Zhu, Hyun Jin Park, Alex Park, Angelo Scorza Scarpati, Ignacio Lopez Moreno

Figure 1 for Locale Encoding For Scalable Multilingual Keyword Spotting Models

Figure 2 for Locale Encoding For Scalable Multilingual Keyword Spotting Models

Figure 3 for Locale Encoding For Scalable Multilingual Keyword Spotting Models

Figure 4 for Locale Encoding For Scalable Multilingual Keyword Spotting Models

Abstract:A Multilingual Keyword Spotting (KWS) system detects spokenkeywords over multiple locales. Conventional monolingual KWSapproaches do not scale well to multilingual scenarios because ofhigh development/maintenance costs and lack of resource sharing.To overcome this limit, we propose two locale-conditioned universalmodels with locale feature concatenation and feature-wise linearmodulation (FiLM). We compare these models with two baselinemethods: locale-specific monolingual KWS, and a single universalmodel trained over all data. Experiments over 10 localized languagedatasets show that locale-conditioned models substantially improveaccuracy over baseline methods across all locales in different noiseconditions.FiLMperformed the best, improving on average FRRby 61% (relative) compared to monolingual KWS models of similarsizes.

* Accepted for ICASSP 2023

Via

Access Paper or Ask Questions

Noisy student-teacher training for robust keyword spotting

Jun 03, 2021

Hyun-Jin Park, Pai Zhu, Ignacio Lopez Moreno, Niranjan Subrahmanya

Figure 1 for Noisy student-teacher training for robust keyword spotting

Figure 2 for Noisy student-teacher training for robust keyword spotting

Figure 3 for Noisy student-teacher training for robust keyword spotting

Figure 4 for Noisy student-teacher training for robust keyword spotting

Abstract:We propose self-training with noisy student-teacher approach for streaming keyword spotting, that can utilize large-scale unlabeled data and aggressive data augmentation. The proposed method applies aggressive data augmentation (spectral augmentation) on the input of both student and teacher and utilize unlabeled data at scale, which significantly boosts the accuracy of student against challenging conditions. Such aggressive augmentation usually degrades model performance when used with supervised training with hard-labeled data. Experiments show that aggressive spec augmentation on baseline supervised training method degrades accuracy, while the proposed self-training with noisy student-teacher training improves accuracy of some difficult-conditioned test sets by as much as 60%.

Via

Access Paper or Ask Questions

Training Keyword Spotting Models on Non-IID Data with Federated Learning

Jun 04, 2020

Andrew Hard, Kurt Partridge, Cameron Nguyen, Niranjan Subrahmanya, Aishanee Shah, Pai Zhu, Ignacio Lopez Moreno, Rajiv Mathews

Figure 1 for Training Keyword Spotting Models on Non-IID Data with Federated Learning

Figure 2 for Training Keyword Spotting Models on Non-IID Data with Federated Learning

Figure 3 for Training Keyword Spotting Models on Non-IID Data with Federated Learning

Figure 4 for Training Keyword Spotting Models on Non-IID Data with Federated Learning

Abstract:We demonstrate that a production-quality keyword-spotting model can be trained on-device using federated learning and achieve comparable false accept and false reject rates to a centrally-trained model. To overcome the algorithmic constraints associated with fitting on-device data (which are inherently non-independent and identically distributed), we conduct thorough empirical studies of optimization algorithms and hyperparameter configurations using large-scale federated simulations. To overcome resource constraints, we replace memory intensive MTR data augmentation with SpecAugment, which reduces the false reject rate by 56%. Finally, to label examples (given the zero visibility into on-device data), we explore teacher-student training.

* Submitted to Interspeech 2020

Via

Access Paper or Ask Questions