Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Thanapat Trachu

Targeted Speaker Poisoning Framework in Zero-Shot Text-to-Speech

Mar 08, 2026

Thanapat Trachu, Thanathai Lertpetchpun, Sai Praneeth Karimireddy, Shrikanth Narayanan

Abstract:Zero-shot Text-to-Speech (TTS) voice cloning poses severe privacy risks, demanding the removal of specific speaker identities from trained TTS models. Conventional machine unlearning is insufficient in this context, as zero-shot TTS can dynamically reconstruct voices from just reference prompts. We formalize this task as Speech Generation Speaker Poisoning (SGSP), in which we modify trained models to prevent the generation of specific identities while preserving utility for other speakers. We evaluate inference-time filtering and parameter-modification baselines across 1, 15, and 100 forgotten speakers. Performance is assessed through the trade-off between utility (WER) and privacy, quantified using AUC and Forget Speaker Similarity (FSSIM). We achieve strong privacy for up to 15 speakers but reveal scalability limits at 100 speakers due to increased identity overlap. Our study thus introduces a novel problem and evaluation framework toward further advances in generative voice privacy.

* Submitted to Interspeech2026

Via

Access Paper or Ask Questions

Quantifying Speaker Embedding Phonological Rule Interactions in Accented Speech Synthesis

Jan 20, 2026

Thanathai Lertpetchpun, Yoonjeong Lee, Thanapat Trachu, Jihwan Lee, Tiantian Feng, Dani Byrd, Shrikanth Narayanan

Abstract:Many spoken languages, including English, exhibit wide variation in dialects and accents, making accent control an important capability for flexible text-to-speech (TTS) models. Current TTS systems typically generate accented speech by conditioning on speaker embeddings associated with specific accents. While effective, this approach offers limited interpretability and controllability, as embeddings also encode traits such as timbre and emotion. In this study, we analyze the interaction between speaker embeddings and linguistically motivated phonological rules in accented speech synthesis. Using American and British English as a case study, we implement rules for flapping, rhoticity, and vowel correspondences. We propose the phoneme shift rate (PSR), a novel metric quantifying how strongly embeddings preserve or override rule-based transformations. Experiments show that combining rules with embeddings yields more authentic accents, while embeddings can attenuate or overwrite rules, revealing entanglement between accent and speaker identity. Our findings highlight rules as a lever for accent control and a framework for evaluating disentanglement in speech generation.

* Accepted to ICASSP2026

Via

Access Paper or Ask Questions

Amplifying Artifacts with Speech Enhancement in Voice Anti-spoofing

Jun 13, 2025

Thanapat Trachu, Thanathai Lertpetchpun, Ekapol Chuangsuwanich

Figure 1 for Amplifying Artifacts with Speech Enhancement in Voice Anti-spoofing

Figure 2 for Amplifying Artifacts with Speech Enhancement in Voice Anti-spoofing

Figure 3 for Amplifying Artifacts with Speech Enhancement in Voice Anti-spoofing

Figure 4 for Amplifying Artifacts with Speech Enhancement in Voice Anti-spoofing

Abstract:Spoofed utterances always contain artifacts introduced by generative models. While several countermeasures have been proposed to detect spoofed utterances, most primarily focus on architectural improvements. In this work, we investigate how artifacts remain hidden in spoofed speech and how to enhance their presence. We propose a model-agnostic pipeline that amplifies artifacts using speech enhancement and various types of noise. Our approach consists of three key steps: noise addition, noise extraction, and noise amplification. First, we introduce noise into the raw speech. Then, we apply speech enhancement to extract the entangled noise and artifacts. Finally, we amplify these extracted features. Moreover, our pipeline is compatible with different speech enhancement models and countermeasure architectures. Our method improves spoof detection performance by up to 44.44\% on ASVspoof2019 and 26.34\% on ASVspoof2021.

* Accepted to Interspeech2025

Via

Access Paper or Ask Questions

Decom-Renorm-Merge: Model Merging on the Right Space Improves Multitasking

May 29, 2025

Yuatyong Chaichana, Thanapat Trachu, Peerat Limkonchotiwat, Konpat Preechakul, Tirasan Khandhawit, Ekapol Chuangsuwanich

Abstract:In the era of large-scale training, model merging has evolved into a tool for creating multitasking models efficiently. It enables the knowledge of models to be fused, without the need for heavy computation as required in traditional multitask learning. Existing merging methods often assume that entries at identical positions in weight matrices serve the same function, enabling straightforward entry-wise comparison and merging. However, this assumption overlooks the complexity of finetuned neural networks, where neurons may develop distinct feature compositions, making direct entry-wise merging problematic. We present Decom-Renorm-Merge (DRM), a simple yet effective approach that leverages Singular Value Decomposition to decompose and coordinate weight matrices into an aligned joint space, where entry-wise merging becomes possible. We showcase the effectiveness of DRM across various settings ranging from smaller encoder-based such as ViT and DeBERTa, encoder-decoder-based such as T5, and larger decoder-based such as Llama3.1-8B. Our experimental results show that DRM outperforms several state-of-the-art merging techniques across full finetuning and low-rank adaptation settings. Moreover, our analysis reveals renormalization as the crucial component for creating a robust and even joint space for merging, significantly contributing to the method's performance.

Via

Access Paper or Ask Questions

Thunder : Unified Regression-Diffusion Speech Enhancement with a Single Reverse Step using Brownian Bridge

Jun 10, 2024

Thanapat Trachu, Chawan Piansaddhayanon, Ekapol Chuangsuwanich

Figure 1 for Thunder : Unified Regression-Diffusion Speech Enhancement with a Single Reverse Step using Brownian Bridge

Figure 2 for Thunder : Unified Regression-Diffusion Speech Enhancement with a Single Reverse Step using Brownian Bridge

Figure 3 for Thunder : Unified Regression-Diffusion Speech Enhancement with a Single Reverse Step using Brownian Bridge

Figure 4 for Thunder : Unified Regression-Diffusion Speech Enhancement with a Single Reverse Step using Brownian Bridge

Abstract:Diffusion-based speech enhancement has shown promising results, but can suffer from a slower inference time. Initializing the diffusion process with the enhanced audio generated by a regression-based model can be used to reduce the computational steps required. However, these approaches often necessitate a regression model, further increasing the system's complexity. We propose Thunder, a unified regression-diffusion model that utilizes the Brownian bridge process which can allow the model to act in both modes. The regression mode can be accessed by setting the diffusion time step closed to 1. However, the standard score-based diffusion modeling does not perform well in this setup due to gradient instability. To mitigate this problem, we modify the diffusion model to predict the clean speech instead of the score function, achieving competitive performance with a more compact model size and fewer reverse steps.

* 5 pages, 3 figures, 4 tables, This paper will be submitted in the interspeech conference

Via

Access Paper or Ask Questions