In this study, we introduce a novel cross-modal retrieval task involving speaker descriptions and their corresponding audio samples. Utilizing pre-trained speaker and text encoders, we present a simple learning framework based on contrastive learning. Additionally, we explore the impact of incorporating speaker labels into the training process. Our findings establish the effectiveness of linking speaker and text information for the task in both English and Japanese, across diverse data configurations. Additional visual analysis reveals potentially nuanced associations between speaker clustering and retrieval performance.
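A minimal sketch of the kind of symmetric contrastive (InfoNCE) objective such a framework could use to align pre-trained speaker and text embeddings is given below; the batch-matched pairing, temperature value, and embedding shapes are illustrative assumptions rather than the authors' exact setup.

```python
import torch
import torch.nn.functional as F

def speaker_text_contrastive_loss(speaker_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning speaker and text-description embeddings.

    speaker_emb, text_emb: (batch, dim) outputs of the pre-trained speaker and
    text encoders, where row i of each tensor refers to the same speaker.
    """
    # L2-normalize so the dot product becomes a cosine similarity.
    speaker_emb = F.normalize(speaker_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds the matched pairs.
    logits = speaker_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both retrieval directions: audio->text and text->audio.
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2t + loss_t2a)
```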
Speaker anonymization is the task of modifying a speech recording such that the original speaker can no longer be identified. Since the first VoicePrivacy Challenge in 2020, along with the release of a framework, the popularity of this research topic has continually increased. However, comparing and combining different anonymization approaches remains challenging due to the complexity of evaluation and the absence of user-friendly research frameworks. We therefore propose an efficient speaker anonymization and evaluation framework based on a modular and easily extendable structure, implemented almost fully in Python. The framework facilitates the orchestration of several anonymization approaches in parallel and allows for interfacing between different techniques. Furthermore, we propose modifications to common evaluation methods which make the evaluation more powerful and reduce its computation time by 65 to 95\%, depending on the metric. Our code is fully open source.
The success of deep learning in speaker recognition relies heavily on the use of large datasets. However, the data-hungry nature of deep learning methods has already been questioned on account of the ethical, privacy, and legal concerns that arise when using large-scale datasets of natural speech collected from real human speakers. For example, the widely-used VoxCeleb2 dataset for speaker recognition is no longer accessible from the official website. To mitigate these concerns, this work presents an initiative to generate a privacy-friendly synthetic VoxCeleb2 dataset that ensures the quality of the generated speech in terms of privacy, utility, and fairness. We also discuss the challenges of using synthetic data for the downstream task of speaker verification.
Speaker anonymization aims to conceal a speaker's identity while preserving content information in speech. Current mainstream neural-network speaker anonymization systems disentangle speech into prosody-related, content, and speaker representations. The speaker representation is then anonymized by a selection-based speaker anonymizer that uses a mean vector over a set of randomly selected speaker vectors from an external pool of English speakers. However, the resulting anonymized vectors are subject to severe privacy leakage against powerful attackers, reduction in speaker diversity, and language-mismatch problems when anonymizing speakers of unseen languages. To generate diverse, language-neutral speaker vectors, this paper proposes an anonymizer based on an orthogonal Householder neural network (OHNN). Specifically, the OHNN acts like a rotation to transform the original speaker vectors into anonymized speaker vectors, which are constrained to follow the distribution over the original speaker vector space. A basic classification loss is introduced to ensure that anonymized speaker vectors from different speakers have unique speaker identities. To further protect speaker identities, an improved classification loss and similarity loss are used to push original-anonymized sample pairs away from each other. Experiments on VoicePrivacy Challenge datasets in English and the AISHELL-3 dataset in Mandarin demonstrate the proposed anonymizer's effectiveness.
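Since the OHNN is described as a rotation applied to the original speaker vector, a minimal sketch of how a chain of learnable Householder reflections can realize such an orthogonal transform is shown below; the layer count, embedding dimension, and parameterization are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class HouseholderAnonymizer(nn.Module):
    """Orthogonal transform built as a product of learnable Householder reflections.

    Each reflection H_i = I - 2 v_i v_i^T / ||v_i||^2 is orthogonal, so the
    composed map preserves vector norms and keeps anonymized vectors on the
    same scale as the original speaker-vector space.
    """

    def __init__(self, dim: int, num_reflections: int = 8):
        super().__init__()
        self.vectors = nn.Parameter(torch.randn(num_reflections, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim) original speaker vectors.
        for v in self.vectors:
            v = v / v.norm().clamp_min(1e-8)
            # H x = x - 2 (x . v) v, applied to every vector in the batch.
            x = x - 2.0 * (x @ v).unsqueeze(-1) * v
        return x

# Example: anonymize a batch of 192-dimensional speaker embeddings.
anonymizer = HouseholderAnonymizer(dim=192)
x_anon = anonymizer(torch.randn(4, 192))
```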
The ability of countermeasure models to generalize from seen speech synthesis methods to unseen ones has been investigated in the ASVspoof challenge. However, a new mismatch scenario in which fake audio may be generated from real audio with unseen genres has not been studied thoroughly. To this end, we first use five different vocoders to create a new dataset called CN-Spoof based on the CN-Celeb1\&2 datasets. Then, we design two auxiliary objectives for regularization via meta-optimization and a genre alignment module, respectively, and combine them with the main anti-spoofing objective using learnable weights for the multiple loss terms. The results on our cross-genre evaluation dataset for anti-spoofing show that the proposed method significantly improves the generalization ability of the countermeasures compared with the baseline system in the genre mismatch scenario.
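One common way to combine a main objective with auxiliary losses under learnable weights is uncertainty-style weighting; the sketch below illustrates that idea under this assumption (the abstract does not specify the exact weighting scheme used in the paper).

```python
import torch
import torch.nn as nn

class LearnableLossWeights(nn.Module):
    """Combine the main anti-spoofing loss with auxiliary losses via learnable weights.

    Each loss is scaled by exp(-s_i); the additive s_i term discourages the
    learnable weights from collapsing to zero during training.
    """

    def __init__(self, num_losses: int = 3):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_losses))

    def forward(self, losses):
        # losses: sequence of scalar loss tensors, e.g.
        # [anti_spoofing_loss, meta_regularization_loss, genre_alignment_loss].
        total = torch.zeros((), device=self.log_vars.device)
        for loss, s in zip(losses, self.log_vars):
            total = total + torch.exp(-s) * loss + s
        return total
```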
The use of modern vocoders in an analysis/synthesis pipeline allows us to investigate high-quality voice conversion that can be used for privacy purposes. Here, we propose to transform the speaker embedding and the pitch in order to hide the sex of the speaker. An ECAPA-TDNN-based speaker representation, fed into a HiFiGAN vocoder, is protected using a neural-discriminant analysis approach, which is consistent with the zero-evidence concept of privacy. This approach significantly reduces the information in speech related to the speaker's sex while preserving speech content and some consistency in the resulting protected voices.
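As a simplified linear analogue of this idea (not the paper's neural-discriminant analysis), one can project the speaker embedding away from a sex-discriminant direction learned offline, so that a linear attacker along that direction gains no evidence; the discriminant vector and embedding format below are hypothetical.

```python
import numpy as np

def remove_sex_direction(spk_emb, w):
    """Project a speaker embedding onto the null space of a linear sex discriminant.

    spk_emb: (dim,) speaker embedding (e.g. from an ECAPA-TDNN).
    w:       (dim,) weight vector of a linear sex classifier trained offline.
    After the projection, a linear attacker using w assigns the same score to
    every embedding, i.e. the embedding carries no linear evidence about sex.
    """
    w = np.asarray(w, dtype=float)
    w = w / np.linalg.norm(w)
    spk_emb = np.asarray(spk_emb, dtype=float)
    return spk_emb - np.dot(spk_emb, w) * w
```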
For new participants - Executive summary: (1) The task is to develop a voice anonymization system for speech data that conceals the speaker's voice identity while protecting linguistic content, paralinguistic attributes, intelligibility and naturalness. (2) Training, development and evaluation datasets are provided in addition to 3 different baseline anonymization systems, evaluation scripts, and metrics. Participants apply their developed anonymization systems, run evaluation scripts and submit objective evaluation results and anonymized speech data to the organizers. (3) Results will be presented at a workshop held in conjunction with INTERSPEECH 2022, at which all participants are invited to present their challenge systems and to submit additional workshop papers. For readers familiar with the VoicePrivacy Challenge - Changes w.r.t. 2020: (1) A stronger, semi-informed attack model in the form of an automatic speaker verification (ASV) system trained on anonymized (per-utterance) speech data. (2) Complementary metrics comprising the equal error rate (EER) as a privacy metric, the word error rate (WER) as a primary utility metric, and the pitch correlation and gain of voice distinctiveness as secondary utility metrics. (3) A new ranking policy based upon a set of minimum target privacy requirements.
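The EER used here as the privacy metric is the operating point at which the attacker's ASV system makes false acceptances and false rejections at equal rates; a minimal sketch of computing it from target and non-target trial scores is shown below (the score format and threshold sweep are illustrative assumptions, not the challenge's official scoring tool).

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """EER from ASV similarity scores (higher score means 'same speaker').

    target_scores:    scores of genuine (same-speaker) trials.
    nontarget_scores: scores of impostor (different-speaker) trials.
    """
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)

    # Sweep every observed score as a candidate decision threshold.
    thresholds = np.unique(np.concatenate([target_scores, nontarget_scores]))
    frr = np.array([(target_scores < t).mean() for t in thresholds])     # false rejection rate
    far = np.array([(nontarget_scores >= t).mean() for t in thresholds])  # false acceptance rate

    # The EER is the error rate where the two curves cross.
    idx = np.argmin(np.abs(frr - far))
    return 0.5 * (frr[idx] + far[idx])
```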
Speaker anonymization aims to protect the privacy of speakers while preserving spoken linguistic information in speech. Current mainstream neural-network speaker anonymization systems are complicated, containing an F0 extractor, speaker encoder, automatic speech recognition acoustic model (ASR AM), speech synthesis acoustic model, and speech waveform generation model. Moreover, as the ASR AM is language-dependent and trained on English data, it is hard to adapt to other languages. In this paper, we propose a simpler self-supervised learning (SSL)-based method for language-independent speaker anonymization without any explicit language-dependent model, which can be easily used for other languages. Extensive experiments were conducted on the VoicePrivacy Challenge 2020 datasets in English and the AISHELL-3 dataset in Mandarin to demonstrate the effectiveness of our proposed SSL-based language-independent speaker anonymization method.
This paper presents a novel Dialect Identification (DID) system developed for the Fifth Edition of the Multi-Genre Broadcast challenge, the task of Fine-grained Arabic Dialect Identification (MGB-5 ADI Challenge). The system improves upon traditional DNN x-vector performance by employing a Convolutional and Long Short-Term Memory-Recurrent (CLSTM) architecture, combining the benefits of a convolutional neural network front-end for feature extraction with a recurrent neural network back-end to capture longer temporal dependencies. Furthermore, we investigate intensive augmentation of one low-resource dialect in the highly unbalanced training set using time-scale modification (TSM). This converts an utterance into several time-stretched or time-compressed versions, which are subsequently used to train the CLSTM system without using any other corpus. We also investigate speech augmentation using the MUSAN and RIR datasets to increase the quantity and diversity of the existing training data in the conventional manner. Results show, firstly, that the CLSTM architecture outperforms a traditional DNN x-vector implementation; secondly, that TSM-based speed perturbation yields a small performance improvement for the unbalanced data; and finally, that traditional data augmentation techniques yield further benefit, in line with evidence from related speaker and language recognition tasks. Our system achieved a 2nd place ranking out of 15 entries in the MGB-5 ADI challenge, presented at ASRU 2019.
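Time-scale modification of this kind can be sketched with a standard phase-vocoder time stretch; the stretch factors and the use of librosa below are illustrative assumptions rather than the exact TSM implementation used in the paper.

```python
import librosa

def tsm_augment(wav_path, rates=(0.9, 1.0, 1.1)):
    """Create time-stretched/compressed copies of one utterance.

    rates < 1.0 slow the utterance down and rates > 1.0 speed it up; unlike
    naive resampling-based speed perturbation, the pitch is left unchanged.
    """
    y, sr = librosa.load(wav_path, sr=None)
    versions = [(rate, librosa.effects.time_stretch(y, rate=rate)) for rate in rates]
    return versions, sr
```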