Tomi Kinnunen

Meta-Learning Approaches for Improving Detection of Unseen Speech Deepfakes

Oct 27, 2024

Ivan Kukanov, Janne Laakkonen, Tomi Kinnunen, Ville Hautamäki

Abstract: Current speech deepfake detection approaches perform satisfactorily against known adversaries; however, generalization to unseen attacks remains an open challenge. The proliferation of speech deepfakes on social media underscores the need for systems that can generalize to attacks not observed during training. We address this problem from the perspective of meta-learning, aiming to learn attack-invariant features to adapt to unseen attacks with very few samples available. This approach is promising since generating a large-scale training dataset is often expensive or infeasible. Our experiments demonstrated an improvement in the Equal Error Rate (EER) from 21.67% to 10.42% on the InTheWild dataset, using just 96 samples from the unseen dataset. Continuous few-shot adaptation ensures that the system remains up-to-date.

* 6 pages, accepted to the IEEE Spoken Language Technology Workshop (SLT) 2024
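
The abstract above reports its results as Equal Error Rate (EER), the operating point where the miss rate and the false-alarm rate coincide. As a minimal sketch of how such a figure is commonly computed from detector scores (not the authors' evaluation code; the function name `compute_eer` and the convention that higher scores mean "more bonafide" are assumptions made here for illustration):

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold). Assumes higher score = more bonafide-like."""
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    # Candidate thresholds: every distinct observed score, in ascending order.
    thresholds = np.unique(np.concatenate([bonafide, spoof]))
    # Miss rate: fraction of bonafide trials rejected (score < threshold).
    miss = np.array([(bonafide < t).mean() for t in thresholds])
    # False-alarm rate: fraction of spoof trials accepted (score >= threshold).
    fa = np.array([(spoof >= t).mean() for t in thresholds])
    # Pick the threshold where the two error rates are closest and
    # report their average as the EER estimate.
    idx = int(np.argmin(np.abs(miss - fa)))
    return float((miss[idx] + fa[idx]) / 2.0), float(thresholds[idx])

eer, thr = compute_eer([2.0, 3.0, 4.0, 5.0], [0.0, 1.0, 2.5, 1.5])
print(f"EER = {eer:.2%} at threshold {thr}")
```

With only a handful of trials the estimate is coarse; evaluation toolkits typically interpolate between neighbouring operating points instead of picking the closest one.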

ASVspoof 5: Crowdsourced Speech Data, Deepfakes, and Adversarial Attacks at Scale

Aug 16, 2024

Xin Wang, Hector Delgado, Hemlata Tak, Jee-weon Jung, Hye-jin Shim, Massimiliano Todisco, Ivan Kukanov, Xuechen Liu, Md Sahidullah, Tomi Kinnunen(+3 more)

Abstract: ASVspoof 5 is the fifth edition in a series of challenges that promote the study of speech spoofing and deepfake attacks, and the design of detection solutions. Compared to previous challenges, the ASVspoof 5 database is built from crowdsourced data collected from a vastly greater number of speakers in diverse acoustic conditions. Attacks, also crowdsourced, are generated and tested using surrogate detection models, while adversarial attacks are incorporated for the first time. New metrics support the evaluation of spoofing-robust automatic speaker verification (SASV) as well as stand-alone detection solutions, i.e., countermeasures without ASV. We describe the two challenge tracks, the new database, the evaluation metrics, baselines, and the evaluation platform, and present a summary of the results. Attacks significantly compromise the baseline systems, while submissions bring substantial improvements.

* 8 pages, ASVspoof 5 Workshop (Interspeech 2024 Satellite)

Beyond Silence: Bias Analysis through Loss and Asymmetric Approach in Audio Anti-Spoofing

Jun 25, 2024

Hye-jin Shim, Md Sahidullah, Jee-weon Jung, Shinji Watanabe, Tomi Kinnunen

Abstract: Current trends in audio anti-spoofing detection research strive to improve models' ability to generalize across unseen attacks by learning to identify a variety of spoofing artifacts. This emphasis has primarily focused on the spoof class. Recently, several studies have noted that the distribution of silence differs between the two classes, which can serve as a shortcut. In this paper, we extend class-wise interpretations beyond silence. We employ loss analysis and asymmetric methodologies to move away from traditional attack-focused and result-oriented evaluations towards a deeper examination of model behaviors. Our investigations highlight the significant differences in training dynamics between the two classes, emphasizing the need for future research to focus on robust modeling of the bonafide class.

* 5 pages, 1 figure, 5 tables

Revisiting and Improving Scoring Fusion for Spoofing-aware Speaker Verification Using Compositional Data Analysis

Jun 16, 2024

Xin Wang, Tomi Kinnunen, Kong Aik Lee, Paul-Gauthier Noé, Junichi Yamagishi

Abstract: Fusing outputs from automatic speaker verification (ASV) and spoofing countermeasure (CM) is expected to make an integrated system robust to zero-effort imposters and synthesized spoofing attacks. Many score-level fusion methods have been proposed, but many remain heuristic. This paper revisits score-level fusion using tools from decision theory and presents three main findings. First, fusion by summing the ASV and CM scores can be interpreted on the basis of compositional data analysis, and score calibration before fusion is essential. Second, the interpretation leads to an improved fusion method that linearly combines the log-likelihood ratios of ASV and CM. However, as the third finding reveals, this linear combination is inferior to a non-linear one in making optimal decisions. The outcomes of these findings, namely, the score calibration before fusion, improved linear fusion, and better non-linear fusion, were found to be effective on the SASV challenge database.

* Interspeech 2024 Accepted. https://github.com/nii-yamagishilab/SpeechSPC-mini
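
The abstract above mentions a fusion method that linearly combines the log-likelihood ratios (LLRs) of ASV and CM. A generic weighted linear combination can be sketched as follows; this is only an illustration of the general idea, not the formula from the paper or its repository, and the function name, weights, and bias are hypothetical. The inputs are assumed to be already calibrated LLRs, in line with the abstract's point that calibration before fusion is essential.

```python
import numpy as np

def linear_llr_fusion(llr_asv, llr_cm, w_asv=0.5, w_cm=0.5, bias=0.0):
    """Weighted linear combination of per-trial ASV and CM LLRs.

    The weights and bias are hypothetical placeholders; in practice they
    would be chosen or trained on development data.
    """
    llr_asv = np.asarray(llr_asv, dtype=float)
    llr_cm = np.asarray(llr_cm, dtype=float)
    if llr_asv.shape != llr_cm.shape:
        raise ValueError("ASV and CM score arrays must have the same shape")
    return w_asv * llr_asv + w_cm * llr_cm + bias

# One fused score per trial; a trial is accepted if it exceeds a threshold.
fused = linear_llr_fusion([2.0, -1.0], [4.0, 3.0])
print(fused)
```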

ROAR: Reinforcing Original to Augmented Data Ratio Dynamics for Wav2Vec2.0 Based ASR

Jun 14, 2024

Vishwanath Pratap Singh, Federico Malato, Ville Hautamaki, Md. Sahidullah, Tomi Kinnunen

Abstract: While automatic speech recognition (ASR) greatly benefits from data augmentation, the augmentation recipes themselves tend to be heuristic. In this paper, we address one of the heuristic approaches associated with balancing the right amount of augmented data in ASR training by introducing a reinforcement learning (RL) based dynamic adjustment of original-to-augmented data ratio (OAR). Unlike the fixed OAR approach in conventional data augmentation, our proposed method employs a deep Q-network (DQN) as the RL mechanism to learn the optimal dynamics of OAR throughout the wav2vec2.0 based ASR training. We conduct experiments using the LibriSpeech dataset with varying amounts of training data, specifically, the 10Min, 1H, 10H, and 100H splits to evaluate the efficacy of the proposed method under different data conditions. Our proposed method, on average, achieves a relative improvement of 4.96% over the open-source wav2vec2.0 base model on standard LibriSpeech test sets.

* Accepted to Interspeech 2024

a-DCF: an architecture agnostic metric with application to spoofing-robust speaker verification

Mar 03, 2024

Hye-jin Shim, Jee-weon Jung, Tomi Kinnunen, Nicholas Evans, Jean-Francois Bonastre, Itshak Lapidot

Abstract: Spoofing detection is today a mainstream research topic. Standard metrics can be applied to evaluate the performance of isolated spoofing detection solutions and others have been proposed to support their evaluation when they are combined with speaker detection. These either have well-known deficiencies or restrict the architectural approach to combine speaker and spoof detectors. In this paper, we propose an architecture-agnostic detection cost function (a-DCF). A generalisation of the original DCF used widely for the assessment of automatic speaker verification (ASV), the a-DCF is designed for the evaluation of spoofing-robust ASV. Like the DCF, the a-DCF reflects the cost of decisions in a Bayes risk sense, with explicitly defined class priors and detection cost model. We demonstrate the merit of the a-DCF through the benchmarking evaluation of architecturally-heterogeneous spoofing-robust ASV solutions.

* 8 pages, submitted to Speaker Odyssey 2024
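
The abstract above describes a cost that reflects decisions "in a Bayes risk sense, with explicitly defined class priors and detection cost model". A generic three-class Bayes-risk cost of that kind, for a system producing a single score per trial, can be sketched as below. This is an illustration under stated assumptions and may differ from the paper's exact definition and normalisation; the function names, priors, and costs are hypothetical placeholders.

```python
import numpy as np

def error_rates(tar_scores, non_scores, spf_scores, threshold):
    """Error rates of a single-score system at a given accept threshold."""
    tar = np.asarray(tar_scores, dtype=float)   # target (bonafide, same speaker)
    non = np.asarray(non_scores, dtype=float)   # non-target (bonafide, other speaker)
    spf = np.asarray(spf_scores, dtype=float)   # spoofed trials
    p_miss = float((tar < threshold).mean())        # targets rejected
    p_fa_non = float((non >= threshold).mean())     # non-targets accepted
    p_fa_spf = float((spf >= threshold).mean())     # spoofs accepted
    return p_miss, p_fa_non, p_fa_spf

def bayes_risk_cost(p_miss, p_fa_non, p_fa_spf,
                    pi_tar=0.9, pi_non=0.05, pi_spf=0.05,
                    c_miss=1.0, c_fa_non=10.0, c_fa_spf=10.0):
    """Unnormalised expected cost; priors and costs are hypothetical values."""
    return (c_miss * pi_tar * p_miss
            + c_fa_non * pi_non * p_fa_non
            + c_fa_spf * pi_spf * p_fa_spf)

rates = error_rates([1.0, 2.0, 3.0, 4.0],
                    [-1.0, 0.0, 0.5, 2.5],
                    [-2.0, 0.2, 1.8, 0.1],
                    threshold=1.5)
print(bayes_risk_cost(*rates))
```

Because the cost depends only on the final accept/reject decisions, it does not matter whether those decisions come from a cascade, a fused score, or a single end-to-end model.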

ChildAugment: Data Augmentation Methods for Zero-Resource Children's Speaker Verification

Feb 23, 2024

Vishwanath Pratap Singh, Md Sahidullah, Tomi Kinnunen

Abstract: The accuracy of modern automatic speaker verification (ASV) systems, when trained exclusively on adult data, drops substantially when applied to children's speech. The scarcity of children's speech corpora hinders fine-tuning ASV systems for children's speech. Hence, there is a timely need to explore more effective ways of reusing adults' speech data. One promising approach is to align vocal-tract parameters between adults and children through children-specific data augmentation, referred to here as ChildAugment. Specifically, we modify the formant frequencies and formant bandwidths of adult speech to emulate children's speech. The modified spectra are used to train an ECAPA-TDNN (emphasized channel attention, propagation, and aggregation in time-delay neural network) recognizer for children. We compare ChildAugment against various state-of-the-art data augmentation techniques for children's ASV. We also extensively compare different scoring methods, including cosine scoring, PLDA (probabilistic linear discriminant analysis), and NPLDA (neural PLDA). We also propose a low-complexity weighted cosine score for extremely low-resource children's ASV. Our findings on the CSLU kids corpus indicate that ChildAugment holds promise as a simple, acoustics-motivated approach for improving state-of-the-art deep learning based ASV for children. We achieve up to 12.45% (boys) and 11.96% (girls) relative improvement over the baseline.

* The following article has been accepted by The Journal of the Acoustical Society of America (JASA). After it is published, it will be found at https://pubs.aip.org/asa/jasa
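
Among the scoring methods listed in the abstract above, cosine scoring is the simplest: the verification score is the cosine of the angle between the enrollment and test speaker embeddings. A minimal sketch follows (plain cosine only, not the paper's weighted variant; the function name and the toy embeddings are assumptions for illustration):

```python
import numpy as np

def cosine_score(enroll_embedding, test_embedding):
    """Cosine similarity between two speaker embeddings, in [-1, 1]."""
    e = np.asarray(enroll_embedding, dtype=float)
    t = np.asarray(test_embedding, dtype=float)
    denom = np.linalg.norm(e) * np.linalg.norm(t)
    if denom == 0.0:
        raise ValueError("embeddings must be non-zero vectors")
    return float(np.dot(e, t) / denom)

# Toy 2-D embeddings; real speaker embeddings have far more dimensions.
print(cosine_score([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_score([1.0, 0.0], [0.0, 1.0]))   # orthogonal
```

A higher score indicates that the two utterances are more likely to come from the same speaker; the accept threshold is set on development data.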

Generalizing Speaker Verification for Spoof Awareness in the Embedding Space

Jan 28, 2024

Xuechen Liu, Md Sahidullah, Kong Aik Lee, Tomi Kinnunen

Abstract: It is now well known that automatic speaker verification (ASV) systems can be spoofed using various types of adversaries. The usual approach to protect ASV systems against such attacks is to develop a separate spoofing countermeasure (CM) module to classify speech input as either a bonafide or a spoofed utterance. Nevertheless, such a design requires additional computation and utilization efforts at the authentication stage. An alternative strategy involves a single monolithic ASV system designed to handle both zero-effort imposter (non-target) and spoofing attacks. Such spoof-aware ASV systems have the potential to provide stronger protection and more economical computation. To this end, we propose to generalize the standalone ASV (G-SASV) against spoofing attacks, where we leverage limited training data from CM to enhance a simple backend in the embedding space, without the involvement of a separate CM module during the test (authentication) phase. We propose a novel yet simple backend classifier based on deep neural networks and conduct the study via domain adaptation and multi-task integration of spoof embeddings at the training stage. Experiments are conducted on the ASVspoof 2019 logical access dataset, where we improve the performance of statistical ASV backends on the joint (bonafide and spoofed) and spoofed conditions by a maximum of 36.2% and 49.8% in terms of equal error rates, respectively.

* Published in IEEE/ACM Transactions on Audio, Speech, and Language Processing (doi updated)

t-EER: Parameter-Free Tandem Evaluation of Countermeasures and Biometric Comparators

Sep 21, 2023

Tomi Kinnunen, Kong Aik Lee, Hemlata Tak, Nicholas Evans, Andreas Nautsch

Abstract: Presentation attack (spoofing) detection (PAD) typically operates alongside biometric verification to improve reliability in the face of spoofing attacks. Even though the two sub-systems operate in tandem to solve the single task of reliable biometric verification, they address different detection tasks and are hence typically evaluated separately. Evidence shows that this approach is suboptimal. We introduce a new metric for the joint evaluation of PAD solutions operating in situ with biometric verification. In contrast to the tandem detection cost function proposed recently, the new tandem equal error rate (t-EER) is parameter free. The combination of two classifiers nonetheless leads to a set of operating points at which false alarm and miss rates are equal and also dependent upon the prevalence of attacks. We therefore introduce the concurrent t-EER, a unique operating point which is invariable to the prevalence of attacks. Using both modality (and even application) agnostic simulated scores, as well as real scores for a voice biometrics application, we demonstrate application of the t-EER to a wide range of biometric system evaluations under attack. The proposed approach is a strong candidate metric for the tandem evaluation of PAD systems and biometric comparators.

* To appear in IEEE Transactions on Pattern Analysis and Machine Intelligence. For the associated code, see https://github.com/TakHemlata/T-EER (GitHub) and https://colab.research.google.com/drive/1ga7eiKFP11wOFMuZjThLJlkBcwEG6_4m?usp=sharing (Google Colab)

Speaker Verification Across Ages: Investigating Deep Speaker Embedding Sensitivity to Age Mismatch in Enrollment and Test Speech

Jun 13, 2023

Vishwanath Pratap Singh, Md Sahidullah, Tomi Kinnunen

Abstract: In this paper, we study the impact of ageing on modern deep speaker embedding based automatic speaker verification (ASV) systems. We have selected two different datasets to examine ageing on the state-of-the-art ECAPA-TDNN system. The first dataset, used for addressing short-term ageing (up to a 10-year difference between enrollment and test) under uncontrolled conditions, is VoxCeleb. The second dataset, used for addressing long-term ageing effects (up to a 40-year difference) of Finnish speakers under a more controlled setup, is the Longitudinal Corpus of Finnish Spoken in Helsinki (LCFSH). Our study provides new insights into the impact of speaker ageing on modern ASV systems. Specifically, we establish a quantitative measure between ageing and ASV scores. Further, our research indicates that ageing affects female English speakers to a greater degree than male English speakers, while in the case of Finnish, it has a greater impact on male speakers than female speakers.

* Interspeech 2023
