Jiangyan Yi

Distinguishing Neural Speech Synthesis Models Through Fingerprints in Speech Waveforms

Sep 13, 2023

Chu Yuan Zhang, Jiangyan Yi, Jianhua Tao, Chenglong Wang, Xinrui Yan

Abstract: Recent strides in neural speech synthesis technologies, while enjoying widespread applications, have nonetheless introduced a series of challenges, spurring interest in the defence against the threat of misuse and abuse. Notably, source attribution of synthesized speech has value in forensics and intellectual property protection, but prior work in this area has certain limitations in scope. To address the gaps, we present our findings concerning the identification of the sources of synthesized speech in this paper. We investigate the existence of speech synthesis model fingerprints in the generated speech waveforms, with a focus on the acoustic model and the vocoder, and study the influence of each component on the fingerprint in the overall speech waveforms. Our research, conducted using the multi-speaker LibriTTS dataset, demonstrates two key insights: (1) vocoders and acoustic models impart distinct, model-specific fingerprints on the waveforms they generate, and (2) vocoder fingerprints are the more dominant of the two, and may mask the fingerprints from the acoustic model. These findings strongly suggest the existence of model-specific fingerprints for both the acoustic model and the vocoder, highlighting their potential utility in source identification applications.

* Submitted to ICASSP 2024

DGSD: Dynamical Graph Self-Distillation for EEG-Based Auditory Spatial Attention Detection

Sep 07, 2023

Cunhang Fan, Hongyu Zhang, Wei Huang, Jun Xue, Jianhua Tao, Jiangyan Yi, Zhao Lv, Xiaopei Wu

Abstract: Auditory Attention Detection (AAD) aims to detect the target speaker from brain signals in a multi-speaker environment. Although EEG-based AAD methods have shown promising results in recent years, current approaches primarily rely on traditional convolutional neural networks designed for processing Euclidean data such as images. This makes it challenging to handle EEG signals, which possess non-Euclidean characteristics. To address this problem, this paper proposes a dynamical graph self-distillation (DGSD) approach for AAD, which does not require speech stimuli as input. Specifically, to effectively represent the non-Euclidean properties of EEG signals, dynamical graph convolutional networks are applied to represent the graph structure of EEG signals, which can also extract crucial features related to auditory spatial attention in EEG signals. In addition, to further improve AAD performance, self-distillation, consisting of feature distillation and hierarchical distillation strategies at each layer, is integrated. These strategies leverage features and classification results from the deepest network layers to guide the learning of shallow layers. Our experiments are conducted on two publicly available datasets, KUL and DTU. Under a 1-second time window, we achieve accuracies of 90.0% and 79.6% on KUL and DTU, respectively. We compare our DGSD method with competitive baselines, and the experimental results indicate that the proposed DGSD method not only outperforms the best reproducible baseline but also reduces the number of trainable parameters by a factor of approximately 100.
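
The self-distillation idea in the abstract above (deep-layer features and predictions guiding shallow layers) can be illustrated with a minimal sketch. This is not the authors' loss: the KL-plus-MSE form, the function names, and the `temperature` and `feat_weight` parameters are assumptions chosen for illustration only.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def self_distillation_loss(shallow_logits, deep_logits, shallow_feat, deep_feat,
                           temperature=2.0, feat_weight=0.5):
    """Hypothetical loss: KL(deep || shallow) on softened predictions
    plus a mean-squared-error term between shallow and deep features."""
    teacher = softmax(deep_logits, temperature)     # deepest layer acts as teacher
    student = softmax(shallow_logits, temperature)  # shallow layer acts as student
    kl = sum(t * math.log(t / s) for t, s in zip(teacher, student))
    mse = sum((a - b) ** 2 for a, b in zip(shallow_feat, deep_feat)) / len(deep_feat)
    return kl + feat_weight * mse
```

The loss is zero when the shallow layer already matches the deep layer and grows as their predictions or features diverge.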

Audio Deepfake Detection: A Survey

Aug 29, 2023

Jiangyan Yi, Chenglong Wang, Jianhua Tao, Xiaohui Zhang, Chu Yuan Zhang, Yan Zhao

Abstract: Audio deepfake detection is an emerging and active topic. A growing number of studies have investigated deepfake detection algorithms and achieved effective performance, but the problem is far from solved. Although some reviews exist, there has been no comprehensive survey that provides researchers with a systematic overview of these developments with a unified evaluation. Accordingly, in this survey paper, we first highlight the key differences across various types of deepfake audio, then outline and analyse competitions, datasets, features, classifications, and evaluation of state-of-the-art approaches. For each aspect, the basic techniques, advanced developments and major challenges are discussed. In addition, we perform a unified comparison of representative features and classifiers on the ASVspoof 2021, ADD 2023 and In-the-Wild datasets for audio deepfake detection. The survey shows that future research should address the lack of large-scale datasets in the wild, the poor generalization of existing detection methods to unknown fake attacks, and the interpretability of detection results.

Spatial Reconstructed Local Attention Res2Net with F0 Subband for Fake Speech Detection

Aug 19, 2023

Cunhang Fan, Jun Xue, Jianhua Tao, Jiangyan Yi, Chenglong Wang, Chengshi Zheng, Zhao Lv

Abstract: The rhythm of synthetic speech is usually too smooth, which causes the fundamental frequency (F0) of synthetic speech to differ significantly from that of real speech. The F0 feature is therefore expected to contain discriminative information for the fake speech detection (FSD) task. In this paper, we propose a novel F0 subband for FSD. In addition, to effectively model the F0 subband and thereby improve FSD performance, the spatial reconstructed local attention Res2Net (SR-LA Res2Net) is proposed. Specifically, Res2Net is used as a backbone network to obtain multiscale information, and is enhanced with a spatial reconstruction mechanism to avoid losing important information when the channel groups are repeatedly superimposed. In addition, local attention is designed to make the model focus on the local information of the F0 subband. Experimental results on the ASVspoof 2019 LA dataset show that our proposed method obtains an equal error rate (EER) of 0.47% and a minimum tandem detection cost function (min t-DCF) of 0.0159, achieving state-of-the-art performance among single systems.
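
For readers unfamiliar with the EER metric reported above, the following is a minimal sketch of how an equal error rate can be approximated from detector scores. It assumes that a higher score means "more likely genuine" and sweeps only the observed scores as thresholds; official evaluation toolkits may interpolate differently, and the function name is hypothetical.

```python
def equal_error_rate(genuine_scores, spoof_scores):
    """Approximate the EER by sweeping thresholds over all observed scores.

    Assumed convention: higher score = more likely genuine.
    """
    thresholds = sorted(set(genuine_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: genuine utterances scored below the threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        # False acceptance: spoofed utterances scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        if abs(far - frr) < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

The EER is the error rate at the threshold where false acceptances and false rejections balance, so a single number summarizes the detector without fixing a threshold in advance.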

Do You Remember? Overcoming Catastrophic Forgetting for Fake Audio Detection

Aug 07, 2023

Xiaohui Zhang, Jiangyan Yi, Jianhua Tao, Chenglong Wang, Chuyuan Zhang

Abstract: Current fake audio detection algorithms have achieved promising performance on most datasets. However, their performance may degrade significantly when dealing with audio from a different dataset. Orthogonal weight modification, a method for overcoming catastrophic forgetting, does not consider the similarity of genuine audio across different datasets. To overcome this limitation, we propose a continual learning algorithm for fake audio detection, called Regularized Adaptive Weight Modification (RAWM). When fine-tuning a detection network, our approach adaptively computes the direction of weight modification according to the ratio of genuine utterances to fake utterances. The adaptive modification direction ensures the network can effectively detect fake audio on the new dataset while preserving its knowledge of the old model, thus mitigating catastrophic forgetting. In addition, genuine audio collected under quite different acoustic conditions may skew the feature distribution, so we introduce a regularization constraint to force the network to remember the old distribution in this regard. Our method can easily be generalized to related fields, such as speech emotion recognition. We also evaluate our approach across multiple datasets and obtain significant performance improvements in cross-dataset experiments.

* 40th International Conference on Machine Learning (ICML 2023)

TST: Time-Sparse Transducer for Automatic Speech Recognition

Jul 17, 2023

Xiaohui Zhang, Mangui Liang, Zhengkun Tian, Jiangyan Yi, Jianhua Tao

Abstract: End-to-end models, especially the Recurrent Neural Network Transducer (RNN-T), have achieved great success in speech recognition. However, the transducer requires a large memory footprint and long computing time when processing a long decoding sequence. To solve this problem, we propose a model named time-sparse transducer (TST), which introduces a time-sparse mechanism into the transducer. In this mechanism, we obtain intermediate representations by reducing the time resolution of the hidden states. A weighted average algorithm is then used to combine these representations into sparse hidden states, which are then fed to the decoder. All experiments are conducted on the Mandarin dataset AISHELL-1. Compared with RNN-T, the time-sparse transducer achieves a similar character error rate while its real-time factor is 50.00% of the original. By adjusting the time resolution, the time-sparse transducer can also reduce the real-time factor to 16.54% of the original at the expense of a 4.94% loss of precision.

* International Conference on Artificial Intelligence (CICAI 2023)
* 10 pages
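
The time-sparse mechanism described in the abstract above can be sketched roughly as windowed weighted averaging over encoder frames. This is an illustrative assumption, not the paper's implementation: the function name, the fixed `window`, and the externally supplied `weights` are hypothetical, and the actual model may compute its weights differently.

```python
def time_sparse(hidden, window, weights=None):
    """Collapse every `window` consecutive frames into one weighted-average frame.

    `hidden` is a list of frames, each frame a list of floats.
    """
    if weights is None:
        weights = [1.0] * window  # uniform weights: plain average pooling
    sparse = []
    for start in range(0, len(hidden), window):
        chunk = hidden[start:start + window]
        w = weights[:len(chunk)]  # the last chunk may be shorter than `window`
        total = sum(w)
        dim = len(chunk[0])
        sparse.append([
            sum(wi * frame[d] for wi, frame in zip(w, chunk)) / total
            for d in range(dim)
        ])
    return sparse
```

With a window of 2, five input frames become three, so the decoder in this sketch processes roughly `1 / window` as many time steps.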

Low-rank Adaptation Method for Wav2vec2-based Fake Audio Detection

Jun 09, 2023

Chenglong Wang, Jiangyan Yi, Xiaohui Zhang, Jianhua Tao, Le Xu, Ruibo Fu

Abstract: Self-supervised speech models are a rapidly developing research topic in fake audio detection. Many pre-trained models can serve as feature extractors, learning richer and higher-level speech features. However, fine-tuning pre-trained models often entails excessively long training times and high memory consumption, and complete fine-tuning is also very expensive. To alleviate this problem, we apply low-rank adaptation (LoRA) to the wav2vec2 model, freezing the pre-trained model weights and injecting a trainable rank-decomposition matrix into each layer of the transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared with fine-tuning with Adam on the wav2vec2 model containing 317M trainable parameters, LoRA achieves similar performance while reducing the number of trainable parameters by a factor of 198.

* IJCAI 2023 Workshop on Deepfake Audio Detection and Analysis
* 6 pages
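
As a rough illustration of the low-rank adaptation idea in the abstract above, the sketch below adds a trainable rank-`r` update `B @ A` to a frozen weight matrix and compares parameter counts for a single layer. The helper names and the layer sizes are assumptions for illustration; the 198-fold reduction reported in the abstract depends on the full wav2vec2 configuration and is not reproduced here.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w_frozen, b, a, scale=1.0):
    """Return W + scale * (B @ A); in LoRA only A and B would be trained."""
    delta = matmul(b, a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w_frozen, delta)]

def trainable_params(d_out, d_in, rank):
    """Compare full fine-tuning with a rank-`rank` update for one weight matrix."""
    return {"full": d_out * d_in, "lora": rank * (d_out + d_in)}
```

For a hypothetical 1024 x 1024 layer with rank 8, the update needs 16,384 trainable values instead of 1,048,576. Initializing `B` to zeros leaves the frozen weights unchanged at the start of training.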

Adaptive Fake Audio Detection with Low-Rank Model Squeezing

Jun 08, 2023

Xiaohui Zhang, Jiangyan Yi, Jianhua Tao, Chenlong Wang, Le Xu, Ruibo Fu

Abstract: The rapid advancement of spoofing algorithms necessitates the development of robust detection methods capable of accurately identifying emerging fake audio. Traditional approaches, such as finetuning on new datasets containing these novel spoofing algorithms, are computationally intensive and pose a risk of impairing the acquired knowledge of known fake audio types. To address these challenges, this paper proposes an innovative approach that mitigates the limitations associated with finetuning. We introduce the concept of training low-rank adaptation matrices tailored specifically to the newly emerging fake audio types. During the inference stage, these adaptation matrices are combined with the existing model to generate the final prediction output. Extensive experimentation is conducted to evaluate the efficacy of the proposed method. The results demonstrate that our approach effectively preserves the prediction accuracy of the existing model for known fake audio types. Furthermore, our approach offers several advantages, including reduced storage memory requirements and lower equal error rates compared to conventional finetuning methods, particularly on specific spoofing algorithms.

TO-Rawnet: Improving RawNet with TCN and Orthogonal Regularization for Fake Audio Detection

May 23, 2023

Chenglong Wang, Jiangyan Yi, Jianhua Tao, Chuyuan Zhang, Shuai Zhang, Ruibo Fu, Xun Chen

Abstract: Current fake audio detection relies on hand-crafted features, which lose information during extraction. To overcome this, recent studies extract features directly from raw audio signals. For example, RawNet is one of the representative works in end-to-end fake audio detection. However, existing work on RawNet does not optimize the parameters of the Sinc-conv during training, which limits its performance. In this paper, we propose to incorporate orthogonal convolution into RawNet, which reduces the correlation between filters when optimizing the parameters of Sinc-conv, thus improving discriminability. Additionally, we introduce temporal convolutional networks (TCN) to capture long-term dependencies in speech signals. Experiments on ASVspoof 2019 show that our TO-RawNet system achieves a relative EER reduction of 66.09% in the logical access scenario compared with RawNet, demonstrating its effectiveness in detecting fake audio attacks.

* Interspeech 2023
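
The abstract above says orthogonal convolution reduces the correlation between filters. One common way to express that goal is a soft orthogonality penalty on the filter matrix, sketched below. This is an assumed, generic formulation (the squared Frobenius norm of `W W^T - I`), not necessarily the regularizer used in TO-RawNet, and the function name is hypothetical.

```python
def orthogonality_penalty(filters):
    """Squared Frobenius norm of (W W^T - I), with filters as the rows of W.

    The penalty is zero when the filters are orthonormal and grows as
    filters become correlated with one another.
    """
    penalty = 0.0
    for i, fi in enumerate(filters):
        for j, fj in enumerate(filters):
            gram = sum(x * y for x, y in zip(fi, fj))  # inner product of two filters
            target = 1.0 if i == j else 0.0            # identity matrix entry
            penalty += (gram - target) ** 2
    return penalty
```

In training, a term like this would typically be added to the classification loss with a small weight, so that filters are nudged apart without dominating the main objective.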

Detection of Cross-Dataset Fake Audio Based on Prosodic and Pronunciation Features

May 23, 2023

Chenglong Wang, Jiangyan Yi, Jianhua Tao, Chuyuan Zhang, Shuai Zhang, Xun Chen

Abstract: Existing fake audio detection systems perform well in in-domain testing, but still face many challenges in out-of-domain testing. This is due to the mismatch between the training and test data, as well as the poor generalizability of features extracted from limited views. To address this, we propose multi-view features for fake audio detection, which aim to capture more generalized features from prosodic, pronunciation, and wav2vec dimensions. Specifically, the phoneme duration features are extracted from a pre-trained model based on a large amount of speech data. For the pronunciation features, a Conformer-based phoneme recognition model is first trained, keeping the acoustic encoder part as a deeply embedded feature extractor. Furthermore, the prosodic and pronunciation features are fused with wav2vec features based on an attention mechanism to improve the generalization of fake audio detection models. Results show that the proposed approach achieves significant performance gains in several cross-dataset experiments.

* Interspeech 2023
