Wei-Qiang Zhang

Task-Agnostic Structured Pruning of Speech Representation Models

Jun 02, 2023
Haoyu Wang, Siyuan Wang, Wei-Qiang Zhang, Hongbin Suo, Yulong Wan

Self-supervised pre-trained models such as wav2vec 2.0, HuBERT, and WavLM have been shown to significantly improve many speech tasks. However, their large memory footprint and high computational cost hinder their industrial applicability. Structured pruning is a hardware-friendly model compression technique but usually incurs a larger accuracy loss. In this paper, we propose a fine-grained attention head pruning method to compensate for the performance degradation. In addition, we introduce the straight-through estimator into the L0 regularization to further accelerate the pruned model. Experiments on the SUPERB benchmark show that our model achieves performance comparable to the dense model on multiple tasks and outperforms the wav2vec 2.0 base model on average, with 72% fewer parameters and twice the inference speed.
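
The abstract describes gating attention heads with an L0-style penalty and a straight-through estimator. The sketch below is a minimal illustration of that general idea, not the authors' implementation; the class name, initialization, and thresholds are assumptions.

```python
import torch
import torch.nn as nn

class L0HeadGate(nn.Module):
    """Illustrative L0-style gate over attention heads with a straight-through estimator.

    Each head gets a learnable logit; the forward pass applies a hard 0/1 mask,
    while gradients flow through the soft sigmoid gate (straight-through trick).
    """

    def __init__(self, num_heads: int, init_logit: float = 2.0):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.full((num_heads,), init_logit))

    def forward(self, head_outputs: torch.Tensor) -> torch.Tensor:
        # head_outputs: (batch, num_heads, seq_len, head_dim)
        soft = torch.sigmoid(self.log_alpha)      # differentiable gate in (0, 1)
        hard = (soft > 0.5).float()               # hard 0/1 mask used in the forward pass
        gate = hard + soft - soft.detach()        # straight-through estimator
        return head_outputs * gate.view(1, -1, 1, 1)

    def l0_penalty(self) -> torch.Tensor:
        # Expected number of active heads; added to the task loss to encourage sparsity.
        return torch.sigmoid(self.log_alpha).sum()
```

In training, one would add a weighted `l0_penalty()` term to the task loss and later drop heads whose gates settle at zero, which is what makes the pruning hardware-friendly.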

* Accepted by INTERSPEECH 2023 

DistilXLSR: A Light Weight Cross-Lingual Speech Representation Model

Jun 02, 2023
Haoyu Wang, Siyuan Wang, Wei-Qiang Zhang, Jinfeng Bai

Multilingual self-supervised speech representation models have greatly enhanced speech recognition performance for low-resource languages, and compressing these huge models has become a crucial prerequisite for their industrial application. In this paper, we propose DistilXLSR, a distilled cross-lingual speech representation model. By randomly shuffling the phonemes of existing speech, we reduce the linguistic information and distill cross-lingual models using only English data. We also design a layer-jumping initialization method to fully leverage the teacher's pre-trained weights. Experiments on two kinds of teacher models and 15 low-resource languages show that our method can reduce the parameters by 50% while maintaining cross-lingual representation ability. Our method is shown to generalize across languages and teacher models and has the potential to improve the cross-lingual performance of English pre-trained models.
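
As a rough sketch of the phoneme-shuffling idea (not the authors' exact pipeline), the helper below reorders phoneme segments of a waveform given boundaries from a forced aligner; the function name and segment format are assumptions.

```python
import random
import numpy as np

def shuffle_phoneme_segments(waveform: np.ndarray,
                             segments: list[tuple[int, int]],
                             seed: int | None = None) -> np.ndarray:
    """Randomly reorder phoneme segments of a waveform.

    `segments` is a list of (start_sample, end_sample) phoneme boundaries, e.g.
    produced by a forced aligner. Shuffling them suppresses linguistic structure
    while keeping the acoustic content, which is the intuition behind distilling
    cross-lingual models from English data only.
    """
    rng = random.Random(seed)
    chunks = [waveform[start:end] for start, end in segments]
    rng.shuffle(chunks)
    return np.concatenate(chunks)
```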

* Accepted by INTERSPEECH 2023 

Improving Speech Translation by Cross-Modal Multi-Grained Contrastive Learning

Apr 20, 2023
Hao Zhang, Nianwen Si, Yaqi Chen, Wenlin Zhang, Xukui Yang, Dan Qu, Wei-Qiang Zhang

The end-to-end speech translation (E2E-ST) model has gradually become a mainstream paradigm due to its low latency and reduced error propagation. However, it is non-trivial to train such a model well because of the task complexity and data scarcity. Owing to the differences between the speech and text modalities, E2E-ST performance is usually inferior to that of the corresponding machine translation (MT) model. Based on this observation, existing methods often use sharing mechanisms to carry out implicit knowledge transfer by imposing various constraints. However, the final model often performs worse on the MT task than an MT model trained alone, which means that the knowledge transfer ability of such methods is limited. To deal with these problems, we propose FCCL (Fine- and Coarse-Granularity Contrastive Learning) for E2E-ST, which performs explicit knowledge transfer through cross-modal multi-grained contrastive learning. A key ingredient of our approach is applying contrastive learning at both the sentence and frame level to comprehensively guide the extraction of speech representations containing rich semantic information. In addition, we adopt a simple whitening method to alleviate the representation degeneration in the MT model, which adversely affects contrastive learning. Experiments on the MuST-C benchmark show that our proposed approach significantly outperforms state-of-the-art E2E-ST baselines on all eight language pairs. Further analysis indicates that FCCL can free up capacity from learning grammatical structure information and force more layers to learn semantic information.
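
The sentence-level part of such a cross-modal contrastive objective can be illustrated with a standard InfoNCE loss between paired speech and text embeddings, as sketched below. This is only one plausible form; the paper's frame-level term and whitening step are omitted, and the function name and temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def sentence_level_contrastive_loss(speech_emb: torch.Tensor,
                                    text_emb: torch.Tensor,
                                    temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss pulling paired speech/text sentence embeddings together.

    speech_emb, text_emb: (batch, dim); row i of each corresponds to the same
    source sentence. Illustrative sketch only.
    """
    s = F.normalize(speech_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = s @ t.T / temperature                       # (batch, batch) similarities
    targets = torch.arange(s.size(0), device=s.device)   # diagonal entries are positives
    # Symmetric cross-entropy over the speech-to-text and text-to-speech directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```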

* IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, 2023

Unsupervised Anomaly Detection and Localization of Machine Audio: A GAN-based Approach

Mar 31, 2023
Anbai Jiang, Wei-Qiang Zhang, Yufeng Deng, Pingyi Fan, Jia Liu

Automatic detection of machine anomalies remains challenging for machine learning. We believe the capability of generative adversarial networks (GANs) suits the needs of machine audio anomaly detection, yet this has rarely been investigated in previous work. In this paper, we propose AEGAN-AD, a fully unsupervised approach in which the generator (also an autoencoder) is trained to reconstruct input spectrograms. We point out that the denoising nature of reconstruction limits its capacity. Thus, the discriminator is redesigned to aid the generator during both the training and detection stages. AEGAN-AD achieves state-of-the-art results on five machine types on the DCASE 2022 Challenge Task 2 dataset. A novel anomaly localization method is also investigated. Source code is available at: www.github.com/jianganbai/AEGAN-AD
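
One common way to combine an autoencoder-generator with a discriminator at detection time is to mix a reconstruction error with a discriminator feature distance. The sketch below shows that generic scoring pattern under assumed interfaces; it is not the paper's exact scoring rule, and `generator`, `discriminator`, and `alpha` are placeholders.

```python
import torch

@torch.no_grad()
def anomaly_score(spec: torch.Tensor, generator, discriminator,
                  alpha: float = 0.5) -> torch.Tensor:
    """Per-sample anomaly score for a batch of spectrograms.

    spec: (batch, 1, mels, frames). `generator` is an autoencoder over spectrograms;
    `discriminator` returns intermediate features. Higher score = more anomalous.
    """
    recon = generator(spec)
    rec_err = torch.mean((spec - recon) ** 2, dim=(1, 2, 3))          # reconstruction MSE
    f_real = discriminator(spec)
    f_fake = discriminator(recon)
    feat_err = torch.mean((f_real - f_fake) ** 2,
                          dim=tuple(range(1, f_real.dim())))          # feature distance
    return alpha * rec_err + (1.0 - alpha) * feat_err
```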

* Accepted by ICASSP 2023 

Cross-lingual Alzheimer's Disease detection based on paralinguistic and pre-trained features

Mar 14, 2023
Xuchu Chen, Yu Pu, Jinpeng Li, Wei-Qiang Zhang

We present our submission to the ICASSP-SPGC-2023 ADReSS-M Challenge, which investigates which acoustic features can be generalized and transferred across languages for Alzheimer's Disease (AD) prediction. The challenge consists of two tasks: one is to classify the speech of AD patients and healthy individuals, and the other is to infer the Mini-Mental State Examination (MMSE) score from speech alone. The main difficulty lies in the dataset mismatch: the training set is in English while the test set is in Greek. We extract paralinguistic features using the openSMILE toolkit and acoustic features using XLSR-53. In addition, we extract linguistic features after transcribing the speech into text. These features serve as indicators for AD detection in our method. Our method achieves an accuracy of 69.6% on the classification task and a root mean squared error (RMSE) of 4.788 on the regression task. The results suggest that the proposed method is a promising step toward automatic multilingual Alzheimer's Disease detection from spontaneous speech.
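
A minimal sketch of the two feature streams mentioned above is given below, using the `opensmile` Python package for paralinguistic functionals and a pre-trained XLSR-53 checkpoint from Hugging Face for acoustic features. The eGeMAPS feature set and the mean-pooling are assumptions for illustration; the paper's exact configuration may differ.

```python
import torch
import opensmile
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Paralinguistic functionals; eGeMAPS is one common openSMILE configuration.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

# Pre-trained cross-lingual wav2vec 2.0 (XLSR-53) checkpoint.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-large-xlsr-53")
xlsr = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-xlsr-53")

def extract_features(wav_path: str, waveform: torch.Tensor, sr: int = 16000):
    """Return (paralinguistic functionals, mean-pooled XLSR-53 utterance embedding)."""
    para = smile.process_file(wav_path)                    # pandas DataFrame of functionals
    inputs = extractor(waveform.numpy(), sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = xlsr(**inputs).last_hidden_state          # (1, frames, 1024)
    return para, hidden.mean(dim=1).squeeze(0)             # utterance-level vector
```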

* Accepted by ICASSP 2023 

Expressive Speech-driven Facial Animation with controllable emotions

Jan 05, 2023
Yutong Chen, Junhong Zhao, Wei-Qiang Zhang

Generating facial animation with high realism is in high demand but remains a challenging task. Existing approaches to speech-driven facial animation can produce satisfactory mouth movement and lip synchronization, but they are weak in rendering dramatic emotional expressions and lack flexibility in emotion control. This paper presents a novel deep learning-based approach for generating expressive facial animation from speech that can exhibit wide-spectrum facial expressions with controllable emotion type and intensity. We propose an emotion controller module to learn the relationship between emotion variations (e.g., type and intensity) and the corresponding facial expression parameters. It enables emotion-controllable facial animation, where the target expression can be continuously adjusted as desired. Qualitative and quantitative evaluations show that the animation generated by our method is rich in facial emotional expressiveness while retaining accurate lip movement, outperforming other state-of-the-art methods.
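
A hypothetical sketch of how an emotion controller might condition an animation decoder is shown below: an emotion embedding scaled by a continuous intensity value is concatenated with the speech features. Layer sizes and the conditioning scheme are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class EmotionController(nn.Module):
    """Map (emotion type, intensity) to a control vector for the animation decoder."""

    def __init__(self, num_emotions: int, emb_dim: int = 64):
        super().__init__()
        self.embedding = nn.Embedding(num_emotions, emb_dim)

    def forward(self, speech_feats: torch.Tensor,
                emotion_id: torch.Tensor, intensity: torch.Tensor) -> torch.Tensor:
        # speech_feats: (batch, frames, feat_dim); intensity: (batch,) in [0, 1].
        ctrl = self.embedding(emotion_id) * intensity.unsqueeze(-1)   # scale by intensity
        ctrl = ctrl.unsqueeze(1).expand(-1, speech_feats.size(1), -1) # broadcast over frames
        return torch.cat([speech_feats, ctrl], dim=-1)
```

Because the intensity enters as a continuous scale factor, the target expression can in principle be adjusted smoothly at inference time, which matches the controllability claim in the abstract.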

Exploring Effective Fusion Algorithms for Speech Based Self-Supervised Learning Models

Dec 20, 2022
Changli Tang, Yujin Wang, Xie Chen, Wei-Qiang Zhang

Self-supervised learning (SSL) has achieved great success in various areas, including speech processing. Recently, speech-based SSL models have been shown to extract superior universal representations on a range of downstream tasks compared to traditional hand-crafted features (e.g., FBank, MFCC) in the SUPERB benchmark. However, different types of SSL models may exhibit distinct strengths on different downstream tasks. To better exploit the potential of SSL models, in this work we explore the effective fusion of multiple SSL models. A series of model fusion algorithms are investigated and compared by combining two types of SSL models, HuBERT and data2vec, on two representative tasks from the SUPERB benchmark: speaker identification (SID) and automatic speech recognition (ASR). The experimental results demonstrate that our proposed fusion algorithms can significantly boost the performance of the individual models.
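
One simple fusion variant, shown for illustration only (the paper compares several algorithms), is a learnable weighted sum of projected frame-level features from the two SSL models; the projection sizes and softmax weighting here are assumptions.

```python
import torch
import torch.nn as nn

class WeightedFeatureFusion(nn.Module):
    """Fuse frame-level features from two SSL models with learnable weights."""

    def __init__(self, dim_a: int, dim_b: int, out_dim: int):
        super().__init__()
        self.proj_a = nn.Linear(dim_a, out_dim)
        self.proj_b = nn.Linear(dim_b, out_dim)
        self.weights = nn.Parameter(torch.zeros(2))   # learnable fusion weights

    def forward(self, feats_a: torch.Tensor, feats_b: torch.Tensor) -> torch.Tensor:
        # feats_a: (batch, frames, dim_a), e.g. from HuBERT;
        # feats_b: (batch, frames, dim_b), e.g. from data2vec.
        w = torch.softmax(self.weights, dim=0)
        return w[0] * self.proj_a(feats_a) + w[1] * self.proj_b(feats_b)
```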

* Accepted by NCMMSC2022 

LMD: A Learnable Mask Network to Detect Adversarial Examples for Speaker Verification

Nov 02, 2022
Xing Chen, Jie Wang, Xiao-Lei Zhang, Wei-Qiang Zhang, Kunde Yang

Although the security of automatic speaker verification (ASV) is seriously threatened by recently emerged adversarial attacks, some countermeasures have been proposed to alleviate the threat. However, many defense approaches not only require prior knowledge of the attackers but also offer weak interpretability. To address this issue, we propose an attacker-independent and interpretable method, named learnable mask detector (LMD), to separate adversarial examples from genuine ones. It uses score variation as an indicator to detect adversarial examples, where the score variation is the absolute discrepancy between the ASV scores of an original audio recording and of the audio re-synthesized from its masked complex spectrogram. A core component of the detector is a neural network that generates the masked spectrogram. The network needs only genuine examples for training, which makes the approach attacker-independent. Its interpretability lies in the fact that the network is trained to minimize the score variation of the targeted ASV system while maximizing the number of masked spectrogram bins of the genuine training examples. It builds on the observation that masking out the vast majority of spectrogram bins, which carry little speaker information, inevitably introduces a large score variation for an adversarial example and only a small one for a genuine example. Experimental results with 12 attackers and two representative ASV systems show that our proposed method outperforms five state-of-the-art baselines. The extensive experimental results can also serve as a benchmark for detection-based ASV defenses.
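
The score-variation test itself can be sketched in a few lines, as below. The `asv_model`, `masker`, and `resynthesize` callables and the STFT settings are placeholders assumed for illustration; the threshold would be tuned on genuine data only.

```python
import torch

@torch.no_grad()
def score_variation_detect(asv_model, masker, resynthesize,
                           enroll_emb, test_wave, threshold: float):
    """Flag an utterance as adversarial if masking changes the ASV score too much."""
    window = torch.hann_window(512)
    spec = torch.stft(test_wave, n_fft=512, window=window, return_complex=True)
    masked_spec = masker(spec)                        # zero out low-information bins
    recon_wave = resynthesize(masked_spec)            # e.g. inverse STFT
    s_orig = asv_model(enroll_emb, test_wave)         # ASV score, original audio
    s_mask = asv_model(enroll_emb, recon_wave)        # ASV score, masked reconstruction
    variation = (s_orig - s_mask).abs()               # the score-variation indicator
    return variation > threshold, variation
```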

* 13 pages, 9 figures 

Symmetric Saliency-based Adversarial Attack To Speaker Identification

Oct 30, 2022
Jiadi Yao, Xing Chen, Xiao-Lei Zhang, Wei-Qiang Zhang, Kunde Yang

To our knowledge, existing adversarial attack approaches to speaker identification either incur high computational cost or are not very effective. To address this issue, we propose a novel generation-network-based approach, called symmetric saliency-based encoder-decoder (SSED), to generate adversarial voice examples against speaker identification. It contains two novel components. First, it uses a saliency map decoder to learn the importance of speech samples to the decision of a targeted speaker identification system, so that the attacker focuses its artificial noise on the important samples. Second, it uses an angular loss function to push the speaker embedding far away from the source speaker. Our experimental results demonstrate that the proposed SSED yields state-of-the-art performance, i.e., over 97% targeted attack success rate and a signal-to-noise level of over 39 dB on both open-set and close-set speaker identification tasks, at low computational cost.
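
One illustrative form of an angular loss for this purpose, not necessarily the paper's exact definition, is to minimize the cosine similarity between the adversarial embedding and the source-speaker embedding, i.e. to maximize the angle between them:

```python
import torch
import torch.nn.functional as F

def angular_push_loss(adv_emb: torch.Tensor, src_emb: torch.Tensor) -> torch.Tensor:
    """Push the adversarial audio's speaker embedding away from the source speaker.

    adv_emb, src_emb: (batch, dim) speaker embeddings. Minimizing cosine similarity
    maximizes the angle between the two embeddings.
    """
    return F.cosine_similarity(adv_emb, src_emb, dim=-1).mean()
```

In a full attack this term would typically be combined with a constraint on the perturbation magnitude so the resulting audio remains close to the original signal.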

Exploring Effective Distillation of Self-Supervised Speech Models for Automatic Speech Recognition

Oct 27, 2022
Yujin Wang, Changli Tang, Ziyang Ma, Zhisheng Zheng, Xie Chen, Wei-Qiang Zhang

Recent years have witnessed great strides in self-supervised learning (SSL) for speech processing. An SSL model is normally pre-trained on a large variety of unlabelled data, and a large model size is preferred to increase modeling capacity. However, this may limit its potential applications due to the expensive computation and memory costs introduced by the oversized model. Miniaturization of SSL models has therefore become an important research direction of practical value. To this end, we explore the effective distillation of HuBERT-based SSL models for automatic speech recognition (ASR). First, to establish a strong baseline, a comprehensive study on different student model structures is conducted. On top of this, as a supplement to the regression loss widely adopted in previous works, a discriminative loss is introduced for HuBERT to enhance the distillation performance, especially in low-resource scenarios. In addition, we design a simple and effective algorithm to distill the front-end input from waveform to FBank features, resulting in a 17% parameter reduction and doubled inference speed, at marginal performance degradation.
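
A plausible shape for a combined regression-plus-discriminative distillation objective is sketched below; the exact loss forms, weights, and layer mapping are assumptions, not the paper's values.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_hidden: torch.Tensor, teacher_hidden: torch.Tensor,
                      student_logits: torch.Tensor, teacher_units: torch.Tensor,
                      alpha: float = 1.0, beta: float = 1.0) -> torch.Tensor:
    """Regression + discriminative distillation objective (illustrative form only).

    student_hidden, teacher_hidden: (batch, frames, dim) matched layer outputs.
    student_logits: (batch, frames, num_units) predictions over HuBERT's discrete
    pseudo-label units; teacher_units: (batch, frames) cluster indices.
    """
    # Regression term: L1 plus (1 - cosine similarity), common in SSL distillation.
    reg = F.l1_loss(student_hidden, teacher_hidden) + \
          (1.0 - F.cosine_similarity(student_hidden, teacher_hidden, dim=-1)).mean()
    # Discriminative term: cross-entropy against the teacher's discrete units.
    disc = F.cross_entropy(student_logits.transpose(1, 2), teacher_units)
    return alpha * reg + beta * disc
```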

* Submitted to ICASSP 2023 