Yujun Wang

CED: Consistent ensemble distillation for audio tagging

Sep 08, 2023
Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Junbo Zhang, Yujun Wang

Augmentation and knowledge distillation (KD) are well-established techniques employed in audio classification tasks, aimed at enhancing performance and reducing model sizes on the widely recognized Audioset (AS) benchmark. Although both techniques are effective individually, their combined use, called consistent teaching, has not been explored before. This paper proposes CED, a simple training framework that distils student models from large teacher ensembles with consistent teaching. To achieve this, CED efficiently stores logits as well as the augmentation methods on disk, making it scalable to large-scale datasets. Central to CED's efficacy is its label-free nature: only the stored logits are used to optimize a student model, requiring just 0.3% additional disk space for AS. The study trains various transformer-based models, including a 10M parameter model achieving a 49.0 mean average precision (mAP) on AS. Pretrained models and code are available at https://github.com/RicherMans/CED.
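
As a rough illustration of the consistent-teaching idea sketched in the abstract, the snippet below (PyTorch) precomputes ensemble logits once per clip under a seeded augmentation, stores them on disk, and later trains a student against those stored logits alone (label-free). All function names (dump_teacher_logits, distill_step, the toy augment) and the sigmoid-based soft targets are illustrative assumptions, not code from the CED repository.

```python
import torch
import torch.nn.functional as F

def augment(wave: torch.Tensor, seed: int) -> torch.Tensor:
    """Toy stand-in for the stored augmentation: seeded gain + time roll."""
    g = torch.Generator().manual_seed(seed)
    gain = 0.5 + torch.rand(1, generator=g).item()
    shift = int(torch.randint(0, wave.shape[-1], (1,), generator=g))
    return torch.roll(wave * gain, shifts=shift, dims=-1)

def dump_teacher_logits(teachers, dataset, path):
    """Precompute ensemble logits under a fixed augmentation seed per clip."""
    records = []
    for idx, (wave, _) in enumerate(dataset):          # ground-truth label is ignored
        seed = idx                                     # deterministic seed: student replays the same view
        aug = augment(wave, seed=seed)
        with torch.no_grad():
            logits = torch.stack([t(aug.unsqueeze(0)) for t in teachers]).mean(0)
        records.append({"seed": seed, "logits": logits.half().squeeze(0)})  # fp16 to save disk
    torch.save(records, path)

def distill_step(student, wave, record, optimizer):
    """Label-free KD step: the loss uses only the stored ensemble logits."""
    aug = augment(wave, seed=record["seed"])           # replay the identical augmentation
    target = torch.sigmoid(record["logits"].float())   # multi-label soft targets (Audioset-style)
    pred = student(aug.unsqueeze(0)).squeeze(0)
    loss = F.binary_cross_entropy_with_logits(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```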

Focus on the Sound around You: Monaural Target Speaker Extraction via Distance and Speaker Information

Jun 28, 2023
Jiuxin Lin, Peng Wang, Heinrich Dinkel, Jun Chen, Zhiyong Wu, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang

Previously, Target Speaker Extraction (TSE) has yielded outstanding performance in certain application scenarios for speech enhancement and source separation. However, obtaining auxiliary speaker-related information is still challenging in noisy environments with significant reverberation. Inspired by the recently proposed distance-based sound separation, we propose the near sound (NS) extractor, which leverages distance information for TSE to reliably extract speaker information without requiring prior speaker enrollment, a scheme called speaker embedding self-enrollment (SESE). Full- & sub-band modeling is introduced to enhance our NS-Extractor's adaptability to environments with significant reverberation. Experimental results on several cross-dataset evaluations demonstrate the effectiveness of our improvements and the excellent performance of our proposed NS-Extractor in different application scenarios.

* Accepted by InterSpeech2023 
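
A minimal sketch of what conditioning an extractor on distance rather than an enrolled speaker embedding could look like, assuming a simple mask-based pipeline; the module name, the binary near/far embedding and all layer sizes are hypothetical and not taken from the NS-Extractor paper.

```python
import torch
import torch.nn as nn

class DistanceConditionedExtractor(nn.Module):
    """Toy sketch: extract the 'near' source using a learned distance embedding
    instead of an enrolled speaker embedding (names and sizes are illustrative)."""
    def __init__(self, n_freq=257, hidden=256):
        super().__init__()
        self.distance_emb = nn.Embedding(2, hidden)      # 0 = near, 1 = far
        self.encoder = nn.GRU(n_freq, hidden, num_layers=2, batch_first=True)
        self.mask_head = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, mag_spec, near=True):
        # mag_spec: (batch, time, freq) magnitude spectrogram
        h, _ = self.encoder(mag_spec)
        cond = self.distance_emb(torch.full(
            (mag_spec.size(0),), 0 if near else 1,
            dtype=torch.long, device=mag_spec.device))
        mask = self.mask_head(h + cond.unsqueeze(1))     # broadcast the condition over time
        return mag_spec * mask                           # masked spectrogram of the near speaker

# usage
model = DistanceConditionedExtractor()
est = model(torch.rand(2, 100, 257), near=True)          # (2, 100, 257)
```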

Enhanced Neural Beamformer with Spatial Information for Target Speech Extraction

Jun 28, 2023
Aoqi Guo, Junnan Wu, Peng Gao, Wenbo Zhu, Qinwen Guo, Dazhi Gao, Yujun Wang

Recently, deep learning-based beamforming algorithms have shown promising performance in target speech extraction tasks. However, most systems do not fully utilize spatial information. In this paper, we propose a target speech extraction network that utilizes spatial information to enhance the performance of the neural beamformer. To achieve this, we first use the UNet-TCN structure to model input features, improving the estimation accuracy of the speech pre-separation module by avoiding the information loss caused by direct dimensionality reduction in other models. Furthermore, we introduce a multi-head cross-attention mechanism that enhances the neural beamformer's perception of spatial information by making full use of the spatial information received by the array. Experimental results demonstrate that our approach, which incorporates a more reasonable target mask estimation network and a spatial information-based cross-attention mechanism into the neural beamformer, effectively improves speech separation performance.
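
The snippet below sketches one way a multi-head cross-attention layer could inject spatial features (e.g., inter-channel phase differences) into pre-separated spectral features, as motivated above; the dimensions and the residual fusion are illustrative assumptions, not the paper's exact module.

```python
import torch
import torch.nn as nn

class SpatialCrossAttention(nn.Module):
    """Illustrative sketch (not the paper's exact design): pre-separated spectral
    features query spatial features such as inter-channel phase differences."""
    def __init__(self, spec_dim=257, spatial_dim=514, d_model=256, heads=4):
        super().__init__()
        self.spec_proj = nn.Linear(spec_dim, d_model)
        self.spat_proj = nn.Linear(spatial_dim, d_model)
        self.attn = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.out = nn.Linear(d_model, spec_dim)

    def forward(self, spec_feat, spatial_feat):
        # spec_feat: (B, T, spec_dim), spatial_feat: (B, T, spatial_dim)
        q = self.spec_proj(spec_feat)
        kv = self.spat_proj(spatial_feat)
        fused, _ = self.attn(q, kv, kv)          # queries: spectral, keys/values: spatial
        return self.out(fused + q)               # residual fusion, back to mask space

# usage: the fused output could drive a mask for an MVDR-style beamformer
x = SpatialCrossAttention()(torch.rand(2, 100, 257), torch.rand(2, 100, 514))
```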

AV-SepFormer: Cross-Attention SepFormer for Audio-Visual Target Speaker Extraction

Jun 25, 2023
Jiuxin Lin, Xinyu Cai, Heinrich Dinkel, Jun Chen, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Zhiyong Wu, Yujun Wang, Helen Meng

Visual information can serve as an effective cue for target speaker extraction (TSE) and is vital to improving extraction performance. In this paper, we propose AV-SepFormer, a SepFormer-based dual-scale attention model that utilizes cross- and self-attention to fuse and model features from the audio and visual modalities. AV-SepFormer splits the audio features into a number of chunks equal to the length of the visual feature sequence. Self- and cross-attention are then employed to model and fuse the multi-modal features. Furthermore, we use a novel 2D positional encoding that introduces positional information both between and within chunks, providing significant gains over the traditional positional encoding. Our model has two key advantages: the time granularity of the chunked audio features is synchronized with the visual features, which alleviates the harm caused by the mismatch between audio and video sampling rates; and by combining self- and cross-attention, feature fusion and speech extraction are unified within a single attention paradigm. The experimental results show that AV-SepFormer significantly outperforms other existing methods.

* Accepted by ICASSP2023 
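
A small sketch of the chunking and 2D positional-encoding ideas described above, assuming learnable inter-chunk and intra-chunk embeddings; the function and class names, the learnable (rather than sinusoidal) encodings and the example sizes are assumptions for illustration only.

```python
import torch
import torch.nn as nn

def chunk_audio(audio_feat, num_video_frames):
    """Split (B, T, D) audio features into num_video_frames chunks: (B, N, C, D)."""
    B, T, D = audio_feat.shape
    C = T // num_video_frames                       # audio frames per video frame
    return audio_feat[:, :C * num_video_frames].reshape(B, num_video_frames, C, D)

class TwoDPositionalEncoding(nn.Module):
    """Sketch of a 2D (inter-chunk + intra-chunk) learnable positional encoding."""
    def __init__(self, max_chunks, chunk_len, d_model):
        super().__init__()
        self.inter = nn.Parameter(torch.zeros(max_chunks, 1, d_model))
        self.intra = nn.Parameter(torch.zeros(1, chunk_len, d_model))

    def forward(self, chunks):                      # chunks: (B, N, C, D)
        N, C = chunks.shape[1], chunks.shape[2]
        return chunks + self.inter[:N] + self.intra[:, :C]

# usage: 50 video frames, 800 audio frames -> 16 audio frames per chunk
a = torch.rand(2, 800, 256)
chunks = chunk_audio(a, 50)
pe = TwoDPositionalEncoding(max_chunks=50, chunk_len=16, d_model=256)
encoded = pe(chunks)                                # ready for self-/cross-attention blocks
```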

Understanding temporally weakly supervised training: A case study for keyword spotting

May 30, 2023
Heinrich Dinkel, Weiji Zhuang, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang

The currently most prominent algorithm for training keyword spotting (KWS) models with deep neural networks (DNNs) requires strong supervision, i.e., precise knowledge of the spoken keyword's location in time. Thus, most KWS approaches treat the presence of redundant data, such as noise, within their training set as an obstacle. A common training paradigm for dealing with data redundancies is temporally weakly supervised learning, which only requires labels on a coarse scale. This study explores the limits of DNN training using temporally weak labeling with applications in KWS. We train a simple end-to-end classifier on the common Google Speech Commands dataset, increasing its difficulty by randomly appending audio and adding noise to the training data. Our results indicate that temporally weak labeling can achieve results comparable to strongly supervised baselines while imposing a less stringent labeling requirement. In the presence of noise, weakly supervised models are capable of localizing and extracting target keywords without explicit supervision, leading to a performance increase over strongly supervised approaches.
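
To make the weak-labeling setup concrete, the sketch below pools frame-level keyword posteriors into a clip-level prediction (here with linear-softmax pooling) so that only clip labels are needed; the architecture and the choice of pooling are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeakKWS(nn.Module):
    """Sketch of temporally weak training: frame-level keyword posteriors are
    pooled into a clip-level prediction, so only clip labels are required."""
    def __init__(self, n_mels=64, hidden=128, n_keywords=12):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        self.frame_head = nn.Linear(2 * hidden, n_keywords)

    def forward(self, mel):                                       # mel: (B, T, n_mels)
        h, _ = self.encoder(mel)
        frame_prob = torch.sigmoid(self.frame_head(h))            # (B, T, K)
        # linear-softmax pooling: frames with higher probability get higher weight,
        # which is what lets the model localize keywords without frame labels
        clip_prob = (frame_prob ** 2).sum(1) / frame_prob.sum(1).clamp(min=1e-7)
        return clip_prob, frame_prob

model = WeakKWS()
clip_prob, frame_prob = model(torch.rand(4, 200, 64))
loss = F.binary_cross_entropy(clip_prob, torch.randint(0, 2, (4, 12)).float())
```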

Streaming Audio Transformers for Online Audio Tagging

May 29, 2023
Heinrich Dinkel, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang

Transformers have emerged as a prominent model framework for audio tagging (AT), boasting state-of-the-art (SOTA) performance on the widely used Audioset dataset. However, their impressive performance often comes at the cost of high memory usage, slow inference speed, and considerable model delay, rendering them impractical for real-world AT applications. In this study, we introduce streaming audio transformers (SAT) that combine the vision transformer (ViT) architecture with Transformer-XL-like chunk processing, enabling efficient processing of long-range audio signals. Our proposed SAT is benchmarked against other transformer-based SOTA methods, achieving significant improvements in terms of mean average precision (mAP) at delays of 2s and 1s, while also exhibiting significantly lower memory usage and computational overhead. Checkpoints are publicly available at https://github.com/RicherMans/SAT.
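
A compact sketch of Transformer-XL-style chunk processing as described above: each chunk attends to itself plus a detached memory of the previous chunk, keeping per-step cost constant; the block layout and sizes are assumptions and do not reproduce the released SAT code.

```python
import torch
import torch.nn as nn

class StreamingBlock(nn.Module):
    """Sketch of Transformer-XL-style chunk attention: the current chunk attends
    to itself plus a cached (detached) memory of the previous chunk."""
    def __init__(self, d_model=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, chunk, memory=None):             # chunk: (B, C, D)
        context = chunk if memory is None else torch.cat([memory, chunk], dim=1)
        x = self.norm1(chunk + self.attn(chunk, context, context)[0])
        x = self.norm2(x + self.ff(x))
        return x, chunk.detach()                       # new memory: this chunk, no gradient

block = StreamingBlock()
memory = None
for chunk in torch.rand(10, 2, 50, 256):               # 10 consecutive chunks of 50 frames
    out, memory = block(chunk, memory)                 # constant memory and delay per step
```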

Exploring Representation Learning for Small-Footprint Keyword Spotting

Mar 20, 2023
Fan Cui, Liyong Guo, Quandong Wang, Peng Gao, Yujun Wang

In this paper, we investigate representation learning for low-resource keyword spotting (KWS). The main challenges of KWS are limited labeled data and limited available device resources. To address these challenges, we explore representation learning for KWS through self-supervised contrastive learning and self-training with a pretrained model. First, local-global contrastive siamese networks (LGCSiam) are designed to learn similar utterance-level representations for similar audio samples via the proposed local-global contrastive loss, without requiring ground-truth labels. Second, a self-supervised pretrained Wav2Vec 2.0 model is applied as a constraint module (WVC) to force the KWS model to learn frame-level acoustic representations. With the LGCSiam and WVC modules, the proposed small-footprint KWS model can be pretrained on unlabeled data. Experiments on the Speech Commands dataset show that the self-training WVC module and the self-supervised LGCSiam module significantly improve accuracy, especially when training on a small labeled dataset.
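
As a generic stand-in for the contrastive component, the snippet below implements a standard NT-Xent loss between utterance-level embeddings of two augmented views; it is not the exact local-global contrastive loss of LGCSiam, only a hedged sketch of the underlying idea.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.1):
    """Generic contrastive loss between two augmented views (not the exact
    LGCSiam formulation): matching views are positives, all others negatives."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)          # (2B, D)
    sim = z @ z.t() / tau
    sim.fill_diagonal_(float('-inf'))                    # exclude self-similarity
    B = z1.size(0)
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)])
    return F.cross_entropy(sim, targets)

# usage: z1, z2 are utterance-level embeddings pooled from frame-level features
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
loss = nt_xent(z1, z2)
```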

Relate auditory speech to EEG by shallow-deep attention-based network

Mar 20, 2023
Fan Cui, Liyong Guo, Lang He, Jiyao Liu, ErCheng Pei, Yujun Wang, Dongmei Jiang

Electroencephalography (EEG) plays a vital role in detecting how the brain responds to different stimuli. In this paper, we propose a novel Shallow-Deep Attention-based Network (SDANet) to classify which auditory stimulus evoked a given EEG signal. It adopts the Attention-based Correlation Module (ACM) to discover the connection between auditory speech and EEG from a global perspective, and the Shallow-Deep Similarity Classification Module (SDSCM) to decide the classification result via embeddings learned from the shallow and deep layers. Moreover, various training strategies and data augmentation techniques are used to boost model robustness. Experiments are conducted on the dataset provided by the Auditory EEG Challenge (ICASSP Signal Processing Grand Challenge 2023). Results show that the proposed model achieves a significant gain over the baseline on the match-mismatch track.
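
A toy sketch of attention-based matching between a speech stream and EEG, loosely inspired by the correlation module described above; the projection sizes, the cosine-similarity scoring and the class name are hypothetical and do not reflect the actual ACM/SDSCM design.

```python
import torch
import torch.nn as nn

class MatchMismatchScorer(nn.Module):
    """Toy sketch of scoring whether a speech segment matches an EEG segment via
    cross-attention (illustrative; not the paper's exact modules)."""
    def __init__(self, eeg_ch=64, speech_dim=28, d_model=128, heads=4):
        super().__init__()
        self.eeg_proj = nn.Linear(eeg_ch, d_model)
        self.speech_proj = nn.Linear(speech_dim, d_model)
        self.attn = nn.MultiheadAttention(d_model, heads, batch_first=True)

    def forward(self, eeg, speech):                   # (B, T, eeg_ch), (B, T, speech_dim)
        e, s = self.eeg_proj(eeg), self.speech_proj(speech)
        attended, _ = self.attn(e, s, s)              # EEG queries attend to the speech stream
        # higher cosine similarity -> this candidate stimulus more likely evoked the EEG
        return torch.cosine_similarity(attended.mean(1), e.mean(1), dim=-1)

scorer = MatchMismatchScorer()
score = scorer(torch.rand(2, 320, 64), torch.rand(2, 320, 28))  # (2,)
```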

Improving Weakly Supervised Sound Event Detection with Causal Intervention

Mar 10, 2023
Yifei Xin, Dongchao Yang, Fan Cui, Yujun Wang, Yuexian Zou

Existing weakly supervised sound event detection (WSSED) work has not explored both types of co-occurrence simultaneously: some sound events often co-occur with one another, and their occurrences are usually accompanied by specific background sounds. With only clip-level supervision, these factors become inevitably entangled, causing misclassification and biased localization results. To tackle this issue, we first establish a structural causal model (SCM) to reveal that the context is the main cause of co-occurrence confounders that mislead the model into learning spurious correlations between frames and clip-level labels. Based on this causal analysis, we propose a causal intervention (CI) method for WSSED that removes the negative impact of co-occurrence confounders by iteratively accumulating every possible context of each class and then re-projecting the contexts onto the frame-level features to make event boundaries clearer. Experiments show that our method effectively improves performance on multiple datasets and generalizes to various baseline models.

* Accepted by ICASSP2023 
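
The sketch below illustrates one plausible reading of the intervention step: a per-class context bank is accumulated with a running average and re-projected onto frame features via attention; the update rule, momentum and module name are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ContextIntervention(nn.Module):
    """Rough sketch of the intervention idea: keep a running per-class context
    bank and re-project it onto frame features (details are illustrative)."""
    def __init__(self, n_classes=10, feat_dim=256, momentum=0.9):
        super().__init__()
        self.register_buffer("context", torch.zeros(n_classes, feat_dim))
        self.momentum = momentum
        self.proj = nn.Linear(feat_dim, feat_dim)

    @torch.no_grad()
    def update(self, frame_feat, frame_prob):
        # frame_feat: (B, T, D), frame_prob: (B, T, K); class-weighted feature average
        pooled = torch.einsum('btk,btd->kd', frame_prob, frame_feat)
        pooled = pooled / frame_prob.sum(dim=(0, 1)).clamp(min=1e-7).unsqueeze(1)
        self.context.mul_(self.momentum).add_(pooled, alpha=1 - self.momentum)

    def forward(self, frame_feat):
        # attend from frames to the accumulated class contexts and add them back
        attn = torch.softmax(frame_feat @ self.context.t(), dim=-1)   # (B, T, K)
        return frame_feat + self.proj(attn @ self.context)            # de-confounded features

ci = ContextIntervention()
feat = torch.rand(4, 500, 256)
prob = torch.rand(4, 500, 10)
ci.update(feat, prob)
out = ci(feat)                                                        # (4, 500, 256)
```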