Yidi Jiang

EEG-Derived Voice Signature for Attended Speaker Detection

Aug 28, 2023
Hongxu Zhu, Siqi Cai, Yidi Jiang, Qiquan Zhang, Haizhou Li

\textit{Objective:} Conventional EEG-based auditory attention detection (AAD) is achieved by comparing the time-varying speech stimuli with the elicited EEG signals. However, to obtain reliable correlation values, these methods require a long decision window, which results in high detection latency. Humans have a remarkable ability to recognize and follow a known speaker, regardless of the spoken content. In this paper, we seek to detect the attended speaker among pre-enrolled speakers directly from the elicited EEG signals, thereby avoiding reliance on the speech stimuli for AAD at run-time. We formulate this as a novel EEG-based attended speaker detection (E-ASD) task. \textit{Methods:} We encode a speaker's voice with a fixed-dimensional vector, known as a speaker embedding, and project it to an audio-derived voice signature, which characterizes the speaker's unique voice regardless of the spoken content. We hypothesize that such a voice signature also exists in the listener's brain and can be decoded from the elicited EEG signals; we refer to it as the EEG-derived voice signature. By comparing the audio-derived and EEG-derived voice signatures, we can effectively detect the attended speaker in the listening brain. \textit{Results:} Experiments show that E-ASD can effectively detect the attended speaker from 0.5 s EEG decision windows, achieving 99.78\% AAD accuracy, 99.94\% AUC, and 0.27\% EER. \textit{Conclusion:} We conclude that it is possible to derive the attended speaker's voice signature from the EEG signals and thereby detect the attended speaker in a listening brain. \textit{Significance:} We present the first proof of concept for detecting the attended speaker from the elicited EEG signals in a cocktail party environment. The successful implementation of E-ASD marks a non-trivial but crucial step towards smart hearing aids.
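As a rough illustration of the run-time comparison described above, the sketch below (PyTorch) matches an EEG-derived signature against pre-enrolled audio-derived voice signatures with cosine similarity; the encoder, tensor shapes, and names are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of the E-ASD matching step, assuming a trained EEG encoder
# and pre-enrolled audio-derived voice signatures (both hypothetical here).
import torch
import torch.nn.functional as F

def detect_attended_speaker(eeg_window: torch.Tensor,
                            eeg_encoder: torch.nn.Module,
                            voice_signatures: torch.Tensor) -> int:
    """eeg_window: (channels, samples) EEG decision window (e.g. 0.5 s).
    voice_signatures: (num_speakers, dim) audio-derived signatures."""
    eeg_sig = eeg_encoder(eeg_window.unsqueeze(0))            # (1, dim) EEG-derived signature
    scores = F.cosine_similarity(eeg_sig, voice_signatures)   # (num_speakers,) similarities
    return int(scores.argmax())                               # index of the attended speaker
```

In this view, detection reduces to a nearest-neighbour comparison over the enrolled signatures rather than a stimulus-EEG correlation, which is why a long decision window is not needed.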

* 8 pages, 2 figures 

Target Active Speaker Detection with Audio-visual Cues

May 26, 2023
Yidi Jiang, Ruijie Tao, Zexu Pan, Haizhou Li

In active speaker detection (ASD), we would like to detect whether an on-screen person is speaking based on audio-visual cues. Previous studies have primarily focused on modeling the audio-visual synchronization cue, which depends on the video quality of the speaker's lip region. In real-world applications, reference speech of the on-screen speaker may also be available. To benefit from both the facial cue and the reference speech, we propose the Target Speaker TalkNet (TS-TalkNet), which leverages a pre-enrolled speaker embedding to complement the audio-visual synchronization cue in detecting whether the target speaker is speaking. Our framework outperforms the popular TalkNet model on two datasets, achieving absolute improvements of 1.6\% in mAP on the AVA-ActiveSpeaker validation set, and 0.8\%, 0.4\%, and 0.8\% in terms of AP, AUC, and EER on the ASW test set, respectively. Code is available at \href{https://github.com/Jiang-Yidi/TS-TalkNet/}{\color{red}{https://github.com/Jiang-Yidi/TS-TalkNet/}}.
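As a hedged sketch of how the pre-enrolled speaker embedding can complement the audio-visual stream, the snippet below concatenates the embedding with per-frame audio-visual features before a frame-level speaking classifier; the dimensions, layers, and names are illustrative assumptions and do not reproduce the released TS-TalkNet code (see the repository linked above).

```python
import torch
import torch.nn as nn

class SpeakerConditionedASD(nn.Module):
    """Illustrative fusion head: a pre-enrolled speaker embedding is broadcast
    over time, concatenated with audio-visual features, and classified
    frame-by-frame as speaking / not speaking."""
    def __init__(self, av_dim: int = 256, spk_dim: int = 192, hidden: int = 128):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(av_dim + spk_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # per-frame speaking logit
        )

    def forward(self, av_features: torch.Tensor, speaker_emb: torch.Tensor) -> torch.Tensor:
        # av_features: (batch, frames, av_dim); speaker_emb: (batch, spk_dim)
        spk = speaker_emb.unsqueeze(1).expand(-1, av_features.size(1), -1)
        fused = torch.cat([av_features, spk], dim=-1)
        return self.classifier(fused).squeeze(-1)  # (batch, frames) speaking logits
```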

* Accepted to INTERSPEECH2023 

Minimizing the Accumulated Trajectory Error to Improve Dataset Distillation

Nov 22, 2022
Jiawei Du, Yidi Jiang, Vincent Y. F. Tan, Joey Tianyi Zhou, Haizhou Li

Model-based deep learning has achieved astounding successes due in part to the availability of large-scale real-world data. However, processing such massive amounts of data comes at a considerable cost in terms of computation, storage, training, and the search for good neural architectures. Dataset distillation has thus recently come to the fore. This paradigm involves distilling information from large real-world datasets into tiny, compact synthetic datasets such that processing the latter yields performance similar to that of the former. State-of-the-art methods primarily rely on learning the synthetic dataset by matching the gradients obtained during training on the real and synthetic data. However, these gradient-matching methods suffer from accumulated trajectory error caused by the discrepancy between the distillation and subsequent evaluation phases. To alleviate the adverse impact of this accumulated trajectory error, we propose a novel approach that encourages the optimization algorithm to seek a flat trajectory. We show that, with regularization towards a flat trajectory, the weights trained on synthetic data are robust against accumulated-error perturbations. Our method, called Flat Trajectory Distillation (FTD), is shown to boost the performance of gradient-matching methods by up to 4.7% on a higher-resolution subset of the ImageNet dataset. We also validate the effectiveness and generalizability of our method on datasets of different resolutions and demonstrate its applicability to neural architecture search.
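The two ingredients above can be sketched as follows, under stated assumptions: a normalized trajectory-matching loss between student weights trained on synthetic data and a segment of an expert trajectory, plus an EMA-style smoothing of the expert weights as one possible way to encourage a flat expert trajectory. Names and tensor layouts are illustrative, and the exact flatness regularizer used in FTD may differ.

```python
import torch

def trajectory_matching_loss(student_params, expert_start, expert_end):
    """Normalized parameter-matching loss used by trajectory-matching dataset
    distillation: how close the student (trained on synthetic data from
    expert_start) lands to expert_end, relative to the expert's own movement."""
    num = sum(((s - e) ** 2).sum() for s, e in zip(student_params, expert_end))
    den = sum(((a - e) ** 2).sum() for a, e in zip(expert_start, expert_end))
    return num / (den + 1e-12)

@torch.no_grad()
def ema_update(ema_params, params, decay: float = 0.999):
    """Exponential moving average of expert weights; keeping the expert close to
    this smoothed trajectory during buffer training is one way to flatten it."""
    for e, p in zip(ema_params, params):
        e.mul_(decay).add_(p, alpha=1.0 - decay)
```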

Knowledge Distillation from BERT Transformer to Speech Transformer for Intent Classification

Aug 05, 2021
Yidi Jiang, Bidisha Sharma, Maulik Madhavi, Haizhou Li

End-to-end intent classification using speech has numerous advantages compared to the conventional pipeline approach of automatic speech recognition (ASR) followed by natural language processing modules. It attempts to predict intent from speech without using an intermediate ASR module. However, such an end-to-end framework suffers from the scarcity of large speech resources with high acoustic variation for spoken language understanding. In this work, we exploit a transformer distillation method specifically designed for knowledge distillation from a transformer-based language model to a transformer-based speech model. In this regard, we leverage the reliable and widely used bidirectional encoder representations from transformers (BERT) model as the language model and transfer its knowledge to build an acoustic model for intent classification from speech. In particular, a multi-level transformer-based teacher-student model is designed, and knowledge distillation is performed across attention and hidden sub-layers of different transformer layers of the student and teacher models. We achieve intent classification accuracies of 99.10% and 88.79% on the Fluent Speech corpus and the ATIS database, respectively. Further, the proposed method demonstrates better performance and robustness in acoustically degraded conditions compared to the baseline method.
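As a hedged sketch of the layer-wise distillation described above, the snippet below accumulates MSE losses between paired teacher and student attention maps and hidden states, using a learned projection to bridge any dimension mismatch; it assumes the speech (student) and text (teacher) sequences have already been aligned to a common length, and all names are illustrative rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def layerwise_distillation_loss(student_hidden, student_attn,
                                teacher_hidden, teacher_attn,
                                proj: torch.nn.Module):
    """Sum of attention-map and hidden-state MSE losses over paired layers.
    Each argument is a list of tensors, one per selected transformer layer;
    `proj` maps student hidden states to the teacher's hidden dimension."""
    loss = 0.0
    for s_h, s_a, t_h, t_a in zip(student_hidden, student_attn,
                                  teacher_hidden, teacher_attn):
        loss = loss + F.mse_loss(s_a, t_a)        # attention sub-layer distillation
        loss = loss + F.mse_loss(proj(s_h), t_h)  # hidden sub-layer distillation
    return loss
```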

* Interspeech 2021 