Hsin-Min Wang

Disentangling the Impacts of Language and Channel Variability on Speech Separation Networks

Mar 30, 2022

Fan-Lin Wang, Hung-Shin Lee, Yu Tsao, Hsin-Min Wang

Abstract: Because the performance of speech separation is excellent for speech in which two speakers completely overlap, research attention has shifted to more realistic scenarios. However, domain mismatch between training and test conditions due to factors such as speaker, content, channel, and environment remains a severe problem for speech separation. Speaker and environment mismatches have been studied in the existing literature. Nevertheless, there are few studies on speech content and channel mismatches. Moreover, the impacts of language and channel in these studies are mostly entangled. In this study, we create several datasets for various experiments. The results show that the impacts of different languages are small enough to be ignored compared to the impacts of different channels. In our experiments, training on data recorded by Android phones leads to the best generalizability. Moreover, we provide a new solution for channel mismatch by evaluating projection, where the channel similarity can be measured and used to effectively select additional training data to improve performance on in-the-wild test data.

* Submitted to Interspeech 2022

Multi-Target Filter and Detector for Speaker Diarization

Mar 30, 2022

Chin-Yi Cheng, Hung-Shin Lee, Yu Tsao, Hsin-Min Wang

Abstract: A good representation of a target speaker usually helps to extract important information about the speaker and detect the corresponding temporal regions in a multi-speaker conversation. In this paper, we propose a neural architecture that simultaneously extracts speaker embeddings consistent with the speaker diarization objective and detects the presence of each speaker frame by frame, regardless of the number of speakers in the conversation. To this end, a residual network (ResNet) and a dual-path recurrent neural network (DPRNN) are integrated into a unified structure. When tested on the 2-speaker CALLHOME corpus, our proposed model outperforms most methods published so far. Evaluated in a more challenging case of concurrent speakers ranging from two to seven, our system also achieves relative diarization error rate reductions of 26.35% and 6.4% over two typical baselines, namely the traditional x-vector clustering system and the attention-based system.

* Submitted to Interspeech 2022

Chain-based Discriminative Autoencoders for Speech Recognition

Mar 28, 2022

Hung-Shin Lee, Pin-Tuan Huang, Yao-Fei Cheng, Hsin-Min Wang

Abstract: In our previous work, we proposed a discriminative autoencoder (DcAE) for speech recognition. DcAE combines two training schemes into one. First, since DcAE aims to learn encoder-decoder mappings, the squared error between the reconstructed speech and the input speech is minimized. Second, in the code layer, frame-based phonetic embeddings are obtained by minimizing the categorical cross-entropy between ground truth labels and predicted triphone-state scores. DcAE is developed based on the Kaldi toolkit by treating various TDNN models as encoders. In this paper, we further propose three new versions of DcAE. First, a new objective function that considers both categorical cross-entropy and mutual information between ground truth and predicted triphone-state sequences is used. The resulting DcAE is called a chain-based DcAE (c-DcAE). For application to robust speech recognition, we further extend c-DcAE to hierarchical and parallel structures, resulting in hc-DcAE and pc-DcAE. In these two models, both the error between the reconstructed noisy speech and the input noisy speech and the error between the enhanced speech and the reference clean speech are included in the objective function. Experimental results on the WSJ and Aurora-4 corpora show that our DcAE models outperform baseline systems.

* Submitted to Interspeech 2022
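
The two training schemes of the original DcAE described in the abstract above (squared reconstruction error plus frame-level categorical cross-entropy) can be sketched roughly as follows. This is only an illustration in NumPy, not the authors' Kaldi implementation: the function name `dcae_loss`, the weighting factor `alpha`, and the averaging conventions are assumptions that do not come from the abstract.

```python
import numpy as np

def dcae_loss(x, x_hat, logits, labels, alpha=1.0):
    """Hypothetical combined objective: reconstruction error + alpha * cross-entropy.

    x, x_hat: (num_frames, feat_dim) input and reconstructed features.
    logits:   (num_frames, num_states) unnormalized triphone-state scores.
    labels:   (num_frames,) integer ground-truth state indices.
    alpha:    assumed trade-off weight (not specified in the abstract).
    """
    # Mean squared error between the input and its reconstruction.
    recon = np.mean((x - x_hat) ** 2)
    # Numerically stable log-softmax over the state scores of each frame.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Categorical cross-entropy against the ground-truth state of each frame.
    ce = -np.mean(log_probs[np.arange(len(labels)), labels])
    return recon + alpha * ce
```

With a perfect reconstruction and uniform state scores, the value reduces to the cross-entropy of a uniform guess, that is, the natural logarithm of the number of states.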

The VoiceMOS Challenge 2022

Mar 28, 2022

Wen-Chin Huang, Erica Cooper, Yu Tsao, Hsin-Min Wang, Tomoki Toda, Junichi Yamagishi

Abstract: We present the first edition of the VoiceMOS Challenge, a scientific event that aims to promote the study of automatic prediction of the mean opinion score (MOS) of synthetic speech. This challenge drew 22 participating teams from academia and industry who tried a variety of approaches to tackle the problem of predicting human ratings of synthesized speech. The listening test data for the main track of the challenge consisted of samples from 187 different text-to-speech and voice conversion systems spanning over a decade of research, and the out-of-domain track consisted of data from more recent systems rated in a separate listening test. Results of the challenge show the effectiveness of fine-tuning self-supervised speech models for the MOS prediction task, as well as the difficulty of predicting MOS ratings for unseen speakers and listeners, and for unseen systems in the out-of-domain setting.

* Submitted to Interspeech 2022

Subspace-based Representation and Learning for Phonotactic Spoken Language Recognition

Mar 28, 2022

Hung-Shin Lee, Yu Tsao, Shyh-Kang Jeng, Hsin-Min Wang

Abstract: Phonotactic constraints can be employed to distinguish languages by representing a speech utterance as a multinomial distribution of phone events. In the present study, we propose a new learning mechanism based on subspace-based representation, which can extract concealed phonotactic structures from utterances, for language verification and dialect/accent identification. The framework mainly involves two successive parts. The first part involves subspace construction. Specifically, it decodes each utterance into a sequence of vectors filled with phone-posteriors and transforms the vector sequence into a linear orthogonal subspace based on low-rank matrix factorization or dynamic linear modeling. The second part involves subspace learning based on kernel machines, such as support vector machines and the newly developed subspace-based neural networks (SNNs). The input layer of SNNs is specifically designed for samples represented by subspaces. The topology ensures that the same output can be derived from identical subspaces by modifying the conventional feed-forward pass to fit the mathematical definition of subspace similarity. Evaluated on the "General LR" test of NIST LRE 2007, the proposed method achieved up to 52%, 46%, 56%, and 27% relative reductions in equal error rates over the sequence-based PPR-LM, PPR-VSM, and PPR-IVEC methods and the lattice-based PPR-LM method, respectively. Furthermore, on the dialect/accent identification task of NIST LRE 2009, the SNN-based system performed better than the aforementioned four baseline methods.

* Published in IEEE/ACM Trans. Audio, Speech, Lang. Process., 2020, vol. 28, pp. 3065-3079
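
The subspace construction step described in the abstract above can be illustrated with a small NumPy sketch. It is not the paper's method as published: using a truncated SVD as the low-rank factorization and the projection kernel as the subspace similarity are assumptions made here for illustration, and the names `utterance_subspace` and `subspace_similarity` are hypothetical.

```python
import numpy as np

def utterance_subspace(posteriors, rank):
    """Map a phone-posterior sequence to an orthonormal basis.

    posteriors: (num_frames, num_phones) matrix, one posterior vector per frame.
    rank:       assumed subspace dimension, at most min(num_frames, num_phones).
    Returns a (num_phones, rank) matrix with orthonormal columns.
    """
    # Truncated SVD of the (num_phones x num_frames) matrix; the leading
    # left singular vectors span the dominant subspace of the sequence.
    u, _, _ = np.linalg.svd(posteriors.T, full_matrices=False)
    return u[:, :rank]

def subspace_similarity(basis_a, basis_b):
    """Projection kernel: sum of squared cosines of the principal angles."""
    return float(np.sum((basis_a.T @ basis_b) ** 2))
```

A similarity of this kind depends only on the subspace, not on the particular basis chosen for it, so two different orthonormal bases of the same subspace give the same value. That is the sort of invariance the abstract refers to when it says identical subspaces should produce the same output.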

Speech-enhanced and Noise-aware Networks for Robust Speech Recognition

Mar 25, 2022

Hung-Shin Lee, Pin-Yuan Chen, Yu Tsao, Hsin-Min Wang

Abstract: Compensation for channel mismatch and noise interference is essential for robust automatic speech recognition. Enhanced speech has been introduced into the multi-condition training of acoustic models to improve their generalization ability. In this paper, a noise-aware training framework based on two cascaded neural structures is proposed to jointly optimize speech enhancement and speech recognition. The feature enhancement module is composed of a multi-task autoencoder, where noisy speech is decomposed into clean speech and noise. By concatenating its enhanced, noise-aware, and noisy features for each frame, the acoustic-modeling module maps each feature-augmented frame into a triphone state by optimizing the lattice-free maximum mutual information and cross entropy between the predicted and actual state sequences. On top of the factorized time delay neural network (TDNN-F) and its convolutional variant (CNN-TDNNF), both with SpecAug, the two proposed systems achieve word error rates (WERs) of 3.90% and 3.55%, respectively, on the Aurora-4 task. Compared with the best existing systems that use bigram and trigram language models for decoding, the proposed CNN-TDNNF-based system achieves relative WER reductions of 15.20% and 33.53%, respectively. In addition, the proposed CNN-TDNNF-based system also outperforms the baseline CNN-TDNNF system on the AMI task.

* Submitted to Interspeech 2022
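
The per-frame feature augmentation mentioned in the abstract above (concatenating enhanced, noise-aware, and noisy features) might look roughly like this in NumPy. The function name `augment_frames` and the feature dimensions in the usage lines are hypothetical; the abstract does not state the actual dimensions or how the three streams are produced.

```python
import numpy as np

def augment_frames(enhanced, noise_aware, noisy):
    """Concatenate three per-frame feature streams along the feature axis.

    Each input has shape (num_frames, dim_i); the result has shape
    (num_frames, dim_enhanced + dim_noise_aware + dim_noisy).
    """
    if not (len(enhanced) == len(noise_aware) == len(noisy)):
        raise ValueError("all feature streams must have the same number of frames")
    return np.concatenate([enhanced, noise_aware, noisy], axis=1)

# Illustrative dimensions only: 10 frames with 40 features per stream.
frames = augment_frames(np.zeros((10, 40)), np.ones((10, 40)), np.full((10, 40), 2.0))
print(frames.shape)
```

Each augmented frame would then be the input that the acoustic-modeling module maps to a triphone state.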

Partially Fake Audio Detection by Self-attention-based Fake Span Discovery

Feb 15, 2022

Haibin Wu, Heng-Cheng Kuo, Naijun Zheng, Kuo-Hsuan Hung, Hung-Yi Lee, Yu Tsao, Hsin-Min Wang, Helen Meng

Abstract: The past few years have witnessed significant advances in speech synthesis and voice conversion technologies. However, such technologies can undermine the robustness of broadly implemented biometric identification models and can be harnessed by in-the-wild attackers for illegal uses. The ASVspoof challenge mainly focuses on audio synthesized by advanced speech synthesis and voice conversion models, and on replay attacks. Recently, the first Audio Deep Synthesis Detection challenge (ADD 2022) extended the attack scenarios to more aspects. ADD 2022 is also the first challenge to propose the partially fake audio detection task. Such brand-new attacks are dangerous, and how to tackle them remains an open question. Thus, we propose a novel framework that introduces a question-answering (fake span discovery) strategy with a self-attention mechanism to detect partially fake audio. The proposed fake span detection module tasks the anti-spoofing model with predicting the start and end positions of the fake clip within the partially fake audio, directs the model's attention toward discovering the fake spans rather than other, less generalizable shortcuts, and finally equips the model with the capacity to discriminate between real and partially fake audio. Our submission ranked second in the partially fake audio detection track of ADD 2022.

* Submitted to ICASSP 2022

EMGSE: Acoustic/EMG Fusion for Multimodal Speech Enhancement

Feb 14, 2022

Kuan-Chen Wang, Kai-Chun Liu, Hsin-Min Wang, Yu Tsao

Abstract: Multimodal learning has been proven to be an effective method to improve speech enhancement (SE) performance, especially in challenging situations such as low signal-to-noise ratios, speech noise, or unseen noise types. In previous studies, several types of auxiliary data have been used to construct multimodal SE systems, such as lip images, electropalatography, or electromagnetic midsagittal articulography. In this paper, we propose a novel EMGSE framework for multimodal SE, which integrates audio and facial electromyography (EMG) signals. Facial EMG is a biological signal containing articulatory movement information, which can be measured in a non-invasive way. Experimental results show that the proposed EMGSE system can achieve better performance than the audio-only SE system. The benefits of fusing EMG signals with acoustic signals for SE are notable under challenging circumstances. Furthermore, this study reveals that cheek EMG is sufficient for SE.

* 5 pages, 4 figures, and 3 tables

Deep Learning-based Non-Intrusive Multi-Objective Speech Assessment Model with Cross-Domain Features

Dec 01, 2021

Ryandhimas E. Zezario, Szu-Wei Fu, Fei Chen, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao

Abstract: In this study, we propose a cross-domain multi-objective speech assessment model called MOSA-Net, which can estimate multiple speech assessment metrics simultaneously. More specifically, MOSA-Net is designed to estimate the speech quality, intelligibility, and distortion assessment scores of an input test speech signal. It comprises a convolutional neural network and bidirectional long short-term memory (CNN-BLSTM) architecture for representation extraction, and a multiplicative attention layer and a fully-connected layer for each assessment metric. In addition, cross-domain features (spectral and time-domain features) and latent representations from self-supervised learned models are used as inputs to combine rich acoustic information from different speech representations to obtain more accurate assessments. Experimental results show that MOSA-Net can precisely predict perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and speech distortion index (SDI) scores when tested on noisy and enhanced speech utterances under either seen test conditions or unseen test conditions. Moreover, MOSA-Net, originally trained to assess objective scores, can be used as a pre-trained model to be effectively adapted to an assessment model for predicting subjective quality and intelligibility scores with a limited amount of training data. In light of the confirmed prediction capability, we further adopt the latent representations of MOSA-Net to guide the speech enhancement (SE) process and derive a quality-intelligibility (QI)-aware SE (QIA-SE) approach accordingly. Experimental results show that QIA-SE provides superior enhancement performance compared with the baseline SE system in terms of objective evaluation metrics and qualitative evaluation test.

HASA-net: A non-intrusive hearing-aid speech assessment network

Nov 10, 2021

Hsin-Tien Chiang, Yi-Chiao Wu, Cheng Yu, Tomoki Toda, Hsin-Min Wang, Yih-Chun Hu, Yu Tsao

Abstract: Because they do not need a clean reference, non-intrusive speech assessment methods have attracted great attention for objective evaluations. Recently, deep neural network (DNN) models have been applied to build non-intrusive speech assessment approaches and have been confirmed to provide promising performance. However, most DNN-based approaches are designed for normal-hearing listeners without considering hearing-loss factors. In this study, we propose a DNN-based hearing aid speech assessment network (HASA-Net), formed by a bidirectional long short-term memory (BLSTM) model, to predict speech quality and intelligibility scores simultaneously according to input speech signals and specified hearing-loss patterns. To the best of our knowledge, HASA-Net is the first work to incorporate quality and intelligibility assessments utilizing a unified DNN-based non-intrusive model for hearing aids. Experimental results show that the predicted speech quality and intelligibility scores of HASA-Net are highly correlated with two well-known intrusive hearing-aid evaluation metrics, hearing aid speech quality index (HASQI) and hearing aid speech perception index (HASPI), respectively.