Abstract:In traditional speaker diarization systems, a well-trained speaker model is a key component for extracting representations from consecutive and partially overlapping segments in a long speech session. To be more consistent with the back-end segmentation and clustering, we propose a new CNN-based speaker modeling scheme that takes into account the heterogeneity of the speakers in each training segment and batch. We randomly and synthetically augment the training data into a set of segments, each of which contains more than one speaker and some overlapping parts. A soft label is imposed on each segment based on its speaker occupation ratio, and the standard cross-entropy loss is used in model training. In this way, the speaker model should be able to generate a geometrically meaningful embedding for each multi-speaker segment. Experimental results show that our system is superior to the baseline system using x-vectors in two speaker diarization tasks. In the CALLHOME task, with models trained on the NIST SRE and Switchboard datasets, our system achieves a relative reduction of 12.93% in DER. In Track 2 of CHiME-6, our system provides 13.24%, 12.60%, and 5.65% relative reductions in DER, JER, and WER, respectively.
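The soft-label construction can be sketched as follows; this is a minimal illustration, not the authors' code, and names such as `frame_speaker_ids` and `num_speakers` are hypothetical. Each synthetic segment is labeled by the fraction of frames each speaker occupies, and the loss is cross-entropy against that soft target.

```python
import torch
import torch.nn.functional as F

def occupation_soft_label(frame_speaker_ids, num_speakers):
    """frame_speaker_ids: LongTensor (T,) holding the speaker index of every frame."""
    counts = torch.bincount(frame_speaker_ids, minlength=num_speakers).float()
    return counts / counts.sum()          # speaker occupation ratio, sums to 1

def soft_cross_entropy(logits, soft_targets):
    """Cross-entropy between predicted speaker posteriors and soft labels."""
    log_probs = F.log_softmax(logits, dim=-1)
    return -(soft_targets * log_probs).sum(dim=-1).mean()

# toy example: a 2-speaker segment where speaker 3 occupies 60% of the frames
frames = torch.tensor([3, 3, 3, 7, 7])
target = occupation_soft_label(frames, num_speakers=10)   # 0.6 at index 3, 0.4 at index 7
logits = torch.randn(1, 10)                               # one segment, 10 training speakers
loss = soft_cross_entropy(logits, target.unsqueeze(0))
```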
Abstract:Because speech separation now performs excellently on speech in which two speakers completely overlap, research attention has shifted to more realistic scenarios. However, domain mismatch between training and test conditions due to factors such as speaker, content, channel, and environment remains a severe problem for speech separation. Speaker and environment mismatches have been studied in the existing literature. Nevertheless, there are few studies on speech content and channel mismatches, and in those studies the impacts of language and channel are mostly entangled. In this study, we create several datasets for various experiments. The results show that the impacts of different languages are small enough to be ignored compared with the impacts of different channels. In our experiments, training on data recorded by Android phones leads to the best generalizability. Moreover, we provide a new solution for channel mismatch by evaluating projections, whereby channel similarity can be measured and used to effectively select additional training data, improving performance on in-the-wild test data.
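One plausible way to realize projection-based channel-similarity scoring is sketched below; it is an assumption-laden illustration (the PCA basis, the energy-ratio score, and the `top_k` selection are all placeholders), not the paper's exact procedure.

```python
import numpy as np

def channel_subspace(embeddings, rank=10):
    """embeddings: (N, D) utterance-level vectors recorded over the target channel."""
    mean = embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(embeddings - mean, full_matrices=False)
    return mean, vt[:rank]                               # (D,) mean and (rank, D) orthonormal rows

def projection_similarity(x, mean, basis):
    """Fraction of the centered vector's energy preserved by the channel subspace (0..1)."""
    v = x - mean
    proj = basis.T @ (basis @ v)
    return float(v @ proj / (v @ v + 1e-8))

def select_by_channel(candidates, mean, basis, top_k):
    """candidates: (M, D); returns indices of the most channel-similar utterances."""
    scores = np.array([projection_similarity(c, mean, basis) for c in candidates])
    return np.argsort(scores)[::-1][:top_k]
```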
Abstract:A good representation of a target speaker usually helps to extract important information about the speaker and detect the corresponding temporal regions in a multi-speaker conversation. In this paper, we propose a neural architecture that simultaneously extracts speaker embeddings consistent with the speaker diarization objective and detects the presence of each speaker frame by frame, regardless of the number of speakers in the conversation. To this end, a residual network (ResNet) and a dual-path recurrent neural network (DPRNN) are integrated into a unified structure. When tested on the 2-speaker CALLHOME corpus, our proposed model outperforms most methods published so far. Evaluated in a more challenging case of concurrent speakers ranging from two to seven, our system also achieves relative diarization error rate reductions of 26.35% and 6.4% over two typical baselines, namely the traditional x-vector clustering system and the attention-based system.
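A schematic PyTorch skeleton of the ResNet-plus-DPRNN idea is given below; it is an illustrative reimplementation, not the released model, and the small residual front end, the chunk size, and the fixed `max_speakers` output are assumptions.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv1d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv1d(ch, ch, 3, padding=1)
        self.act = nn.ReLU()
    def forward(self, x):                       # x: (B, ch, T)
        return self.act(x + self.conv2(self.act(self.conv1(x))))

class DualPathBlock(nn.Module):
    """One dual-path block: intra-chunk BiLSTM followed by inter-chunk BiLSTM."""
    def __init__(self, dim):
        super().__init__()
        self.intra = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)
        self.inter = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)
    def forward(self, x):                       # x: (B, chunks C, chunk_len K, dim D)
        b, c, k, d = x.shape
        y, _ = self.intra(x.reshape(b * c, k, d))
        x = x + y.reshape(b, c, k, d)                                 # residual, within chunks
        y, _ = self.inter(x.permute(0, 2, 1, 3).reshape(b * k, c, d))
        x = x + y.reshape(b, k, c, d).permute(0, 2, 1, 3)             # residual, across chunks
        return x

class DiarizationNet(nn.Module):
    def __init__(self, feat_dim=80, dim=128, max_speakers=7, chunk=50):
        super().__init__()
        self.chunk = chunk
        self.front = nn.Sequential(nn.Conv1d(feat_dim, dim, 5, padding=2),
                                   ResBlock(dim), ResBlock(dim))
        self.dprnn = DualPathBlock(dim)
        self.head = nn.Linear(dim, max_speakers)
    def forward(self, feats):                   # feats: (B, T, feat_dim), T divisible by chunk
        x = self.front(feats.transpose(1, 2)).transpose(1, 2)         # (B, T, dim)
        b, t, d = x.shape
        x = self.dprnn(x.reshape(b, t // self.chunk, self.chunk, d)).reshape(b, t, d)
        return torch.sigmoid(self.head(x))      # (B, T, max_speakers) per-frame activity
```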
Abstract:In our previous work, we proposed a discriminative autoencoder (DcAE) for speech recognition. DcAE combines two training schemes into one. First, since DcAE aims to learn encoder-decoder mappings, the squared error between the reconstructed speech and the input speech is minimized. Second, in the code layer, frame-based phonetic embeddings are obtained by minimizing the categorical cross-entropy between ground-truth labels and predicted triphone-state scores. DcAE is developed based on the Kaldi toolkit by treating various TDNN models as encoders. In this paper, we further propose three new versions of DcAE. First, a new objective function that considers both categorical cross-entropy and mutual information between ground-truth and predicted triphone-state sequences is used; the resulting model is called a chain-based DcAE (c-DcAE). For application to robust speech recognition, we then extend c-DcAE to hierarchical and parallel structures, resulting in hc-DcAE and pc-DcAE. In these two models, both the error between the reconstructed noisy speech and the input noisy speech and the error between the enhanced speech and the reference clean speech are included in the objective function. Experimental results on the WSJ and Aurora-4 corpora show that our DcAE models outperform baseline systems.
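A simplified sketch of the original DcAE objective (reconstruction error plus code-layer cross-entropy) is shown below; the actual systems are built in Kaldi with TDNN encoders, so the tiny feed-forward encoder/decoder and the weight `alpha` here are purely illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDcAE(nn.Module):
    def __init__(self, feat_dim=40, code_dim=256, num_states=3000):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, code_dim), nn.ReLU(),
                                     nn.Linear(code_dim, code_dim), nn.ReLU())
        self.state_head = nn.Linear(code_dim, num_states)   # code layer -> triphone-state scores
        self.decoder = nn.Sequential(nn.Linear(code_dim, code_dim), nn.ReLU(),
                                     nn.Linear(code_dim, feat_dim))
    def forward(self, feats):                                # feats: (N, feat_dim) frames
        code = self.encoder(feats)
        return self.state_head(code), self.decoder(code)

def dcae_loss(model, feats, state_labels, alpha=0.5):
    """Combine phonetic classification and input reconstruction in one objective."""
    logits, recon = model(feats)
    ce = F.cross_entropy(logits, state_labels)               # cross-entropy term
    mse = F.mse_loss(recon, feats)                           # squared-error reconstruction term
    return ce + alpha * mse
```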
Abstract:Phonotactic constraints can be employed to distinguish languages by representing a speech utterance as a multinomial distribution of phone events. In the present study, we propose a new learning mechanism based on subspace-based representation, which can extract concealed phonotactic structures from utterances, for language verification and dialect/accent identification. The framework mainly involves two successive parts. The first part involves subspace construction. Specifically, it decodes each utterance into a sequence of vectors filled with phone posteriors and transforms the vector sequence into a linear orthogonal subspace based on low-rank matrix factorization or dynamic linear modeling. The second part involves subspace learning based on kernel machines, such as support vector machines and the newly developed subspace-based neural networks (SNNs). The input layer of SNNs is specifically designed for samples represented by subspaces, and the topology ensures that identical subspaces yield the same output by modifying the conventional feed-forward pass to fit the mathematical definition of subspace similarity. Evaluated on the "General LR" test of NIST LRE 2007, the proposed method achieved up to 52%, 46%, 56%, and 27% relative reductions in equal error rates over the sequence-based PPR-LM, PPR-VSM, and PPR-IVEC methods and the lattice-based PPR-LM method, respectively. Furthermore, on the dialect/accent identification task of NIST LRE 2009, the SNN-based system performed better than the aforementioned four baseline methods.
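The subspace construction and a subspace similarity based on principal angles can be sketched as follows; the rank, the SVD-based factorization, and the averaged squared-cosine similarity are assumptions made for illustration rather than the paper's exact definitions.

```python
import numpy as np

def utterance_subspace(posteriors, rank=5):
    """posteriors: (T, P) sequence of phone-posterior vectors for one utterance."""
    u, _, _ = np.linalg.svd(posteriors.T, full_matrices=False)   # low-rank factorization
    return u[:, :rank]                                           # (P, rank), orthonormal columns

def subspace_similarity(a, b):
    """Average squared cosine of the principal angles between two subspaces."""
    cosines = np.linalg.svd(a.T @ b, compute_uv=False)
    return float(np.mean(cosines ** 2))
```

A kernel built from `subspace_similarity` can feed an SVM, while an SNN-style model would evaluate the same quantity inside its modified feed-forward pass so identical subspaces always produce identical outputs.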
Abstract:Compensation for channel mismatch and noise interference is essential for robust automatic speech recognition. Enhanced speech has been introduced into the multi-condition training of acoustic models to improve their generalization ability. In this paper, a noise-aware training framework based on two cascaded neural structures is proposed to jointly optimize speech enhancement and speech recognition. The feature-enhancement module is a multi-task autoencoder that decomposes noisy speech into clean speech and noise. By concatenating the enhanced, noise-aware, and noisy features for each frame, the acoustic-modeling module maps each feature-augmented frame to a triphone state by optimizing the lattice-free maximum mutual information and cross-entropy between the predicted and actual state sequences. On top of the factorized time delay neural network (TDNN-F) and its convolutional variant (CNN-TDNNF), both with SpecAug, the two proposed systems achieve word error rates (WERs) of 3.90% and 3.55%, respectively, on the Aurora-4 task. Compared with the best existing systems that use bigram and trigram language models for decoding, the proposed CNN-TDNNF-based system achieves relative WER reductions of 15.20% and 33.53%, respectively. In addition, the proposed CNN-TDNNF-based system also outperforms the baseline CNN-TDNNF system on the AMI task.
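A minimal sketch of the feature-enhancement front end and the noise-aware feature stacking is given below; it is not the Kaldi recipe itself, and the layer sizes are placeholders.

```python
import torch
import torch.nn as nn

class MultiTaskEnhancer(nn.Module):
    """Multi-task autoencoder that decomposes a noisy frame into speech and noise estimates."""
    def __init__(self, feat_dim=40, hidden=512):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, hidden), nn.ReLU())
        self.speech_head = nn.Linear(hidden, feat_dim)   # clean-speech estimate
        self.noise_head = nn.Linear(hidden, feat_dim)    # noise estimate
    def forward(self, noisy):                            # noisy: (B, T, feat_dim)
        h = self.shared(noisy)
        return self.speech_head(h), self.noise_head(h)

def noise_aware_features(enhancer, noisy):
    """Build the frame-level input of the acoustic-modeling module."""
    enhanced, noise = enhancer(noisy)
    return torch.cat([enhanced, noise, noisy], dim=-1)   # (B, T, 3 * feat_dim)
```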
Abstract:The surprisingness of a song is an essential and seemingly subjective factor in determining whether the listener likes it. With the help of information theory, it can be characterized by the transition probabilities of a music sequence modeled as a Markov chain. In this study, we introduce the concept of deriving entropy variations over time, so that the surprise contour of each chord sequence can be extracted. Based on this, we propose a user-controllable framework that uses a conditional variational autoencoder (CVAE) to harmonize the melody based on the given chord surprise indication. Through explicit conditioning, the model can randomly generate diverse and harmonious chord progressions for a melody, and Spearman's rank correlation with its p-value shows that the resulting chord progressions match the given surprise contour quite well. The vanilla CVAE model was also evaluated on a basic melody harmonization task (without surprise control) in terms of six objective metrics. The results of experiments on the Hooktheory Lead Sheet Dataset show that our model achieves performance comparable to the state-of-the-art melody harmonization model.
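A small sketch of how a surprise contour can be derived from a first-order Markov chain over chords is shown below; the add-`smoothing` probability estimate and the bit-valued surprisal are assumptions made for illustration, not the paper's exact formulation.

```python
import math
from collections import Counter, defaultdict

def estimate_transitions(chord_sequences, smoothing=1e-3):
    """Estimate smoothed first-order transition probabilities from a chord corpus."""
    counts = defaultdict(Counter)
    vocab = set()
    for seq in chord_sequences:
        vocab.update(seq)
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    def prob(prev, cur):
        total = sum(counts[prev].values()) + smoothing * len(vocab)
        return (counts[prev][cur] + smoothing) / total
    return prob

def surprise_contour(seq, prob):
    """Surprisal (in bits) of each chord given its predecessor."""
    return [-math.log2(prob(prev, cur)) for prev, cur in zip(seq, seq[1:])]

# toy usage: rare transitions yield higher surprisal values
corpus = [["C", "G", "Am", "F", "C"], ["C", "F", "G", "C"]]
prob = estimate_transitions(corpus)
contour = surprise_contour(["C", "G", "C"], prob)
```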
Abstract:Speech separation has been extensively studied in recent years to deal with the cocktail party problem. Related approaches can be divided into two categories: time-frequency domain methods and time-domain methods. In addition, some methods try to generate speaker vectors to support source separation. In this study, we propose a new model called the dual-path filter network (DPFN). Our model focuses on the post-processing of speech separation to improve separation performance. DPFN is composed of two parts: the speaker module and the separation module. First, the speaker module infers the identities of the speakers. Then, the separation module uses the speaker information to extract the voice of each individual speaker from the mixture. DPFN, built on DPRNN-TasNet, not only outperforms DPRNN-TasNet but also avoids the problem of permutation-invariant training (PIT).
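A speculative skeleton of the two-stage idea is sketched below; the actual DPFN layers differ, and the feed-forward speaker and separation modules here are placeholders meant only to show how conditioning on an inferred speaker embedding ties each output to an identity so that a permutation-invariant loss is not needed.

```python
import torch
import torch.nn as nn

class SpeakerModule(nn.Module):
    """Summarize one preliminary source estimate into a speaker embedding."""
    def __init__(self, feat_dim=256, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, emb_dim), nn.ReLU(),
                                 nn.Linear(emb_dim, emb_dim))
    def forward(self, source_feats):              # (B, T, feat_dim)
        return self.net(source_feats).mean(dim=1) # (B, emb_dim) utterance-level embedding

class SeparationModule(nn.Module):
    """Re-estimate one speaker's signal from the mixture, conditioned on the embedding."""
    def __init__(self, feat_dim=256, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim + emb_dim, feat_dim), nn.ReLU(),
                                 nn.Linear(feat_dim, feat_dim), nn.Sigmoid())
    def forward(self, mixture_feats, spk_emb):    # (B, T, feat_dim), (B, emb_dim)
        cond = spk_emb.unsqueeze(1).expand(-1, mixture_feats.size(1), -1)
        mask = self.net(torch.cat([mixture_feats, cond], dim=-1))
        return mask * mixture_feats               # speaker-specific estimate
```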
Abstract:Nowadays, neural vocoders can generate very high-fidelity speech when a large amount of training data is available. Although a speaker-dependent (SD) vocoder usually outperforms a speaker-independent (SI) vocoder, it is impractical to collect a large amount of data from a specific target speaker for most real-world applications. To tackle the problem of limited target data, this paper proposes a data augmentation method based on speaker representations and the similarity measurement used in speaker verification. The proposed method selects, from an external corpus, utterances whose speaker identity is similar to that of the target speaker, and then combines the selected utterances with the limited target data for SD vocoder adaptation. The evaluation results show that, compared with the vocoder adapted using only the limited target data, the vocoder adapted using the augmented data improves both the quality and the similarity of the synthesized speech.
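The augmentation-by-similarity selection can be sketched as follows; the speaker-embedding extractor is assumed to exist upstream, and cosine similarity to the target centroid with a `top_k` cutoff is an illustrative choice rather than the paper's exact criterion.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def select_similar_utterances(target_embs, external_embs, top_k):
    """target_embs: (N, D) embeddings of the limited target data;
    external_embs: (M, D) embeddings of the external corpus.
    Returns indices of the external utterances most similar to the target speaker."""
    centroid = np.mean(target_embs, axis=0)
    scores = np.array([cosine(e, centroid) for e in external_embs])
    return np.argsort(scores)[::-1][:top_k]
```

The selected utterances would then simply be pooled with the limited target data before SD vocoder adaptation.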
Abstract:The end-to-end architecture has made promising progress in speech translation (ST). However, the ST task is still challenging under low-resource conditions. Most ST models have shown unsatisfactory results, especially in the absence of word information from the source speech utterance. In this study, we survey methods to improve ST performance without using source transcription, and propose a learning framework that utilizes a language-independent universal phone recognizer. The framework is based on an attention-based sequence-to-sequence model, where the encoder generates the phonetic embeddings and phone-aware acoustic representations, and the decoder controls the fusion of the two embedding streams to produce the target token sequence. In addition to investigating different fusion strategies, we explore the specific usage of byte pair encoding (BPE), which compresses a phone sequence into a syllable-like segmented sequence with semantic information. Experiments conducted on the Fisher Spanish-English and Taigi-Mandarin drama corpora show that our method outperforms the Conformer-based baseline, and its performance is close to that of the existing best method that uses source transcription.
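A toy illustration of byte pair encoding applied to phone sequences is given below; it is not the toolkit used in the paper, merely a sketch of how frequently co-occurring adjacent phones merge into longer, syllable-like units.

```python
from collections import Counter

def learn_bpe(phone_sequences, num_merges):
    """Repeatedly merge the most frequent adjacent phone pair into a single unit."""
    seqs = [list(seq) for seq in phone_sequences]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        merged = best[0] + "+" + best[1]            # e.g. ("t", "a") -> "t+a"
        for seq in seqs:
            i = 0
            while i < len(seq) - 1:
                if (seq[i], seq[i + 1]) == best:
                    seq[i:i + 2] = [merged]
                else:
                    i += 1
    return merges

# toy usage: the frequent pair ("t", "a") is merged first
merges = learn_bpe([["t", "a", "m", "a", "t", "o"], ["t", "a", "k", "a"]], num_merges=2)
```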