"speech": models, code, and papers

Multimodal Clustering with Role Induced Constraints for Speaker Diarization

Apr 01, 2022
Nikolaos Flemotomos, Shrikanth Narayanan

Speaker clustering is an essential step in conventional speaker diarization systems and is typically addressed as an audio-only speech processing task. The language used by the participants in a conversation, however, carries additional information that can help improve the clustering performance. This is especially true in conversational interactions, such as business meetings, interviews, and lectures, where specific roles assumed by interlocutors (manager, client, teacher, etc.) are often associated with distinguishable linguistic patterns. In this paper we propose to employ a supervised text-based model to extract speaker roles and then use this information to guide an audio-based spectral clustering step by imposing must-link and cannot-link constraints between segments. The proposed method is applied on two different domains, namely on medical interactions and on podcast episodes, and is shown to yield improved results when compared to the audio-only approach.

* Submitted at Interspeech 2022 
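
The constraint idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice of overwriting affinities with 1 (must-link) and 0 (cannot-link), and the two-speaker split via the sign of the Fiedler vector are all assumptions made for illustration, and the toy affinity values are invented.

```python
import numpy as np

def apply_constraints(affinity, must_link, cannot_link):
    """Return a copy of a symmetric affinity matrix with pairwise constraints imposed.

    Assumption for illustration: must-link pairs get affinity 1, cannot-link pairs get 0.
    """
    a = np.array(affinity, dtype=float)
    for i, j in must_link:
        a[i, j] = a[j, i] = 1.0
    for i, j in cannot_link:
        a[i, j] = a[j, i] = 0.0
    return a

def two_way_spectral_split(affinity):
    """Split segments into two clusters using the sign of the Fiedler vector
    of the symmetric normalized graph Laplacian."""
    a = np.asarray(affinity, dtype=float)
    degree = np.maximum(a.sum(axis=1), 1e-12)  # guard against isolated segments
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    laplacian = np.eye(len(a)) - d_inv_sqrt[:, None] * a * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                 # second-smallest eigenvalue
    return (fiedler > 0).astype(int)

# Hypothetical audio-based affinities between four speech segments.
audio_affinity = np.array([
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.9, 1.0],
])
# Hypothetical role-derived constraints: segments 0 and 1 share a role,
# segments 0 and 2 have different roles.
constrained = apply_constraints(audio_affinity, must_link=[(0, 1)], cannot_link=[(0, 2)])
labels = two_way_spectral_split(constrained)
```

The cluster ids themselves are arbitrary (the eigenvector sign is not fixed), so only co-membership of segments is meaningful.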


Slangvolution: A Causal Analysis of Semantic Change and Frequency Dynamics in Slang

Mar 09, 2022
Daphna Keidar, Andreas Opedal, Zhijing Jin, Mrinmaya Sachan

Languages are continuously undergoing changes, and the mechanisms that underlie these changes are still a matter of debate. In this work, we approach language evolution through the lens of causality in order to model not only how various distributional factors associate with language change, but how they causally affect it. In particular, we study slang, which is an informal language that is typically restricted to a specific group or social setting. We analyze the semantic change and frequency shift of slang words and compare them to those of standard, nonslang words. With causal discovery and causal inference techniques, we measure the effect that word type (slang/nonslang) has on both semantic change and frequency shift, as well as its relationship to frequency, polysemy and part of speech. Our analysis provides some new insights in the study of semantic change, e.g., we show that slang words undergo less semantic change but tend to have larger frequency shifts over time.

* Accepted as a main conference paper at ACL 2022 


The HCCL-DKU system for fake audio generation task of the 2022 ICASSP ADD Challenge

Jan 29, 2022
Ziyi Chen, Hua Hua, Yuxiang Zhang, Ming Li, Pengyuan Zhang

The voice conversion task is to modify the speaker identity of continuous speech while preserving the linguistic content. Generally, the naturalness and similarity are two main metrics for evaluating the conversion quality, which has been improved significantly in recent years. This paper presents the HCCL-DKU entry for the fake audio generation task of the 2022 ICASSP ADD challenge. We propose a novel ppg-based voice conversion model that adopts a fully end-to-end structure. Experimental results show that the proposed method outperforms other conversion models, including Tacotron-based and Fastspeech-based models, on conversion quality and spoofing performance against anti-spoofing systems. In addition, we investigate several post-processing methods for better spoofing power. Finally, we achieve second place with a deception success rate of 0.916 in the ADD challenge.


SIG-VC: A Speaker Information Guided Zero-shot Voice Conversion System for Both Human Beings and Machines

Nov 06, 2021
Zhang Haozhe, Cai Zexin, Qin Xiaoyi, Li Ming

Nowadays, as more and more systems achieve good performance in traditional voice conversion (VC) tasks, people's attention gradually turns to VC tasks under extreme conditions. In this paper, we propose a novel method for zero-shot voice conversion. We aim to obtain intermediate representations for speaker-content disentanglement of speech to better remove speaker information and get pure content information. Accordingly, our proposed framework contains a module that removes the speaker information from the acoustic feature of the source speaker. Moreover, speaker information control is added to our system to maintain the voice cloning performance. The proposed system is evaluated by subjective and objective metrics. Results show that our proposed system significantly reduces the trade-off problem in zero-shot voice conversion, while it also manages to have high spoofing power to the speaker verification system.


Contrastive prediction strategies for unsupervised segmentation and categorization of phonemes and words

Oct 29, 2021
Santiago Cuervo, Maciej Grabias, Jan Chorowski, Grzegorz Ciesielski, Adrian Łańcucki, Paweł Rychlikowski, Ricard Marxer

We investigate the performance on phoneme categorization and phoneme and word segmentation of several self-supervised learning (SSL) methods based on Contrastive Predictive Coding (CPC). Our experiments show that with the existing algorithms there is a trade-off between categorization and segmentation performance. We investigate the source of this conflict and conclude that the use of context building networks, albeit necessary for superior performance on categorization tasks, harms segmentation performance by causing a temporal shift on the learned representations. Aiming to bridge this gap, we take inspiration from the leading approach on segmentation, which simultaneously models the speech signal at the frame and phoneme level, and incorporate multi-level modelling into Aligned CPC (ACPC), a variation of CPC which exhibits the best performance on categorization tasks. Our multi-level ACPC (mACPC) improves in all categorization metrics and achieves state-of-the-art performance in word segmentation.


Efficient Sequence Training of Attention Models using Approximative Recombination

Oct 18, 2021
Nils-Philipp Wynands, Wilfried Michel, Jan Rosendahl, Ralf Schlüter, Hermann Ney

Sequence discriminative training is a great tool to improve the performance of an automatic speech recognition system. It does, however, necessitate a sum over all possible word sequences, which is intractable to compute in practice. Current state-of-the-art systems with unlimited label context circumvent this problem by limiting the summation to an n-best list of relevant competing hypotheses obtained from beam search. This work proposes to perform (approximative) recombinations of hypotheses during beam search, if they share a common local history. The error that is incurred by the approximation is analyzed and it is shown that using this technique the effective beam size can be increased by several orders of magnitude without significantly increasing the computational requirements. Lastly, it is shown that this technique can be used to effectively perform sequence discriminative training for attention-based encoder-decoder acoustic models on the LibriSpeech task.

* submitted to ICASSP 2022 
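
The n-best approximation mentioned in the abstract above can be sketched as follows. This is a hedged illustration only: it shows how restricting the normalization sum to a short hypothesis list changes the reference posterior, and it does not implement the paper's approximative recombination. The function names and the toy scores are assumptions.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_posterior(reference, log_scores):
    """Log posterior of the reference sequence, normalized over the given hypotheses.

    log_scores maps each hypothesis to its unnormalized log score; the
    reference must be one of the hypotheses.
    """
    return log_scores[reference] - log_sum_exp(list(log_scores.values()))

def n_best(log_scores, n):
    """Keep only the n highest-scoring hypotheses (a stand-in for beam search output)."""
    top = sorted(log_scores.items(), key=lambda kv: kv[1], reverse=True)[:n]
    return dict(top)

# Hypothetical scores for three competing word sequences.
all_hyps = {"a": math.log(0.5), "b": math.log(0.3), "c": math.log(0.2)}

full = math.exp(log_posterior("a", all_hyps))                  # sum over every hypothesis
approx = math.exp(log_posterior("a", n_best(all_hyps, n=2)))   # sum over the 2-best list
```

In this toy case the truncated sum drops hypothesis "c", so the approximate posterior of the reference is larger than the exact one. A larger effective beam shrinks that gap, which is the quantity the abstract's recombination technique targets.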


Optimized Power Normalized Cepstral Coefficients towards Robust Deep Speaker Verification

Sep 24, 2021
Xuechen Liu, Md Sahidullah, Tomi Kinnunen

After their introduction to robust speech recognition, power normalized cepstral coefficient (PNCC) features were successfully adopted for other tasks, including speaker verification. However, as a feature extractor with long-term operations on the power spectrogram, its temporal processing and amplitude scaling steps dedicated to environmental compensation may be redundant. Further, they might suppress intrinsic speaker variations that are useful for speaker verification based on deep neural networks (DNN). Therefore, in this study, we revisit and optimize PNCCs by ablating the medium-time processor and by introducing channel energy normalization. Experimental results with a DNN-based speaker verification system indicate substantial improvement over baseline PNCCs in both in-domain and cross-domain scenarios, reflected by maximum relative reductions in equal error rate of 5.8% and 61.2% on VoxCeleb1 and VoxMovies, respectively.

* Accepted for publication at ASRU 2021 
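
For readers unfamiliar with the metric reported in the abstract above, here is a minimal sketch of an equal error rate (EER) computation and of a relative reduction between two EERs. The threshold sweep is one simple way to estimate the EER, not necessarily the procedure used in the paper, and the trial scores are invented.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER by sweeping thresholds over all observed scores.

    A trial is accepted when its score is >= the threshold. The EER is taken
    at the threshold where false acceptance and false rejection rates are closest.
    """
    targets = np.asarray(target_scores, dtype=float)
    nontargets = np.asarray(nontarget_scores, dtype=float)
    best_gap, eer = np.inf, 1.0
    for threshold in np.sort(np.concatenate([targets, nontargets])):
        frr = np.mean(targets < threshold)      # same-speaker trials rejected
        far = np.mean(nontargets >= threshold)  # different-speaker trials accepted
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return float(eer)

def relative_reduction(baseline, improved):
    """Relative reduction of an error rate with respect to a baseline."""
    return (baseline - improved) / baseline

# Hypothetical verification scores (higher means "same speaker").
eer = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
```

With these toy scores one target trial and one non-target trial are misclassified at the crossing threshold, giving an EER of 0.25.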


CAPE: Encoding Relative Positions with Continuous Augmented Positional Embeddings

Jun 06, 2021
Tatiana Likhomanenko, Qiantong Xu, Ronan Collobert, Gabriel Synnaeve, Alex Rogozhnikov

Without positional information, attention-based transformer neural networks are permutation-invariant. Absolute or relative positional embeddings are the most popular ways to feed transformer models positional information. Absolute positional embeddings are simple to implement, but suffer from generalization issues when evaluating on sequences of different length than those seen at training time. Relative positions are more robust to length change, but are more complex to implement and yield inferior model throughput. In this paper, we propose an augmentation-based approach (CAPE) for absolute positional embeddings, which keeps the advantages of both absolute (simplicity and speed) and relative position embeddings (better generalization). In addition, our empirical evaluation on state-of-the-art models in machine translation, image and speech recognition demonstrates that CAPE leads to better generalization performance as well as increased stability with respect to training hyper-parameters.
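
A rough sketch of what augmenting absolute positions could look like. The abstract does not spell out the augmentations, so the random global shift and global scaling below, the parameter names, and the use of standard sinusoidal embeddings are assumptions for illustration rather than the paper's exact recipe.

```python
import numpy as np

def sinusoidal_embedding(positions, dim):
    """Standard sinusoidal embedding of (possibly non-integer) positions; dim must be even."""
    pos = np.asarray(positions, dtype=float)[:, None]
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)[None, :]
    angles = pos * freqs
    emb = np.zeros((pos.shape[0], dim))
    emb[:, 0::2] = np.sin(angles)
    emb[:, 1::2] = np.cos(angles)
    return emb

def augment_positions(positions, rng, max_shift=5.0, max_log_scale=0.3, training=True):
    """Randomly shift and rescale positions during training; identity at evaluation.

    max_shift and max_log_scale are hypothetical hyper-parameters.
    """
    pos = np.asarray(positions, dtype=float)
    if not training:
        return pos
    shift = rng.uniform(-max_shift, max_shift)                   # one shift per sequence
    scale = np.exp(rng.uniform(-max_log_scale, max_log_scale))   # one scale per sequence
    return (pos + shift) * scale

rng = np.random.default_rng(0)
positions = np.arange(5)
train_pos = augment_positions(positions, rng, training=True)
eval_pos = augment_positions(positions, rng, training=False)
train_emb = sinusoidal_embedding(train_pos, dim=8)
```

Because one shift and one scale are applied to the whole sequence, the spacing between neighbouring positions stays uniform. The model therefore cannot rely on exact absolute values and is pushed toward relative position cues while still using a plain absolute embedding.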


Sentiment Classification in Swahili Language Using Multilingual BERT

Apr 19, 2021
Gati L. Martin, Medard E. Mswahili, Young-Seob Jeong

The evolution of the Internet has increased the amount of information that is expressed by people on different platforms. This information can be product reviews, discussions on forums, or social media platforms. The accessibility of these opinions and people's feelings opens the door to opinion mining and sentiment analysis. As language and speech technologies become more advanced, many languages have been used and the best models have been obtained. However, due to linguistic diversity and lack of datasets, African languages have been left behind. In this study, by using the current state-of-the-art model, multilingual BERT, we perform sentiment classification on Swahili datasets. The data was created by extracting and annotating 8.2k reviews and comments on different social media platforms and the ISEAR emotion dataset. The data was classified as either positive or negative. The model was fine-tuned and achieved a best accuracy of 87.59%.

* Accepted to African NLP Workshop, EACL 2021 (non-archival) 


Child-directed Listening: How Caregiver Inference Enables Children's Early Verbal Communication

Feb 09, 2021
Stephan C. Meylan, Ruthe Foushee, Elika Bergelson, Roger P. Levy

How do adults understand children's speech? Children's productions over the course of language development often bear little resemblance to typical adult pronunciations, yet caregivers nonetheless reliably recover meaning from them. Here, we employ a suite of Bayesian models of spoken word recognition to understand how adults overcome the noisiness of child language, showing that communicative success between children and adults relies heavily on adult inferential processes. By evaluating competing models on phonetically-annotated corpora, we show that adults' recovered meanings are best predicted by prior expectations fitted specifically to the child language environment, rather than to typical adult-adult language. After quantifying the contribution of this "child-directed listening" over developmental time, we discuss the consequences for theories of language acquisition, as well as the implications for commonly-used methods for assessing children's linguistic proficiency.

* 13 pages, 3 figures, 2 tables. Edit #1 fixes formatting on table 1 (fitting it onto a single page) and reports correct contents for table 1 (previous version reported nats, not bits) 
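
The Bayesian framing in the abstract above can be sketched as a posterior over candidate words given a noisy child production. This is a toy illustration, not the authors' models: the candidate words, prior values, and likelihood values are all invented, and the function name is hypothetical.

```python
def word_posterior(prior, likelihood):
    """Posterior over candidate words: p(word | production) is proportional to
    p(production | word) * p(word)."""
    unnormalized = {w: prior[w] * likelihood.get(w, 0.0) for w in prior}
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("no candidate word has nonzero probability")
    return {w: p / total for w, p in unnormalized.items()}

# Hypothetical likelihood of a noisy production under each intended word:
# the acoustics alone do not separate the two candidates.
likelihood = {"dog": 0.3, "go": 0.3}

# Hypothetical priors: one fitted to the child's language environment,
# one fitted to typical adult-adult language.
child_prior = {"dog": 0.7, "go": 0.3}
adult_prior = {"dog": 0.2, "go": 0.8}

child_post = word_posterior(child_prior, likelihood)
adult_post = word_posterior(adult_prior, likelihood)
```

When the acoustic evidence is uninformative, as in this toy case, the recovered word is determined entirely by the prior. That is why the choice between a child-specific and an adult-adult prior matters in the abstract's comparison.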
