This paper proposes a framework for modeling sound change that combines deep convolutional neural networks and iterative learning. Acquisition and transmission of speech across generations is modeled by training generations of Generative Adversarial Networks (Goodfellow et al. arXiv:1406.2661,Donahue et al. arXiv:1705.07904) on unannotated raw speech data. The paper argues that several properties of sound change emerge from the proposed architecture. Four generations of Generative Adversarial Networks were trained on an allophonic distribution in English where voiceless stops are aspirated word-initially before stressed vowels except if preceded by [s]. The first generation of networks is trained on the relevant sequences in human speech from the TIMIT database. The subsequent generations are not trained on TIMIT, but on generated outputs from the previous generation and thus start learning from each other in an iterative learning task. The initial allophonic distribution is progressively being lost with each generation, likely due to pressures from the global distribution of aspiration in the training data that resembles phonological pressures in natural language. The networks show signs of a gradual shift in phonetic targets characteristic of a gradual phonetic sound change. At endpoints, the networks' outputs superficially resemble a phonological change -- rule loss -- driven by imperfect learning. The model features signs of stability, one of the more challenging aspects of computational models of sound change. The results suggest that the proposed Generative Adversarial models of phonetic and phonological acquisition have the potential to yield new insights into the long-standing question of how to model language change.
In daily life, social interaction and acoustic communication often take place in complex acoustic environments (CAE) with a variety of interfering sounds and reverberation. For hearing research and evaluation of hearing systems simulated CAEs using virtual reality techniques have gained interest in the context of ecologically validity. In the current study, the effect of scene complexity and visual representation of the scene on psychoacoustic measures like sound source location, distance perception, loudness, speech intelligibility, and listening effort in a virtual audio-visual environment was investigated. A 3-dimensional, 86-channel loudspeaker array was used to render the sound field in combination with or without a head-mounted display (HMD) to create an immersive stereoscopic visual representation of the scene. The scene consisted of a ring of eight (virtual) loudspeakers which played a target speech stimulus and non-sense speech interferers in several spatial conditions. Either an anechoic (snowy outdoor scenery) or echoic environment (loft apartment) with a reverberation time (T60) of about 1.5 s was simulated. In addition to varying the number of interferers, scene complexity was varied by assessing the psychoacoustic measures in isolated consecutive measurements or simultaneously. Results showed no significant effect of wearing the HMD on the data. Loudness and distance perception showed significantly different results when they were measured simultaneously instead of consecutively in isolation. The advantage of the suggested setup is that it can be directly transferred to a corresponding real room, enabling a 1:1 comparison and verification of the perception experiments in the real and virtual environment.
Deep neural networks (DNNs) have achieved substantial predictive performance in various speech processing tasks. Particularly, it has been shown that a monaural speech separation task can be successfully solved with a DNN-based method called deep clustering (DC), which uses a DNN to describe the process of assigning a continuous vector to each time-frequency (TF) bin and measure how likely each pair of TF bins is to be dominated by the same speaker. In DC, the DNN is trained so that the embedding vectors for the TF bins dominated by the same speaker are forced to get close to each other. One concern regarding DC is that the embedding process described by a DNN has a black-box structure, which is usually very hard to interpret. The potential weakness owing to the non-interpretable black-box structure is that it lacks the flexibility of addressing the mismatch between training and test conditions (caused by reverberation, for instance). To overcome this limitation, in this paper, we propose the concept of explainable deep clustering (X-DC), whose network architecture can be interpreted as a process of fitting learnable spectrogram templates to an input spectrogram followed by Wiener filtering. During training, the elements of the spectrogram templates and their activations are constrained to be non-negative, which facilitates the sparsity of their values and thus improves interpretability. The main advantage of this framework is that it naturally allows us to incorporate a model adaptation mechanism into the network thanks to its physically interpretable structure. We experimentally show that the proposed X-DC enables us to visualize and understand the clues for the model to determine the embedding vectors while achieving speech separation performance comparable to that of the original DC models.
Public sources like parliament meeting recordings and transcripts provide ever-growing material for the training and evaluation of automatic speech recognition (ASR) systems. In this paper, we publish and analyse the Finnish parliament ASR corpus, the largest publicly available collection of manually transcribed speech data for Finnish with over 3000 hours of speech and 449 speakers for which it provides rich demographic metadata. This corpus builds on earlier initial work, and as a result the corpus has a natural split into two training subsets from two periods of time. Similarly, there are two official, corrected test sets covering different times, setting an ASR task with longitudinal distribution-shift characteristics. An official development set is also provided. We develop a complete Kaldi-based data preparation pipeline, and hidden Markov model (HMM), hybrid deep neural network (HMM-DNN) and attention-based encoder-decoder (AED) ASR recipes. We set benchmarks on the official test sets, as well as multiple other recently used test sets. Both temporal corpus subsets are already large, and we observe that beyond their scale, ASR performance on the official test sets plateaus, whereas other domains benefit from added data. The HMM-DNN and AED approaches are compared in a carefully matched equal data setting, with the HMM-DNN system consistently performing better. Finally, the variation of the ASR accuracy is compared between the speaker categories available in the parliament metadata to detect potential biases based on factors such as gender, age, and education.
The Bangla language is the seventh most spoken language, with 265 million native and non-native speakers worldwide. However, English is the predominant language for online resources and technical knowledge, journals, and documentation. Consequently, many Bangla-speaking people, who have limited command of English, face hurdles to utilize English resources. To bridge the gap between limited support and increasing demand, researchers conducted many experiments and developed valuable tools and techniques to create and process Bangla language materials. Many efforts are also ongoing to make it easy to use the Bangla language in the online and technical domains. There are some review papers to understand the past, previous, and future Bangla Natural Language Processing (BNLP) trends. The studies are mainly concentrated on the specific domains of BNLP, such as sentiment analysis, speech recognition, optical character recognition, and text summarization. There is an apparent scarcity of resources that contain a comprehensive study of the recent BNLP tools and methods. Therefore, in this paper, we present a thorough review of 71 BNLP research papers and categorize them into 11 categories, namely Information Extraction, Machine Translation, Named Entity Recognition, Parsing, Parts of Speech Tagging, Question Answering System, Sentiment Analysis, Spam and Fake Detection, Text Summarization, Word Sense Disambiguation, and Speech Processing and Recognition. We study articles published between 1999 to 2021, and 50% of the papers were published after 2015. We discuss Classical, Machine Learning and Deep Learning approaches with different datasets while addressing the limitations and current and future trends of the BNLP.
The average predictability (aka informativity) of a word in context has been shown to condition word duration (Seyfarth, 2014). All else being equal, words that tend to occur in more predictable environments are shorter than words that tend to occur in less predictable environments. One account of the informativity effect on duration is that the acoustic details of word reduction are stored as part of a word's representation. Other research has argued that predictability effects are tied to prosodic structure in integral ways. With the aim of assessing a potential prosodic basis for informativity effects in speech production, this study extends past work in two directions; it investigated informativity effects in another large language, Mandarin Chinese, and broadened the study beyond word duration to additional acoustic dimensions, pitch and intensity, known to index prosodic prominence. The acoustic information of content words was extracted from a large telephone conversation speech corpus with over 400,000 tokens and 6,000 word types spoken by 1,655 individuals and analyzed for the effect of informativity using frequency statistics estimated from a 431 million word subtitle corpus. Results indicated that words with low informativity have shorter durations, replicating the effect found in English. In addition, informativity had significant effects on maximum pitch and intensity, two phonetic dimensions related to prosodic prominence. Extending this interpretation, these results suggest that informativity is closely linked to prosodic prominence, and that lexical representation of a word includes phonetic details associated with its prosodic prominence. In other words, the lexicon absorbs prosodic influences on speech production.
Speech synthesis, voice cloning, and voice conversion techniques present severe privacy and security threats to users of voice user interfaces (VUIs). These techniques transform one or more elements of a speech signal, e.g., identity and emotion, while preserving linguistic information. Adversaries may use advanced transformation tools to trigger a spoofing attack using fraudulent biometrics for a legitimate speaker. Conversely, such techniques have been used to generate privacy-transformed speech by suppressing personally identifiable attributes in the voice signals, achieving anonymization. Prior works have studied the security and privacy vectors in parallel, and thus it raises alarm that if a benign user can achieve privacy by a transformation, it also means that a malicious user can break security by bypassing the anti-spoofing mechanism. In this paper, we take a step towards balancing two seemingly conflicting requirements: security and privacy. It remains unclear what the vulnerabilities in one domain imply for the other, and what dynamic interactions exist between them. A better understanding of these aspects is crucial for assessing and mitigating vulnerabilities inherent with VUIs and building effective defenses. In this paper,(i) we investigate the applicability of the current voice anonymization methods by deploying a tandem framework that jointly combines anti-spoofing and authentication models, and evaluate the performance of these methods;(ii) examining analytical and empirical evidence, we reveal a duality between the two mechanisms as they offer different ways to achieve the same objective, and we show that leveraging one vector significantly amplifies the effectiveness of the other;(iii) we demonstrate that to effectively defend from potential attacks against VUIs, it is necessary to investigate the attacks from multiple complementary perspectives(security and privacy).
Lip reading has received an increasing research interest in recent years due to the rapid development of deep learning and its widespread potential applications. One key point to obtain good performance for the lip reading task depends heavily on how effective the representation can be to capture the lip movement information and meanwhile to resist the noises resulted from the change of pose, lighting conditions, speaker's appearance and so on. Towards this target, we propose to introduce the mutual information constraints on both the local feature's level and the global sequence's level to enhance the relations of the features with the speech content. On the one hand, we constraint the features generated at each time step to enable them carry a strong relation with the speech content by imposing the local mutual information maximization constraint (LMIM), leading to improvements over the model's ability to discover fine-grained lip movements and the fine-grained differences among words with similar pronunciation, such as ``spend'' and ``spending''. On the other hand, we introduce the mutual information maximization constraint on the global sequence's level (GMIM), to make the model be able to pay more attention to discriminate key frames related with the speech content, and less to various noises appeared in the speaking process. By combining these two advantages together, the proposed method is expected to be both discriminative and robust for effective lip reading. To verify this method, we evaluate on two large-scale benchmark. We perform a detailed analysis and comparison on several aspects, including the comparison of the LMIM and GMIM with the baseline, the visualization of the learned representation and so on. The results not only prove the effectiveness of the proposed method but also report new state-of-the-art performance on both the two benchmarks.
Recently, neural networks based purely on self-attention, such as the Vision Transformer (ViT), have been shown to outperform deep learning models constructed with convolutional neural networks (CNNs) on various vision tasks, thus extending the success of Transformers, which were originally developed for language processing, to the vision domain. A recent study showed that a similar methodology can also be applied to the audio domain. Specifically, the Audio Spectrogram Transformer (AST) achieves state-of-the-art results on various audio classification benchmarks. However, pure Transformer models tend to require more training data compared to CNNs, and the success of the AST relies on supervised pretraining that requires a large amount of labeled data and a complex training pipeline, thus limiting the practical usage of AST. This paper focuses on audio and speech classification, and aims to alleviate the data requirement issues with the AST by leveraging self-supervised learning using unlabeled data. Specifically, we propose to pretrain the AST model with joint discriminative and generative masked spectrogram patch modeling (MSPM) using unlabeled audio from AudioSet and Librispeech. We evaluate our pretrained models on both audio and speech classification tasks including audio event classification, keyword spotting, emotion recognition, and speaker identification. The proposed self-supervised framework significantly boosts AST performance on all tasks, with an average improvement of 60.9%, leading to similar or even better results than a supervised pretrained AST. To the best of our knowledge, it is the first patch-based self-supervised learning framework in the audio and speech domain, and also the first self-supervised learning framework for AST.