Adding an emotions using prosody manipulation method for Indonesian text to speech system. Text To Speech (TTS) is a system that can convert text in one language into speech, accordance with the reading of the text in the language used. The focus of this research is a natural sounding concept, the make "humanize" for the pronunciation of voice synthesis system Text To Speech. Humans have emotions / intonation that may affect the sound produced. The main requirement for the system used Text To Speech in this research is eSpeak, the database MBROLA using id1, Human Speech Corpus database from a website that summarizes the words with the highest frequency (Most Common Words) used in a country. And there are 3 types of emotional / intonation designed base. There is a happy, angry and sad emotion. Method for develop the emotional filter is manipulate the relevant features of prosody (especially pitch and duration value) using a predetermined rate factor that has been established by analyzing the differences between the standard output Text To Speech and voice recording with emotional prosody / a particular intonation. The test results for the perception tests of Human Speech Corpus for happy emotion is 95 %, 96.25 % for angry emotion and 98.75 % for sad emotions. For perception test system carried by intelligibility and naturalness test. Intelligibility test for the accuracy of sound with the original sentence is 93.3%, and for clarity rate for each sentence is 62.8%. For naturalness, accuracy emotional election amounted to 75.6 % for happy emotion, 73.3 % for angry emotion, and 60 % for sad emotions. ----- Text To Speech (TTS) merupakan suatu sistem yang dapat mengonversi teks dalam format suatu bahasa menjadi ucapan sesuai dengan pembacaan teks dalam bahasa yang digunakan.
This work presents a text mining context and its use for a deep analysis of the messages delivered by the politicians. Specifically, we deal with an expert systems-based exploration of the rhetoric dynamics of a large collection of US Presidents' speeches, ranging from Washington to Trump. In particular, speeches are viewed as complex expert systems whose structures can be effectively analyzed through rank-size laws. The methodological contribution of the paper is twofold. First, we develop a text mining-based procedure for the construction of the dataset by using a web scraping routine on the Miller Center website -- the repository collecting the speeches. Second, we explore the implicit structure of the discourse data by implementing a rank-size procedure over the individual speeches, being the words of each speech ranked in terms of their frequencies. The scientific significance of the proposed combination of text-mining and rank-size approaches can be found in its flexibility and generality, which let it be reproducible to a wide set of expert systems and text mining contexts. The usefulness of the proposed method and the speech subsequent analysis is demonstrated by the findings themselves. Indeed, in terms of impact, it is worth noting that interesting conclusions of social, political and linguistic nature on how 45 United States Presidents, from April 30, 1789 till February 28, 2017 delivered political messages can be carried out. Indeed, the proposed analysis shows some remarkable regularities, not only inside a given speech, but also among different speeches. Moreover, under a purely methodological perspective, the presented contribution suggests possible ways of generating a linguistic decision-making algorithm.
Automatic speech recognition enables a wide range of current and emerging applications such as automatic transcription, multimedia content analysis, and natural human-computer interfaces. This paper provides a glimpse of the opportunities and challenges that parallelism provides for automatic speech recognition and related application research from the point of view of speech researchers. The increasing parallelism in computing platforms opens three major possibilities for speech recognition systems: improving recognition accuracy in non-ideal, everyday noisy environments; increasing recognition throughput in batch processing of speech data; and reducing recognition latency in realtime usage scenarios. This paper describes technical challenges, approaches taken, and possible directions for future research to guide the design of efficient parallel software and hardware infrastructures.
Speech data is notoriously difficult to work with due to a variety of codecs, lengths of recordings, and meta-data formats. We present Lhotse, a speech data representation library that draws upon lessons learned from Kaldi speech recognition toolkit and brings its concepts into the modern deep learning ecosystem. Lhotse provides a common JSON description format with corresponding Python classes and data preparation recipes for over 30 popular speech corpora. Various datasets can be easily combined together and re-purposed for different tasks. The library handles multi-channel recordings, long recordings, local and cloud storage, lazy and on-the-fly operations amongst other features. We introduce Cut and CutSet concepts, which simplify common data wrangling tasks for audio and help incorporate acoustic context of speech utterances. Finally, we show how Lhotse leverages PyTorch data API abstractions and adopts them to handle speech data for deep learning.
When documenting oral-languages, Unsupervised Word Segmentation (UWS) from speech is a useful, yet challenging, task. It can be performed from phonetic transcriptions, or in the absence of these, from the output of unsupervised speech discretization models. These discretization models are trained using raw speech only, producing discrete speech units which can be applied for downstream (text-based) tasks. In this paper we compare five of these models: three Bayesian and two neural approaches, with regards to the exploitability of the produced units for UWS. Two UWS models are experimented with and we report results for Finnish, Hungarian, Mboshi, Romanian and Russian in a low-resource setting (using only 5k sentences). Our results suggest that neural models for speech discretization are difficult to exploit in our setting, and that it might be necessary to adapt them to limit sequence length. We obtain our best UWS results by using the SHMM and H-SHMM Bayesian models, which produce high quality, yet compressed, discrete representations of the input speech signal.
Cochlear implant (CI) users have considerable difficulty in understanding speech in reverberant listening environments. Time-frequency (T-F) masking is a common technique that aims to improve speech intelligibility by multiplying reverberant speech by a matrix of gain values to suppress T-F bins dominated by reverberation. Recently proposed mask estimation algorithms leverage machine learning approaches to distinguish between target speech and reverberant reflections. However, the spectro-temporal structure of speech is highly variable and dependent on the underlying phoneme. One way to potentially overcome this variability is to leverage explicit knowledge of phonemic information during mask estimation. This study proposes a phoneme-based mask estimation algorithm, where separate mask estimation models are trained for each phoneme. Sentence recognition tests were conducted in normal hearing listeners to determine whether a phoneme-based mask estimation algorithm is beneficial in the ideal scenario where perfect knowledge of the phoneme is available. The results showed that the phoneme-based masks improved the intelligibility of vocoded speech when compared to conventional phoneme-independent masks. The results suggest that a phoneme-based speech enhancement strategy may potentially benefit CI users in reverberant listening environments.
Deep learning based speech enhancement has made rapid development towards improving quality, while models are becoming more compact and usable for real-time on-the-edge inference. However, the speech quality scales directly with the model size, and small models are often still unable to achieve sufficient quality. Furthermore, the introduced speech distortion and artifacts greatly harm speech quality and intelligibility, and often significantly degrade automatic speech recognition (ASR) rates. In this work, we shed light on the success of the spectral complex compressed mean squared error (MSE) loss, and how its magnitude and phase-aware terms are related to the speech distortion vs. noise reduction trade off. We further investigate integrating pre-trained reference-less predictors for mean opinion score (MOS) and word error rate (WER), and pre-trained embeddings on ASR and sound event detection. Our analyses reveal that none of the pre-trained networks added significant performance over the strong spectral loss.
Speech is one of the most effective means of communication and is full of information that helps the transmission of utterer's thoughts. However, mainly due to the cumbersome processing of acoustic features, phoneme or word posterior probability has frequently been discarded in understanding the natural language. Thus, some recent spoken language understanding (SLU) modules have utilized an end-to-end structure that preserves the uncertainty information. This further reduces the propagation of speech recognition error and guarantees computational efficiency. We claim that in this process, the speech comprehension can benefit from the inference of massive pre-trained language models (LMs). We transfer the knowledge from a concrete Transformer-based text LM to an SLU module which can face a data shortage, based on recent cross-modal distillation methodologies. We demonstrate the validity of our proposal upon the performance on the Fluent Speech Command dataset. Thereby, we experimentally verify our hypothesis that the knowledge could be shared from the top layer of the LM to a fully speech-based module, in which the abstracted speech is expected to meet the semantic representation.
A new tightly coupled speech and natural language integration model is presented for a TDNN-based large vocabulary continuous speech recognition system. Unlike the popular n-best techniques developed for integrating mainly HMM-based speech and natural language systems in word level, which is obviously inadequate for the morphologically complex agglutinative languages, our model constructs a spoken language system based on the phoneme-level integration. The TDNN-CYK spoken language architecture is designed and implemented using the TDNN-based diphone recognition module integrated with the table-driven phonological/morphological co-analysis. Our integration model provides a seamless integration of speech and natural language for connectionist speech recognition systems especially for morphologically complex languages such as Korean. Our experiment results show that the speaker-dependent continuous Eojeol (word) recognition can be integrated with the morphological analysis with over 80\% morphological analysis success rate directly from the speech input for the middle-level vocabularies.
The lack of speech data annotated with labels required for spoken language understanding (SLU) is often a major hurdle in building end-to-end (E2E) systems that can directly process speech inputs. In contrast, large amounts of text data with suitable labels are usually available. In this paper, we propose a novel text representation and training methodology that allows E2E SLU systems to be effectively constructed using these text resources. With very limited amounts of additional speech, we show that these models can be further improved to perform at levels close to similar systems built on the full speech datasets. The efficacy of our proposed approach is demonstrated on both intent and entity tasks using three different SLU datasets. With text-only training, the proposed system achieves up to 90% of the performance possible with full speech training. With just an additional 10% of speech data, these models significantly improve further to 97% of full performance.