Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"speech": models, code, and papers

Penambahan emosi menggunakan metode manipulasi prosodi untuk sistem text to speech bahasa Indonesia

Jun 29, 2016
Salita Ulitia Prini, Ary Setijadi Prihatmanto

Adding an emotions using prosody manipulation method for Indonesian text to speech system. Text To Speech (TTS) is a system that can convert text in one language into speech, accordance with the reading of the text in the language used. The focus of this research is a natural sounding concept, the make "humanize" for the pronunciation of voice synthesis system Text To Speech. Humans have emotions / intonation that may affect the sound produced. The main requirement for the system used Text To Speech in this research is eSpeak, the database MBROLA using id1, Human Speech Corpus database from a website that summarizes the words with the highest frequency (Most Common Words) used in a country. And there are 3 types of emotional / intonation designed base. There is a happy, angry and sad emotion. Method for develop the emotional filter is manipulate the relevant features of prosody (especially pitch and duration value) using a predetermined rate factor that has been established by analyzing the differences between the standard output Text To Speech and voice recording with emotional prosody / a particular intonation. The test results for the perception tests of Human Speech Corpus for happy emotion is 95 %, 96.25 % for angry emotion and 98.75 % for sad emotions. For perception test system carried by intelligibility and naturalness test. Intelligibility test for the accuracy of sound with the original sentence is 93.3%, and for clarity rate for each sentence is 62.8%. For naturalness, accuracy emotional election amounted to 75.6 % for happy emotion, 73.3 % for angry emotion, and 60 % for sad emotions. ----- Text To Speech (TTS) merupakan suatu sistem yang dapat mengonversi teks dalam format suatu bahasa menjadi ucapan sesuai dengan pembacaan teks dalam bahasa yang digunakan.

* Keywords: Text To Speech; eSpeak; MBROLA; Human Speech Corpus; emotion; intonation; prosody manipulation; emosi; intonasi; manipulasi prosodi 

  Access Paper or Ask Questions

A joint text mining-rank size investigation of the rhetoric structures of the US Presidents' speeches

May 09, 2019
Valerio Ficcadenti, Roy Cerqueti, Marcel Ausloos

This work presents a text mining context and its use for a deep analysis of the messages delivered by the politicians. Specifically, we deal with an expert systems-based exploration of the rhetoric dynamics of a large collection of US Presidents' speeches, ranging from Washington to Trump. In particular, speeches are viewed as complex expert systems whose structures can be effectively analyzed through rank-size laws. The methodological contribution of the paper is twofold. First, we develop a text mining-based procedure for the construction of the dataset by using a web scraping routine on the Miller Center website -- the repository collecting the speeches. Second, we explore the implicit structure of the discourse data by implementing a rank-size procedure over the individual speeches, being the words of each speech ranked in terms of their frequencies. The scientific significance of the proposed combination of text-mining and rank-size approaches can be found in its flexibility and generality, which let it be reproducible to a wide set of expert systems and text mining contexts. The usefulness of the proposed method and the speech subsequent analysis is demonstrated by the findings themselves. Indeed, in terms of impact, it is worth noting that interesting conclusions of social, political and linguistic nature on how 45 United States Presidents, from April 30, 1789 till February 28, 2017 delivered political messages can be carried out. Indeed, the proposed analysis shows some remarkable regularities, not only inside a given speech, but also among different speeches. Moreover, under a purely methodological perspective, the presented contribution suggests possible ways of generating a linguistic decision-making algorithm.

* Expert Systems With Applications 123 (2019) 127-142 
* prepared for Expert Systems With Applications; 24 figures; 7 tables; 89 references 

  Access Paper or Ask Questions

Opportunities & Challenges In Automatic Speech Recognition

May 09, 2013
Rashmi Makhijani, Urmila Shrawankar, V M Thakare

Automatic speech recognition enables a wide range of current and emerging applications such as automatic transcription, multimedia content analysis, and natural human-computer interfaces. This paper provides a glimpse of the opportunities and challenges that parallelism provides for automatic speech recognition and related application research from the point of view of speech researchers. The increasing parallelism in computing platforms opens three major possibilities for speech recognition systems: improving recognition accuracy in non-ideal, everyday noisy environments; increasing recognition throughput in batch processing of speech data; and reducing recognition latency in realtime usage scenarios. This paper describes technical challenges, approaches taken, and possible directions for future research to guide the design of efficient parallel software and hardware infrastructures.

* Pages: 05 Figures : 01 Proceedings of the International Conference BEATS 2010, NIT Jalandhar, INDIA 

  Access Paper or Ask Questions

Lhotse: a speech data representation library for the modern deep learning ecosystem

Oct 25, 2021
Piotr Żelasko, Daniel Povey, Jan "Yenda" Trmal, Sanjeev Khudanpur

Speech data is notoriously difficult to work with due to a variety of codecs, lengths of recordings, and meta-data formats. We present Lhotse, a speech data representation library that draws upon lessons learned from Kaldi speech recognition toolkit and brings its concepts into the modern deep learning ecosystem. Lhotse provides a common JSON description format with corresponding Python classes and data preparation recipes for over 30 popular speech corpora. Various datasets can be easily combined together and re-purposed for different tasks. The library handles multi-channel recordings, long recordings, local and cloud storage, lazy and on-the-fly operations amongst other features. We introduce Cut and CutSet concepts, which simplify common data wrangling tasks for audio and help incorporate acoustic context of speech utterances. Finally, we show how Lhotse leverages PyTorch data API abstractions and adopts them to handle speech data for deep learning.

* Accepted for presentation at NeurIPS 2021 Data-Centric AI (DCAI) Workshop 

  Access Paper or Ask Questions

Unsupervised Word Segmentation from Discrete Speech Units in Low-Resource Settings

Jun 08, 2021
Marcely Zanon Boito, Bolaji Yusuf, Lucas Ondel, Aline Villavicencio, Laurent Besacier

When documenting oral-languages, Unsupervised Word Segmentation (UWS) from speech is a useful, yet challenging, task. It can be performed from phonetic transcriptions, or in the absence of these, from the output of unsupervised speech discretization models. These discretization models are trained using raw speech only, producing discrete speech units which can be applied for downstream (text-based) tasks. In this paper we compare five of these models: three Bayesian and two neural approaches, with regards to the exploitability of the produced units for UWS. Two UWS models are experimented with and we report results for Finnish, Hungarian, Mboshi, Romanian and Russian in a low-resource setting (using only 5k sentences). Our results suggest that neural models for speech discretization are difficult to exploit in our setting, and that it might be necessary to adapt them to limit sequence length. We obtain our best UWS results by using the SHMM and H-SHMM Bayesian models, which produce high quality, yet compressed, discrete representations of the input speech signal.

  Access Paper or Ask Questions

Phoneme-Based Ratio Mask Estimation for Reverberant Speech Enhancement in Cochlear Implant Processors

May 28, 2021
Kevin M. Chu, Leslie M. Collins, Boyla O. Mainsah

Cochlear implant (CI) users have considerable difficulty in understanding speech in reverberant listening environments. Time-frequency (T-F) masking is a common technique that aims to improve speech intelligibility by multiplying reverberant speech by a matrix of gain values to suppress T-F bins dominated by reverberation. Recently proposed mask estimation algorithms leverage machine learning approaches to distinguish between target speech and reverberant reflections. However, the spectro-temporal structure of speech is highly variable and dependent on the underlying phoneme. One way to potentially overcome this variability is to leverage explicit knowledge of phonemic information during mask estimation. This study proposes a phoneme-based mask estimation algorithm, where separate mask estimation models are trained for each phoneme. Sentence recognition tests were conducted in normal hearing listeners to determine whether a phoneme-based mask estimation algorithm is beneficial in the ideal scenario where perfect knowledge of the phoneme is available. The results showed that the phoneme-based masks improved the intelligibility of vocoded speech when compared to conventional phoneme-independent masks. The results suggest that a phoneme-based speech enhancement strategy may potentially benefit CI users in reverberant listening environments.

  Access Paper or Ask Questions

Effect of noise suppression losses on speech distortion and ASR performance

Nov 23, 2021
Sebastian Braun, Hannes Gamper

Deep learning based speech enhancement has made rapid development towards improving quality, while models are becoming more compact and usable for real-time on-the-edge inference. However, the speech quality scales directly with the model size, and small models are often still unable to achieve sufficient quality. Furthermore, the introduced speech distortion and artifacts greatly harm speech quality and intelligibility, and often significantly degrade automatic speech recognition (ASR) rates. In this work, we shed light on the success of the spectral complex compressed mean squared error (MSE) loss, and how its magnitude and phase-aware terms are related to the speech distortion vs. noise reduction trade off. We further investigate integrating pre-trained reference-less predictors for mean opinion score (MOS) and word error rate (WER), and pre-trained embeddings on ASR and sound event detection. Our analyses reveal that none of the pre-trained networks added significant performance over the strong spectral loss.

* submitted to ICASSP 2022 

  Access Paper or Ask Questions

Speech to Text Adaptation: Towards an Efficient Cross-Modal Distillation

May 17, 2020
Won Ik Cho, Donghyun Kwak, Jiwon Yoon, Nam Soo Kim

Speech is one of the most effective means of communication and is full of information that helps the transmission of utterer's thoughts. However, mainly due to the cumbersome processing of acoustic features, phoneme or word posterior probability has frequently been discarded in understanding the natural language. Thus, some recent spoken language understanding (SLU) modules have utilized an end-to-end structure that preserves the uncertainty information. This further reduces the propagation of speech recognition error and guarantees computational efficiency. We claim that in this process, the speech comprehension can benefit from the inference of massive pre-trained language models (LMs). We transfer the knowledge from a concrete Transformer-based text LM to an SLU module which can face a data shortage, based on recent cross-modal distillation methodologies. We demonstrate the validity of our proposal upon the performance on the Fluent Speech Command dataset. Thereby, we experimentally verify our hypothesis that the knowledge could be shared from the top layer of the LM to a fully speech-based module, in which the abstracted speech is expected to meet the semantic representation.

* Preprint; 5 pages, 1 figure, 4 tables 

  Access Paper or Ask Questions

Phoneme-level speech and natural language intergration for agglutinative languages

Nov 05, 1994
Geunbae Lee Jong-Hyeok Lee Kyunghee Kim

A new tightly coupled speech and natural language integration model is presented for a TDNN-based large vocabulary continuous speech recognition system. Unlike the popular n-best techniques developed for integrating mainly HMM-based speech and natural language systems in word level, which is obviously inadequate for the morphologically complex agglutinative languages, our model constructs a spoken language system based on the phoneme-level integration. The TDNN-CYK spoken language architecture is designed and implemented using the TDNN-based diphone recognition module integrated with the table-driven phonological/morphological co-analysis. Our integration model provides a seamless integration of speech and natural language for connectionist speech recognition systems especially for morphologically complex languages such as Korean. Our experiment results show that the speaker-dependent continuous Eojeol (word) recognition can be integrated with the morphological analysis with over 80\% morphological analysis success rate directly from the speech input for the middle-level vocabularies.

* 12 pages, Latex Postscript, compressed, uuencoded, will be presented in TWLT-8 

  Access Paper or Ask Questions

Towards Reducing the Need for Speech Training Data To Build Spoken Language Understanding Systems

Feb 26, 2022
Samuel Thomas, Hong-Kwang J. Kuo, Brian Kingsbury, George Saon

The lack of speech data annotated with labels required for spoken language understanding (SLU) is often a major hurdle in building end-to-end (E2E) systems that can directly process speech inputs. In contrast, large amounts of text data with suitable labels are usually available. In this paper, we propose a novel text representation and training methodology that allows E2E SLU systems to be effectively constructed using these text resources. With very limited amounts of additional speech, we show that these models can be further improved to perform at levels close to similar systems built on the full speech datasets. The efficacy of our proposed approach is demonstrated on both intent and entity tasks using three different SLU datasets. With text-only training, the proposed system achieves up to 90% of the performance possible with full speech training. With just an additional 10% of speech data, these models significantly improve further to 97% of full performance.

* \c{opyright}2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. arXiv admin note: text overlap with arXiv:2202.13155 

  Access Paper or Ask Questions