
"speech": models, code, and papers

Speaker attribution with voice profiles by graph-based semi-supervised learning

Feb 06, 2021
Jixuan Wang, Xiong Xiao, Jian Wu, Ranjani Ramamurthy, Frank Rudzicz, Michael Brudno

Speaker attribution is required in many real-world applications, such as meeting transcription, where a speaker identity is assigned to each utterance according to speaker voice profiles. In this paper, we propose to solve the speaker attribution problem with graph-based semi-supervised learning methods. A graph of speech segments is built for each session, in which segments from voice profiles are represented by labeled nodes and segments from test utterances by unlabeled nodes. Edge weights are given by the similarities between the pretrained speaker embeddings of the speech segments. Speaker attribution then becomes a semi-supervised learning problem on graphs, to which two graph-based methods are applied: label propagation (LP) and graph neural networks (GNNs). The proposed approaches are able to exploit the structural information of the graph to improve speaker attribution performance. Experimental results on real meeting data show that the graph-based approaches reduce speaker attribution error by up to 68% compared to a baseline speaker identification approach that processes each utterance independently.

* Interspeech 2020 
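The label propagation step can be sketched in a few lines: profile segments act as labeled nodes, cosine similarities between speaker embeddings form the edge weights, and labels diffuse to the unlabeled test segments. The toy embeddings, the `alpha` smoothing factor, and the iteration count below are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def label_propagation(embeddings, labels, n_classes, alpha=0.9, n_iter=50):
    """Propagate speaker labels over a similarity graph of speech segments.

    embeddings : (n, d) array of speaker embeddings (profile + test segments)
    labels     : length-n list, speaker index for profile nodes, -1 for test nodes
    """
    # Edge weights: cosine similarity between segment embeddings (negatives clipped).
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    W = np.clip(X @ X.T, 0.0, None)
    np.fill_diagonal(W, 0.0)
    # Symmetric normalisation S = D^{-1/2} W D^{-1/2}.
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    S = D_inv_sqrt @ W @ D_inv_sqrt
    # One-hot seed matrix: labeled rows fixed, unlabeled rows zero.
    Y = np.zeros((len(labels), n_classes))
    for i, y in enumerate(labels):
        if y >= 0:
            Y[i, y] = 1.0
    # Iterate F <- alpha * S @ F + (1 - alpha) * Y, then read off the argmax.
    F = Y.copy()
    for _ in range(n_iter):
        F = alpha * S @ F + (1 - alpha) * Y
    return F.argmax(axis=1)

# Toy session: two profile segments per speaker plus two unlabeled test segments.
rng = np.random.default_rng(0)
spk_a, spk_b = rng.normal(size=8), rng.normal(size=8)
emb = np.stack([spk_a, spk_a + 0.05, spk_b, spk_b + 0.05,
                spk_a + 0.1, spk_b + 0.1])
pred = label_propagation(emb, [0, 0, 1, 1, -1, -1], n_classes=2)
```

The GNN variant in the paper replaces this closed-form diffusion with learned message passing, but operates on the same labeled/unlabeled graph.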

Acoustics Based Intent Recognition Using Discovered Phonetic Units for Low Resource Languages

Nov 07, 2020
Akshat Gupta, Xinjian Li, Sai Krishna Rallabandi, Alan W Black

With recent advancements in language technologies, humans are now interacting with technology through speech. To increase the reach of these technologies, we need to build such systems in local languages. A major bottleneck here are the underlying data-intensive parts that make up such systems, including automatic speech recognition (ASR) systems that require large amounts of labelled data. With the aim of aiding development of dialog systems in low resourced languages, we propose a novel acoustics based intent recognition system that uses discovered phonetic units for intent classification. The system is made up of two blocks - the first block generates a transcript of discovered phonetic units for the input audio, and the second block performs intent classification from the generated phonemic transcripts. Our work presents results for such a system for two language families - Indic languages and Romance languages - for two different intent recognition tasks. We also perform multilingual training of our intent classifier and show improved cross-lingual transfer and performance on an unknown language with zero resources in the same language family.
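As an illustration of the two-block idea, the sketch below classifies intents from discovered-unit transcripts using a toy bag-of-bigrams nearest-centroid scheme. The unit names (`p1`, `p2`, ...) and intent labels are hypothetical, and the paper's actual classifier will differ; only the pipeline shape (unit transcript in, intent out) is taken from the abstract.

```python
from collections import Counter

def phone_ngrams(units, n=2):
    """Bag of phonetic-unit n-grams from a discovered-unit transcript."""
    toks = units.split()
    return Counter(zip(*(toks[i:] for i in range(n))))

def train_centroids(transcripts, intents):
    """Sum n-gram counts per intent (a minimal nearest-centroid classifier)."""
    centroids = {}
    for t, y in zip(transcripts, intents):
        centroids.setdefault(y, Counter()).update(phone_ngrams(t))
    return centroids

def classify(transcript, centroids):
    """Pick the intent whose centroid shares the most n-gram mass."""
    feats = phone_ngrams(transcript)
    def overlap(c):
        return sum(min(feats[g], c[g]) for g in feats)
    return max(centroids, key=lambda y: overlap(centroids[y]))

# Hypothetical discovered-unit transcripts for two intents.
train = ["p1 p4 p2 p7", "p1 p4 p2 p9", "p3 p8 p5 p6", "p3 p8 p5 p2"]
labels = ["play_music", "play_music", "set_alarm", "set_alarm"]
cents = train_centroids(train, labels)
pred = classify("p1 p4 p2 p6", cents)
```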

Warped Language Models for Noise Robust Language Understanding

Nov 03, 2020
Mahdi Namazifar, Gokhan Tur, Dilek Hakkani Tür

Masked Language Models (MLMs) are self-supervised neural networks trained to fill in the blanks in a given sentence with masked tokens. Despite the tremendous success of MLMs on various text-based tasks, they are not robust for spoken language understanding, especially against the noise introduced by automatic recognition of spontaneous conversational speech. In this work we introduce Warped Language Models (WLMs), in which input sentences at training time go through the same modifications as in MLMs, plus two additional modifications, namely inserting and dropping random tokens. These two modifications extend and contract the sentence, hence the word "warped" in the name. The insertions and drops applied to the input text during WLM training resemble the kinds of noise caused by Automatic Speech Recognition (ASR) errors, and as a result WLMs are likely to be more robust to ASR noise. Through computational results we show that natural language understanding systems built on top of WLMs perform better than those built on MLMs, especially in the presence of ASR errors.

* To appear at IEEE SLT 2021 
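The warping operations can be sketched directly: alongside MLM-style masking, random tokens are inserted and dropped. The probabilities and the filler vocabulary below are illustrative assumptions, not the paper's settings.

```python
import random

MASK = "[MASK]"

def warp(tokens, vocab, p_mask=0.15, p_insert=0.1, p_drop=0.1, seed=None):
    """Apply WLM-style warping: MLM masking plus random insertions and drops.

    Drops contract the sentence and insertions extend it, mimicking
    ASR-style deletion/insertion noise.
    """
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        r = rng.random()
        if r < p_drop:
            continue                       # drop: token disappears
        if r < p_drop + p_insert:
            out.append(rng.choice(vocab))  # insert a random token before tok
        out.append(MASK if rng.random() < p_mask else tok)
    return out

sent = "the cat sat on the mat".split()
warped = warp(sent, vocab=["a", "uh", "um"], seed=3)
```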

Improving Tail Performance of a Deliberation E2E ASR Model Using a Large Text Corpus

Aug 25, 2020
Cal Peyser, Sepand Mavandadi, Tara N. Sainath, James Apfel, Ruoming Pang, Shankar Kumar

End-to-end (E2E) automatic speech recognition (ASR) systems lack the distinct language model (LM) component that characterizes traditional speech systems. While this simplifies the model architecture, it complicates the task of incorporating text-only data into training, which is important to the recognition of tail words that do not occur often in audio-text pairs. While shallow fusion has been proposed as a method for incorporating a pre-trained LM into an E2E model at inference time, it has not yet been explored for very large text corpora, and it has been shown to be very sensitive to hyperparameter settings in the beam search. In this work, we apply shallow fusion to incorporate a very large text corpus into a state-of-the-art E2E ASR model. We explore the impact of model size and show that intelligent pruning of the training set can be more effective than increasing the parameter count. Additionally, we show that incorporating the LM in minimum word error rate (MWER) fine-tuning makes shallow fusion far less dependent on optimal hyperparameter settings, reducing the difficulty of that tuning problem.
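Shallow fusion reduces to a per-step log-linear interpolation inside beam search: the E2E model's next-token log-probability is combined with the external LM's, weighted by a fusion weight. The token distributions and weight below are made-up numbers, chosen so that the LM rescues a tail word.

```python
import math

def shallow_fusion_step(e2e_logprobs, lm_logprobs, lam=0.3):
    """Rescore one beam-search step: s(y) = log P_e2e(y) + lam * log P_lm(y)."""
    return {y: e2e_logprobs[y] + lam * lm_logprobs[y] for y in e2e_logprobs}

# Hypothetical next-token distributions: the E2E model prefers the common
# word, but the text-trained LM knows the tail word is more plausible here.
e2e = {"nickel": math.log(0.4), "nicole": math.log(0.6)}
lm  = {"nickel": math.log(0.7), "nicole": math.log(0.1)}
fused = shallow_fusion_step(e2e, lm, lam=0.5)
best = max(fused, key=fused.get)
```

The paper's observation is that the choice of `lam` matters much less once the LM is also present during MWER fine-tuning.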

Investigation of Speaker-adaptation methods in Transformer based ASR

Aug 07, 2020
Vishwas M. Shetty, Metilda Sagaya Mary N J, S. Umesh

End-to-end models are fast replacing conventional hybrid models in automatic speech recognition. The transformer is a sequence-to-sequence framework based solely on attention that was initially applied to the machine translation task. This end-to-end framework has been shown to give promising results when used for automatic speech recognition as well. In this paper, we explore different ways of incorporating speaker information while training a transformer-based model to improve its performance. We present speaker information in the form of speaker embeddings for each of the speakers. Two broad categories of speaker embeddings are used: (i) fixed embeddings, and (ii) learned embeddings. We experiment with speaker embeddings learned along with the model training, as well as with one-hot vectors and x-vectors. Using these different speaker embeddings, we obtain an average relative improvement of 1% to 3% in the token error rate. We report results on the NPTEL lecture database. NPTEL is an open-source e-learning portal providing content from top Indian universities.

* 5 pages, 6 figures 
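One common way of presenting a speaker embedding (one-hot, x-vector, or learned) to an encoder is to concatenate it to every acoustic frame; whether the paper uses exactly this injection point is an assumption of this sketch, and the dimensions below are illustrative.

```python
import numpy as np

def append_speaker_embedding(features, spk_emb):
    """Concatenate a per-speaker embedding to every acoustic frame.

    features : (T, F) frame-level features fed to the transformer encoder
    spk_emb  : (E,) fixed (e.g. one-hot / x-vector) or learned embedding
    """
    tiled = np.tile(spk_emb, (features.shape[0], 1))
    return np.concatenate([features, tiled], axis=1)

frames = np.random.randn(100, 80)   # 100 frames of 80-dim filterbanks
one_hot = np.eye(4)[2]              # one-hot vector for speaker id 2 of 4
augmented = append_speaker_embedding(frames, one_hot)
```

For learned embeddings, `spk_emb` would be a trainable row of an embedding table updated jointly with the acoustic model rather than a fixed vector.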

Query-by-example on-device keyword spotting

Oct 22, 2019
Byeonggeun Kim, Mingu Lee, Jinkyu Lee, Yeonseok Kim, Kyuwoong Hwang

A keyword spotting (KWS) system determines the existence of a, usually predefined, keyword in a continuous speech stream. This paper presents a query-by-example on-device KWS system which is user-specific. The proposed system consists of two main steps: query enrollment and testing. In the query enrollment step, phonetic posteriors are output by a small-footprint automatic speech recognition model based on connectionist temporal classification. Using the phonetic-level posteriorgram, a finite-state transducer (FST) hypothesis graph is built, so any keyword can be enrolled, avoiding the out-of-vocabulary problem. In testing, a log-likelihood is scored for the input audio using the FST. We propose a threshold prediction method that uses only the user-specific keyword hypotheses. The system generates query-specific negatives by rearranging each query utterance in waveform. The threshold is decided based on the enrollment queries and the generated negatives. We tested two keywords in English, and the proposed work shows promising performance while preserving simplicity.

* Will appear in IEEE ASRU 2019 
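The negative-generation and threshold idea can be sketched as follows: chunk the enrollment waveform, shuffle the chunks to make query-specific negatives, then place the threshold between the enrollment scores and the negatives' scores. The chunk count, number of negatives, and midpoint rule are assumptions of this sketch, not the paper's exact recipe.

```python
import numpy as np

def generate_negatives(query, n_seg=4, n_neg=5, seed=0):
    """Create query-specific negatives by shuffling fixed-length chunks
    of the enrollment waveform (the rearrangement idea from the paper)."""
    rng = np.random.default_rng(seed)
    segs = np.array_split(query, n_seg)
    negatives = []
    for _ in range(n_neg):
        order = rng.permutation(n_seg)
        negatives.append(np.concatenate([segs[i] for i in order]))
    return negatives

def predict_threshold(pos_scores, neg_scores):
    """Place the decision threshold midway between the worst enrollment
    log-likelihood and the best generated-negative log-likelihood."""
    return 0.5 * (min(pos_scores) + max(neg_scores))

wave = np.arange(16)   # stand-in for audio samples of one enrollment query
negs = generate_negatives(wave)
thr = predict_threshold([-3.0, -2.5], [-9.0, -7.2, -8.4])
```

In the real system the scores would come from the FST log-likelihood of the CTC posteriorgram, not from placeholder numbers.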

Optimization of Speaker Extraction Neural Network with Magnitude and Temporal Spectrum Approximation Loss

Mar 24, 2019
Chenglin Xu, Wei Rao, Eng Siong Chng, Haizhou Li

The SpeakerBeam-FE (SBF) method is proposed for speaker extraction. It attempts to overcome the problem of an unknown number of speakers in an audio recording during source separation. The mask approximation loss of SBF is sub-optimal: it neither computes the direct signal reconstruction error nor considers the speech context. To address these problems, this paper proposes a magnitude and temporal spectrum approximation loss to estimate a phase-sensitive mask for the target speaker with the speaker characteristics. Moreover, this paper explores a concatenation framework, instead of the context-adaptive deep neural network in the SBF method, to encode a speaker embedding into the mask estimation network. Experimental results under the open evaluation condition show that the proposed method achieves 70.4% and 17.7% relative improvements over the SBF baseline on signal-to-distortion ratio (SDR) and perceptual evaluation of speech quality (PESQ), respectively. A further analysis demonstrates 69.1% and 72.3% relative SDR improvements obtained by the proposed method for different- and same-gender mixtures.

* Accepted in ICASSP 2019 
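A plausible instantiation of the loss: the phase-sensitive target is |S| cos(θ_y − θ_s), the magnitude term compares the masked mixture against that target, and a temporal term matches first-order differences along time. The delta definition and the weight `w` are assumptions; the paper's exact formulation may differ.

```python
import numpy as np

def psm_target(mix_mag, tgt_mag, mix_phase, tgt_phase):
    """Ideal phase-sensitive mask: (|S| / |Y|) * cos(theta_y - theta_s)."""
    return (tgt_mag / np.maximum(mix_mag, 1e-8)) * np.cos(mix_phase - tgt_phase)

def mag_temporal_loss(est_mask, mix_mag, tgt_mag, mix_phase, tgt_phase, w=0.1):
    """Magnitude approximation loss plus a delta (temporal) spectrum term."""
    target = tgt_mag * np.cos(mix_phase - tgt_phase)  # phase-sensitive target
    est = est_mask * mix_mag                          # masked mixture magnitude
    mag_loss = np.mean((est - target) ** 2)
    # Temporal term: match first-order differences along the time axis.
    temp_loss = np.mean((np.diff(est, axis=0) - np.diff(target, axis=0)) ** 2)
    return mag_loss + w * temp_loss

# Sanity check: with identical phases and the ideal mask, the loss vanishes.
rng = np.random.default_rng(1)
mix_mag = rng.uniform(0.5, 1.5, (10, 4))   # 10 frames x 4 frequency bins
tgt_mag = 0.6 * mix_mag
phase = rng.uniform(-np.pi, np.pi, (10, 4))
mask = psm_target(mix_mag, tgt_mag, phase, phase)
loss = mag_temporal_loss(mask, mix_mag, tgt_mag, phase, phase)
```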

The Virtual Doctor: An Interactive Artificial Intelligence based on Deep Learning for Non-Invasive Prediction of Diabetes

Mar 09, 2019
Sebastian Spänig, Agnes Emberger-Klein, Jan-Peter Sowa, Ali Canbay, Klaus Menrad, Dominik Heider

Artificial intelligence (AI) will pave the way to a new era in medicine. However, currently available AI systems do not interact with a patient, e.g., for anamnesis, and thus are only used by physicians for predictions in diagnosis or prognosis. Nevertheless, these systems are widely used, e.g., in diabetes or cancer prediction. In the current study, we developed an AI that is able to interact with a patient (virtual doctor) by using a speech recognition and speech synthesis system and thus can autonomously interact with the patient, which is particularly important for, e.g., rural areas, where the availability of primary medical care is strongly limited by low population densities. As a proof of concept, the system is able to predict type 2 diabetes mellitus (T2DM) based on non-invasive sensors and deep neural networks. Moreover, the system provides an easy-to-interpret probability estimation of T2DM for a given patient. Besides the development of the AI, we further analyzed the acceptance of young people for AI in healthcare to estimate the impact of such a system in the future.

* 16 pages, 4 figures 

A Study of Language and Classifier-independent Feature Analysis for Vocal Emotion Recognition

Nov 14, 2018
Fatemeh Noroozi, Marina Marjanovic, Angelina Njegus, Sergio Escalera, Gholamreza Anbarjafari

Every speech signal carries implicit information about the emotions, which can be extracted by speech processing methods. In this paper, we propose an algorithm for extracting features that are independent of the spoken language and of the classification method, so as to obtain comparatively good recognition performance across languages regardless of the employed classifier. The proposed algorithm is composed of three stages. In the first stage, we propose a feature ranking method analyzing the state-of-the-art voice quality features. In the second stage, we propose a method for finding the subset of features common to each language and classifier. In the third stage, we compare our approach against the recognition rates of state-of-the-art filter methods. We use three databases in different languages, namely Polish, Serbian and English. Three different classifiers, namely nearest neighbour, support vector machine and gradient descent neural network, are also employed. It is shown that our method for selecting the most significant language-independent and method-independent features outperforms state-of-the-art filter methods in many cases.

* 24 pages, 4 figures 
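The second stage (finding the common feature subset) can be sketched as an intersection of the top-k features from each per-language, per-classifier ranking. The feature names and rankings below are hypothetical, and the paper's actual ranking criterion is not reproduced here.

```python
def common_top_features(rankings, k=3):
    """Intersect the top-k feature sets across per-language, per-classifier
    rankings, keeping only language- and classifier-independent features."""
    top_sets = [set(r[:k]) for r in rankings]
    common = set.intersection(*top_sets)
    # Preserve the order of the first ranking for a stable result.
    return [f for f in rankings[0] if f in common]

# Hypothetical voice-quality feature rankings (best first), one per language.
polish  = ["jitter", "shimmer", "hnr", "f0_mean"]
serbian = ["shimmer", "jitter", "f0_mean", "hnr"]
english = ["jitter", "hnr", "shimmer", "energy"]
selected = common_top_features([polish, serbian, english], k=3)
```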
