"speech": models, code, and papers

Robust Classification using Hidden Markov Models and Mixtures of Normalizing Flows

Feb 15, 2021
Anubhab Ghosh, Antoine Honoré, Dong Liu, Gustav Eje Henter, Saikat Chatterjee

We test the robustness of a maximum-likelihood (ML) classifier when the observed sequential data are corrupted by noise. The hypothesis is that a generative model combining the state transitions of a hidden Markov model (HMM) with neural-network-based probability distributions for the HMM's hidden states can provide robust classification performance. The combined model is called a normalizing-flow mixture model based HMM (NMM-HMM). It can be trained using a combination of expectation-maximization (EM) and backpropagation. We verify the improved robustness of NMM-HMM classifiers in an application to speech recognition.
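
The ML classification step can be sketched with a minimal log-space forward recursion. This is an illustrative NumPy sketch, not the paper's implementation: the emission log-densities, which in the NMM-HMM would come from a trained normalizing-flow mixture, are simply supplied here as an array.

```python
import numpy as np

def forward_loglik(log_A, log_pi, log_B):
    """Log-likelihood of a sequence under an HMM via the forward
    recursion in log space.

    log_A  : (S, S) log transition matrix, log_A[i, j] = log p(s_t=j | s_{t-1}=i)
    log_pi : (S,)   log initial state distribution
    log_B  : (T, S) per-frame emission log-densities log p(x_t | s_t=j);
             in the NMM-HMM these come from a normalizing-flow mixture,
             here they are just numbers.
    """
    alpha = log_pi + log_B[0]                        # log alpha_1(j)
    for t in range(1, log_B.shape[0]):
        # log-sum-exp over previous states for each current state j
        m = alpha.max()
        alpha = m + np.log(np.exp(alpha - m) @ np.exp(log_A)) + log_B[t]
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())
```

An ML classifier evaluates this log-likelihood under each class's HMM and picks the class with the largest value.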

* 6 pages. Accepted at MLSP 2020 


Teaching Digital Signal Processing by Partial Flipping, Active Learning and Visualization

Jan 31, 2021
Keshab K. Parhi

The effectiveness of teaching digital signal processing can be enhanced by reducing the lecture time devoted to theory and increasing the emphasis on applications, programming, visualization, and intuitive understanding. An integrated approach requires instructors to teach theory alongside its applications in the storage and processing of audio, speech, and biomedical signals. Student engagement can be strengthened by having students work in groups during class, solving short problems and short programming assignments or taking quizzes. These approaches increase student interest and engagement in the subject.

* IEEE Signal Processing Magazine, 38(3), 2021 


Punctuation Prediction in Spontaneous Conversations: Can We Mitigate ASR Errors with Retrofitted Word Embeddings?

Apr 13, 2020
Łukasz Augustyniak, Piotr Szymanski, Mikołaj Morzy, Piotr Zelasko, Adrian Szymczak, Jan Mizgajski, Yishay Carmiel, Najim Dehak

Automatic Speech Recognition (ASR) systems introduce word errors, which often confuse punctuation prediction models, turning punctuation restoration into a challenging task. These errors usually take the form of homonyms. We show how retrofitting word embeddings on domain-specific data can mitigate ASR errors. Our main contribution is a method for better alignment of homonym embeddings and the validation of the presented method on the punctuation prediction task. We record absolute improvements in punctuation prediction accuracy between 6.2% (for question marks) and 9% (for periods) compared with the state-of-the-art model.
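
The retrofitting idea the paper builds on can be sketched with the classic iterative update (in the spirit of Faruqui et al.): pull each word vector toward its lexicon neighbours while staying close to its original position. The vocabulary, neighbour lists, and hyperparameters below are illustrative; the paper's contribution is how homonym embeddings are aligned on domain-specific data, which this sketch does not reproduce.

```python
import numpy as np

def retrofit(embeddings, neighbors, iters=10, alpha=1.0, beta=1.0):
    """Iterative retrofitting: each vector moves toward the average of
    its neighbours' vectors, anchored to its original embedding.
    `neighbors` maps word -> list of related words."""
    new = {w: v.copy() for w, v in embeddings.items()}
    for _ in range(iters):
        for w, nbrs in neighbors.items():
            nbrs = [n for n in nbrs if n in new]
            if not nbrs:
                continue
            num = alpha * embeddings[w] + beta * sum(new[n] for n in nbrs)
            new[w] = num / (alpha + beta * len(nbrs))
    return new

# Toy homonym pair: pull "pause" and "paws" toward each other
emb = {"pause": np.array([0.0, 0.0]), "paws": np.array([2.0, 0.0])}
fixed = retrofit(emb, {"pause": ["paws"], "paws": ["pause"]}, iters=50)
```

With alpha = beta = 1 the two vectors converge to a compromise between their original positions and each other.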

* submitted to INTERSPEECH'20 


Representation Mixing for TTS Synthesis

Nov 24, 2018
Kyle Kastner, João Felipe Santos, Yoshua Bengio, Aaron Courville

Recent character and phoneme-based parametric TTS systems using deep learning have shown strong performance in natural speech generation. However, the choice between character or phoneme input can create serious limitations for practical deployment, as direct control of pronunciation is crucial in certain cases. We demonstrate a simple method for combining multiple types of linguistic information in a single encoder, named representation mixing, enabling flexible choice between character, phoneme, or mixed representations during inference. Experiments and user studies on a public audiobook corpus show the efficacy of our approach.
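
The core idea, sharing one encoder across input representations, can be sketched by summing a symbol embedding with a representation-type embedding per token. The vocabulary and dimensions below are toy assumptions, not the paper's actual tables.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8
# Joint symbol table: characters and phonemes share one embedding matrix
# (toy vocabulary; the paper uses a real lexicon and learned embeddings).
vocab = {s: i for i, s in enumerate(["c", "a", "t", "K", "AE", "T"])}
sym_emb = rng.normal(size=(len(vocab), D))
type_emb = rng.normal(size=(2, D))       # 0 = character, 1 = phoneme

def encode(tokens, types):
    """Per token, sum a symbol embedding and a representation-type
    embedding, so character, phoneme, or mixed inputs all flow
    through the same encoder."""
    return np.stack([sym_emb[vocab[t]] + type_emb[k]
                     for t, k in zip(tokens, types)])

mixed = encode(["c", "AE", "t"], [0, 1, 0])   # mixed char/phoneme input
```

At inference time the caller is free to supply all characters, all phonemes, or a mixture, which is what enables direct pronunciation control.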

* 5 pages, 3 figures 


Joint POS Tagging and Dependency Parsing with Transition-based Neural Networks

Apr 25, 2017
Liner Yang, Meishan Zhang, Yang Liu, Nan Yu, Maosong Sun, Guohong Fu

While part-of-speech (POS) tagging and dependency parsing are observed to be closely related, existing work on joint modeling with manually crafted feature templates suffers from the feature sparsity and incompleteness problems. In this paper, we propose an approach to joint POS tagging and dependency parsing using transition-based neural networks. Three neural network based classifiers are designed to resolve shift/reduce, tagging, and labeling conflicts. Experiments show that our approach significantly outperforms previous methods for joint POS tagging and dependency parsing across a variety of natural languages.
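
The transition system itself can be sketched with a minimal arc-standard style executor; this is illustrative only, and the paper's contribution is the three neural classifiers that choose among shift/reduce, tagging, and labeling actions, which are not modeled here.

```python
def run_transitions(words, actions):
    """Execute a joint tag+parse transition sequence.
    ("SHIFT", tag)  pushes the next buffer word, assigning a POS tag;
    ("LEFT", lbl)   attaches the second-topmost stack item to the top;
    ("RIGHT", lbl)  attaches the top stack item to the one below it."""
    buf, stack, tags, arcs = list(words), [], {}, []
    for act, arg in actions:
        if act == "SHIFT":
            w = buf.pop(0)
            tags[w] = arg
            stack.append(w)
        elif act == "LEFT":             # head is top of stack
            dep = stack.pop(-2)
            arcs.append((stack[-1], arg, dep))
        elif act == "RIGHT":            # pop the dependent off the top
            dep = stack.pop()
            arcs.append((stack[-1], arg, dep))
    return tags, arcs

# "She runs": shift+tag both words, then make "She" the subject of "runs"
tags, arcs = run_transitions(
    ["She", "runs"],
    [("SHIFT", "PRON"), ("SHIFT", "VERB"), ("LEFT", "nsubj")])
```

In the joint model, each action type would be scored by its own network over the current stack/buffer configuration.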


Probabilistic Binary-Mask Cocktail-Party Source Separation in a Convolutional Deep Neural Network

Mar 24, 2015
Andrew J. R. Simpson

Separation of competing speech is a key challenge in signal processing and a feat routinely performed by the human auditory brain. A long-standing benchmark of the spectrogram approach to source separation is the ideal binary mask. Here, we train a convolutional deep neural network on a two-speaker cocktail-party problem to make probabilistic predictions about binary masks. Our results approach ideal-binary-mask performance, illustrating that relatively simple deep neural networks are capable of robust binary mask prediction. We also illustrate the trade-off between prediction statistics and separation quality.
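
The ideal binary mask benchmark itself is simple to state: a time-frequency bin belongs to the target if the target's magnitude exceeds the interferer's there. A minimal oracle computation (the network in the paper is trained to predict this mask; toy spectrograms below):

```python
import numpy as np

def ideal_binary_mask(spec_target, spec_interferer):
    """1 where the target speaker dominates a time-frequency bin of
    the magnitude spectrogram, 0 elsewhere."""
    return (spec_target > spec_interferer).astype(float)

# Toy 2x3 magnitude spectrograms for two speakers
a = np.array([[3.0, 1.0, 2.0],
              [0.5, 4.0, 1.0]])
b = np.array([[1.0, 2.0, 2.5],
              [2.0, 1.0, 0.5]])
mask = ideal_binary_mask(a, b)
separated = mask * (a + b)      # apply the mask to the mixture magnitude
```

A predicted (probabilistic) mask replaces the oracle comparison with network outputs, thresholded or used as soft gains.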


Change-Point Detection in Time-Series Data by Relative Density-Ratio Estimation

Jan 16, 2013
Song Liu, Makoto Yamada, Nigel Collier, Masashi Sugiyama

The objective of change-point detection is to discover abrupt property changes lying behind time-series data. In this paper, we present a novel statistical change-point detection algorithm based on non-parametric divergence estimation between time-series samples from two retrospective segments. Our method uses the relative Pearson divergence as a divergence measure, and it is accurately and efficiently estimated by a method of direct density-ratio estimation. Through experiments on artificial and real-world datasets including human-activity sensing, speech, and Twitter messages, we demonstrate the usefulness of the proposed method.
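
A rough sketch of the divergence estimator: fit the alpha-relative density ratio between two segments by least squares with Gaussian kernels, then plug it into the relative Pearson divergence. This is a simplified RuLSIF-style estimator with a fixed kernel width and regularization (the paper selects these by cross-validation), so treat the constants as assumptions.

```python
import numpy as np

def rulsif_pe_divergence(x, y, alpha=0.1, sigma=1.0, lam=0.01):
    """Estimate the alpha-relative Pearson divergence PE(p || q) from
    1-D samples x ~ p and y ~ q by direct density-ratio fitting:
    model g(x) = sum_l theta_l K(x, c_l) for the relative ratio
    p / (alpha p + (1 - alpha) q), solved in closed form."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    centers = x[:50]                           # kernel centers from x
    K = lambda a: np.exp(-(a[:, None] - centers[None, :]) ** 2
                         / (2 * sigma ** 2))
    Kx, Ky = K(x), K(y)
    H = alpha * (Kx.T @ Kx) / len(x) + (1 - alpha) * (Ky.T @ Ky) / len(y)
    h = Kx.mean(axis=0)
    theta = np.linalg.solve(H + lam * np.eye(len(centers)), h)
    g_x, g_y = Kx @ theta, Ky @ theta          # fitted ratio values
    return h @ theta - 0.5 * (alpha * (g_x ** 2).mean()
                              + (1 - alpha) * (g_y ** 2).mean()) - 0.5

rng = np.random.default_rng(0)
x = rng.normal(size=200)
same = rulsif_pe_divergence(x, x)            # identical segments
shifted = rulsif_pe_divergence(x, x + 5.0)   # clear distribution change
```

For change-point detection, the score is computed between sliding retrospective segments and peaks at abrupt property changes.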


Three-Stage Quantitative Neural Network Model of the Tip-of-the-Tongue Phenomenon

Jul 09, 2001
Petro M. Gopych

A new three-stage artificial neural network model of the tip-of-the-tongue phenomenon is briefly described, and its stochastic nature is demonstrated. A way to calculate the strength and appearance probability of tip-of-the-tongue states, and a neural network mechanism for the feeling-of-knowing phenomenon, are proposed. The model synthesizes memory, psycholinguistic, and metamemory approaches, bridging the speech-errors and naming-chronometry research traditions. A model analysis of a tip-of-the-tongue case from Anton Chekhov's short story 'A Horsey Name' is performed. A new 'throw-up-one's-arms effect' is defined.

* Proceedings of the IX-th International Conference Knowledge-Dialog-Solution (KDS-2001), held on June 19-22, 2001 in St-Petersburg, Russia, pages 158-165 (in Russian) 


Multimodal Representation Learning With Text and Images

Apr 30, 2022
Aishwarya Jayagopal, Ankireddy Monica Aiswarya, Ankita Garg, Srinivasan Kolumam Nandakumar

In recent years, multimodal AI has seen an upward trend as researchers integrate data of different types, such as text, images, and speech, into their models to obtain the best results. This project leverages multimodal AI and matrix-factorization techniques for representation learning on text and image data simultaneously, employing widely used techniques from Natural Language Processing (NLP) and Computer Vision. The learnt representations are evaluated using downstream classification and regression tasks. The methodology can be extended beyond the scope of this project, as it uses auto-encoders for unsupervised representation learning.
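
The unsupervised representation-learning step can be sketched with a minimal linear auto-encoder on random stand-in features. This is illustrative only: the project uses real multimodal (text and image) features and richer models, and the dimensions and learning rate below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))    # stand-in for fused text+image feature vectors

# Linear auto-encoder: encode 6-d inputs to a 3-d representation and
# reconstruct, trained by gradient descent on mean squared error.
W_enc = 0.1 * rng.normal(size=(6, 3))
W_dec = 0.1 * rng.normal(size=(3, 6))
lr = 0.02
for _ in range(2000):
    Z = X @ W_enc                # learnt representation (the output we keep)
    X_hat = Z @ W_dec            # reconstruction
    err = X_hat - X
    W_dec -= lr * (Z.T @ err) / len(X)
    W_enc -= lr * (X.T @ (err @ W_dec.T)) / len(X)
loss = float((err ** 2).mean())
baseline = float((X ** 2).mean())   # error of predicting all zeros
```

The bottleneck representation `Z` is what gets passed to downstream classification and regression heads.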


Speaker recognition improvement using blind inversion of distortions

Feb 23, 2022
Marcos Faundez-Zanuy, Jordi Sole-Casals

In this paper we propose inverting nonlinear distortions in order to improve the recognition rates of a speaker recognition system. We study the effect of saturation on the test signals, taking into account real situations where the training material has been recorded under controlled conditions but the test signals exhibit some mismatch in input signal level (saturation). The experimental results show that combining data fusion with and without nonlinear distortion compensation can improve the recognition rate on saturated test sentences from 80% to 88.57%, while the result with clean speech (no saturation) is 87.76% for one microphone.
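
The distortion-and-inversion setup can be sketched as follows. Note the hedges: hard clipping is not invertible at all, which is why a smooth saturation model is used here, and the gain `g` is assumed known, whereas the paper estimates the distortion blindly from the signal itself.

```python
import numpy as np

def saturate(x, level=0.5):
    """Hard clipping at +/- level: the kind of mismatch studied
    in the paper (information in the clipped region is lost)."""
    return np.clip(x, -level, level)

def soft_saturate(x, g=3.0):
    """Smooth (tanh) saturation, an invertible stand-in."""
    return np.tanh(g * x)

def invert_soft(y, g=3.0):
    """Inverse of the tanh nonlinearity for known gain g
    (the blind method would estimate g from the data)."""
    return np.arctanh(np.clip(y, -0.999999, 0.999999)) / g

x = np.linspace(-0.3, 0.3, 7)          # stand-in for a speech frame
recovered = invert_soft(soft_saturate(x))
```

In the paper's fusion scheme, recognizer scores with and without such compensation are combined, which is what yields the reported gain on saturated test sentences.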

* EUSIPCO 2004, Vienna 
* 4 pages 
