Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"speech": models, code, and papers

Speech Emotion Recognition with Dual-Sequence LSTM Architecture

Oct 20, 2019
Jianyou Wang, Michael Xue, Ryan Culhane, Enmao Diao, Jie Ding, Vahid Tarokh

Speech Emotion Recognition (SER) has emerged as a critical component of the next generation of human-machine interfacing technologies. In this work, we propose a new dual-level model that combines handcrafted and raw features for audio signals. Each utterance is preprocessed into a handcrafted input and two mel-spectrograms at different time-frequency resolutions. An LSTM processes the handcrafted input, while a novel LSTM architecture, denoted as Dual-Sequence LSTM (DS-LSTM), processes the two mel-spectrograms simultaneously. The outputs are later averaged to produce a final classification of the utterance. Our proposed model achieves, on average, a weighted accuracy of 72.7% and an unweighted accuracy of 73.3% --- a 6% improvement over current state-of-the-art models --- and is comparable with multimodal SER models that leverage textual information.

* Submitted to ICASSP 2020 for review 

  Access Paper or Ask Questions

Using Deep Learning for Detecting Spoofing Attacks on Speech Signals

Jan 19, 2016
Alan Godoy, Flávio Simões, José Augusto Stuchi, Marcus de Assis Angeloni, Mário Uliani, Ricardo Violato

It is well known that speaker verification systems are subject to spoofing attacks. The Automatic Speaker Verification Spoofing and Countermeasures Challenge -- ASVSpoof2015 -- provides a standard spoofing database, containing attacks based on synthetic speech, along with a protocol for experiments. This paper describes CPqD's systems submitted to the ASVSpoof2015 Challenge, based on deep neural networks, working both as a classifier and as a feature extraction module for a GMM and a SVM classifier. Results show the validity of this approach, achieving less than 0.5\% EER for known attacks.

  Access Paper or Ask Questions

Crossmodal learning for audio-visual speech event localization

Mar 09, 2020
Rahul Sharma, Krishna Somandepalli, Shrikanth Narayanan

An objective understanding of media depictions, such as about inclusive portrayals of how much someone is heard and seen on screen in film and television, requires the machines to discern automatically who, when, how and where someone is talking. Media content is rich in multiple modalities such as visuals and audio which can be used to learn speaker activity in videos. In this work, we present visual representations that have implicit information about when someone is talking and where. We propose a crossmodal neural network for audio speech event detection using the visual frames. We use the learned representations for two downstream tasks: i) audio-visual voice activity detection ii) active speaker localization in video frames. We present a state-of-the-art audio-visual voice activity detection system and demonstrate that the learned embeddings can effectively localize to active speakers in the visual frames.

  Access Paper or Ask Questions

Tied Probabilistic Linear Discriminant Analysis for Speech Recognition

Nov 04, 2014
Liang Lu, Steve Renals

Acoustic models using probabilistic linear discriminant analysis (PLDA) capture the correlations within feature vectors using subspaces which do not vastly expand the model. This allows high dimensional and correlated feature spaces to be used, without requiring the estimation of multiple high dimension covariance matrices. In this letter we extend the recently presented PLDA mixture model for speech recognition through a tied PLDA approach, which is better able to control the model size to avoid overfitting. We carried out experiments using the Switchboard corpus, with both mel frequency cepstral coefficient features and bottleneck feature derived from a deep neural network. Reductions in word error rate were obtained by using tied PLDA, compared with the PLDA mixture model, subspace Gaussian mixture models, and deep neural networks.

  Access Paper or Ask Questions

CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus

Feb 04, 2020
Changhan Wang, Juan Pino, Anne Wu, Jiatao Gu

Spoken language translation has recently witnessed a resurgence in popularity, thanks to the development of end-to-end models and the creation of new corpora, such as Augmented LibriSpeech and MuST-C. Existing datasets involve language pairs with English as a source language, involve very specific domains or are low resource. We introduce CoVoST, a multilingual speech-to-text translation corpus from 11 languages into English, diversified with over 11,000 speakers and over 60 accents. We describe the dataset creation methodology and provide empirical evidence of the quality of the data. We also provide initial benchmarks, including, to our knowledge, the first end-to-end many-to-one multilingual models for spoken language translation. CoVoST is released under CC0 license and free to use. We also provide additional evaluation data derived from Tatoeba under CC licenses.

* Submitted to LREC 2020 

  Access Paper or Ask Questions

Phoneme-based speech recognition for commanding the robotic Arm

Jan 25, 2020
Adwait P Naik

Controlling a robot still requires traditional user interfaces. A more intuitive approach is using verbal or gesture commands. In this paper, we propose a robotic arm that can recognize the human voice commands. The Speech recognition is an essential asset for the robot, enhancing its ability to interact with human beings using their natural form of communication. The approach is verified by deploying the robotic arm into different environments with low, high and medium noise perturbations, where it is tested to perform a set of tasks. With this approach, we have successfully reduced the response time, enhanced the accuracy of the robotic arm to grasp the voice commands and reduced the sentence overlapping by a significant amount. The entire system is divided into three modules: the manipulator, the voice recognition module and the microcontroller. The robotic arm is programmed to orient to the direction where the signal to noise ratio is maximum.

* Pages - 6, Words - 2,581, Characters(with spaces)-13,971, Paragraphs- 140, Lines-540, Figures-8 

  Access Paper or Ask Questions

Instance-Based Model Adaptation For Direct Speech Translation

Oct 23, 2019
Mattia Antonino Di Gangi, Viet-Nhat Nguyen, Matteo Negri, Marco Turchi

Despite recent technology advancements, the effectiveness of neural approaches to end-to-end speech-to-text translation is still limited by the paucity of publicly available training corpora. We tackle this limitation with a method to improve data exploitation and boost the system's performance at inference time. Our approach allows us to customize "on the fly" an existing model to each incoming translation request. At its core, it exploits an instance selection procedure to retrieve, from a given pool of data, a small set of samples similar to the input query in terms of latent properties of its audio signal. The retrieved samples are then used for an instance-specific fine-tuning of the model. We evaluate our approach in three different scenarios. In all data conditions (different languages, in/out-of-domain adaptation), our instance-based adaptation yields coherent performance gains over static models.

* 6 pages, under review at ICASSP 2020 

  Access Paper or Ask Questions

Revisiting IPA-based Cross-lingual Text-to-speech

Oct 18, 2021
Haitong Zhang, Haoyue Zhan, Yang Zhang, Xinyuan Yu, Yue Lin

International Phonetic Alphabet (IPA) has been widely used in cross-lingual text-to-speech (TTS) to achieve cross-lingual voice cloning (CL VC). However, IPA itself has been understudied in cross-lingual TTS. In this paper, we report some empirical findings of building a cross-lingual TTS model using IPA as inputs. Experiments show that the way to process the IPA and suprasegmental sequence has a negligible impact on the CL VC performance. Furthermore, we find that using a dataset including one speaker per language to build an IPA-based TTS system would fail CL VC since the language-unique IPA and tone/stress symbols could leak the speaker information. In addition, we experiment with different combinations of speakers in the training dataset to further investigate the effect of the number of speakers on the CL VC performance.

* Submitted to ICASSP2022 

  Access Paper or Ask Questions

Comparing acoustic analyses of speech data collected remotely

Mar 01, 2021
Cong Zhang, Kathleen Jepson, Georg Lohfink, Amalia Arvaniti

Face-to-face speech data collection has been next to impossible globally due to COVID-19 restrictions. To address this problem, simultaneous recordings of three repetitions of the cardinal vowels were made using a Zoom H6 Handy Recorder with external microphone (henceforth H6) and compared with two alternatives accessible to potential participants at home: the Zoom meeting application (henceforth Zoom) and two lossless mobile phone applications (Awesome Voice Recorder, and Recorder; henceforth Phone). F0 was tracked accurately by all devices; however, for formant analysis (F1, F2, F3) Phone performed better than Zoom, i.e. more similarly to H6. Zoom recordings also exhibited unexpected drops in intensity. The results suggest that lossless format phone recordings present a viable option for at least some phonetic studies.

* 20 pages, 3 figures 

  Access Paper or Ask Questions

Interrupted and cascaded permutation invariant training for speech separation

Oct 28, 2019
Gene-Ping Yang, Szu-Lin Wu, Yao-Wen Mao, Hung-yi Lee, Lin-shan Lee

Permutation Invariant Training (PIT) has long been a stepping stone method for training speech separation model in handling the label ambiguity problem. With PIT selecting the minimum cost label assignments dynamically, very few studies considered the separation problem to be optimizing both the model parameters and the label assignments, but focused on searching for good model architecture and parameters. In this paper, we investigate instead for a given model architecture the various flexible label assignment strategies for training the model, rather than directly using PIT. Surprisingly, we discover a significant performance boost compared to PIT is possible if the model is trained with fixed label assignments and a good set of labels is chosen. With fixed label training cascaded between two sections of PIT, we achieved the state-of-the-art performance on WSJ0-2mix without changing the model architecture at all.

  Access Paper or Ask Questions