Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"speech": models, code, and papers

Acoustic feature learning using cross-domain articulatory measurements

Mar 20, 2018
Qingming Tang, Weiran Wang, Karen Livescu

Previous work has shown that it is possible to improve speech recognition by learning acoustic features from paired acoustic-articulatory data, for example by using canonical correlation analysis (CCA) or its deep extensions. One limitation of this prior work is that the learned feature models are difficult to port to new datasets or domains, and articulatory data is not available for most speech corpora. In this work we study the problem of acoustic feature learning in the setting where we have access to an external, domain-mismatched dataset of paired speech and articulatory measurements, either with or without labels. We develop methods for acoustic feature learning in these settings, based on deep variational CCA and extensions that use both source and target domain data and labels. Using this approach, we improve phonetic recognition accuracies on both TIMIT and Wall Street Journal and analyze a number of design choices.

* ICASSP 2018 

  Access Paper or Ask Questions

Seeing Through Noise: Visually Driven Speaker Separation and Enhancement

Feb 09, 2018
Aviv Gabbay, Ariel Ephrat, Tavi Halperin, Shmuel Peleg

Isolating the voice of a specific person while filtering out other voices or background noises is challenging when video is shot in noisy environments. We propose audio-visual methods to isolate the voice of a single speaker and eliminate unrelated sounds. First, face motions captured in the video are used to estimate the speaker's voice, by passing the silent video frames through a video-to-speech neural network-based model. Then the speech predictions are applied as a filter on the noisy input audio. This approach avoids using mixtures of sounds in the learning process, as the number of such possible mixtures is huge, and would inevitably bias the trained model. We evaluate our method on two audio-visual datasets, GRID and TCD-TIMIT, and show that our method attains significant SDR and PESQ improvements over the raw video-to-speech predictions, and a well-known audio-only method.

* Supplementary video: 

  Access Paper or Ask Questions

A Post Auto-regressive GAN Vocoder Focused on Spectrum Fracture

Apr 12, 2022
Zhenxing Lu, Mengnan He, Ruixiong Zhang, Caixia Gong

Generative adversarial networks (GANs) have been indicated their superiority in usage of the real-time speech synthesis. Nevertheless, most of them make use of deep convolutional layers as their backbone, which may cause the absence of previous signal information. However, the generation of speech signals invariably require preceding waveform samples in its reconstruction, as the lack of this can lead to artifacts in generated speech. To address this conflict, in this paper, we propose an improved model: a post auto-regressive (AR) GAN vocoder with a self-attention layer, which merging self-attention in an AR loop. It will not participate in inference, but can assist the generator to learn temporal dependencies within frames in training. Furthermore, an ablation study was done to confirm the contribution of each part. Systematic experiments show that our model leads to a consistent improvement on both objective and subjective evaluation performance.

  Access Paper or Ask Questions

Probabilistic Impact Score Generation using Ktrain-BERT to Identify Hate Words from Twitter Discussions

Nov 25, 2021
Sourav Das, Prasanta Mandal, Sanjay Chatterji

Social media has seen a worrying rise in hate speech in recent times. Branching to several distinct categories of cyberbullying, gender discrimination, or racism, the combined label for such derogatory content can be classified as toxic content in general. This paper presents experimentation with a Keras wrapped lightweight BERT model to successfully identify hate speech and predict probabilistic impact score for the same to extract the hateful words within sentences. The dataset used for this task is the Hate Speech and Offensive Content Detection (HASOC 2021) data from FIRE 2021 in English. Our system obtained a validation accuracy of 82.60%, with a maximum F1-Score of 82.68%. Subsequently, our predictive cases performed significantly well in generating impact scores for successful identification of the hate tweets as well as the hateful words from tweet pools.

* 11 Pages, 10 Figures 

  Access Paper or Ask Questions

Expressive Voice Conversion: A Joint Framework for Speaker Identity and Emotional Style Transfer

Jul 08, 2021
Zongyang Du, Berrak Sisman, Kun Zhou, Haizhou Li

Traditional voice conversion(VC) has been focused on speaker identity conversion for speech with a neutral expression. We note that emotional expression plays an essential role in daily communication, and the emotional style of speech can be speaker-dependent. In this paper, we study the technique to jointly convert the speaker identity and speaker-dependent emotional style, that is called expressive voice conversion. We propose a StarGAN-based framework to learn a many-to-many mapping across different speakers, that takes into account speaker-dependent emotional style without the need for parallel data. To achieve this, we condition the generator on emotional style encoding derived from a pre-trained speech emotion recognition(SER) model. The experiments validate the effectiveness of our proposed framework in both objective and subjective evaluations. To our best knowledge, this is the first study on expressive voice conversion.

* Submitted to ASRU 2021 

  Access Paper or Ask Questions

LEAP Submission for the Third DIHARD Diarization Challenge

Apr 06, 2021
Prachi Singh, Rajat Varma, Venkat Krishnamohan, Srikanth Raj Chetupalli, Sriram Ganapathy

The LEAP submission for DIHARD-III challenge is described in this paper. The proposed system is composed of a speech bandwidth classifier, and diarization systems fine-tuned for narrowband and wideband speech separately. We use an end-to-end speaker diarization system for the narrowband conversational telephone speech recordings. For the wideband multi-speaker recordings, we use a neural embedding based clustering approach, similar to the baseline system. The embeddings are extracted from a time-delay neural network (called x-vectors) followed by the graph based path integral clustering (PIC) approach. The LEAP system showed 24% and 18% relative improvements for Track-1 and Track-2 respectively over the baseline system provided by the organizers. This paper describes the challenge submission, the post-evaluation analysis and improvements observed on the DIHARD-III dataset.

* Submitted to INTERSPEECH 2021 

  Access Paper or Ask Questions

Memory Time Span in LSTMs for Multi-Speaker Source Separation

Aug 24, 2018
Jeroen Zegers, Hugo Van hamme

With deep learning approaches becoming state-of-the-art in many speech (as well as non-speech) related machine learning tasks, efforts are being taken to delve into the neural networks which are often considered as a black box. In this paper it is analyzed how recurrent neural network (RNNs) cope with temporal dependencies by determining the relevant memory time span in a long short-term memory (LSTM) cell. This is done by leaking the state variable with a controlled lifetime and evaluating the task performance. This technique can be used for any task to estimate the time span the LSTM exploits in that specific scenario. The focus in this paper is on the task of separating speakers from overlapping speech. We discern two effects: A long term effect, probably due to speaker characterization and a short term effect, probably exploiting phone-size formant tracks.

* Interspeech 2018 

  Access Paper or Ask Questions

Language Identification Using Deep Convolutional Recurrent Neural Networks

Aug 16, 2017
Christian Bartz, Tom Herold, Haojin Yang, Christoph Meinel

Language Identification (LID) systems are used to classify the spoken language from a given audio sample and are typically the first step for many spoken language processing tasks, such as Automatic Speech Recognition (ASR) systems. Without automatic language detection, speech utterances cannot be parsed correctly and grammar rules cannot be applied, causing subsequent speech recognition steps to fail. We propose a LID system that solves the problem in the image domain, rather than the audio domain. We use a hybrid Convolutional Recurrent Neural Network (CRNN) that operates on spectrogram images of the provided audio snippets. In extensive experiments we show, that our model is applicable to a range of noisy scenarios and can easily be extended to previously unknown languages, while maintaining its classification accuracy. We release our code and a large scale training set for LID systems to the community.

* to be presented at ICONIP 2017 

  Access Paper or Ask Questions

Neural Lattice-to-Sequence Models for Uncertain Inputs

Jul 21, 2017
Matthias Sperber, Graham Neubig, Jan Niehues, Alex Waibel

The input to a neural sequence-to-sequence model is often determined by an up-stream system, e.g. a word segmenter, part of speech tagger, or speech recognizer. These up-stream models are potentially error-prone. Representing inputs through word lattices allows making this uncertainty explicit by capturing alternative sequences and their posterior probabilities in a compact form. In this work, we extend the TreeLSTM (Tai et al., 2015) into a LatticeLSTM that is able to consume word lattices, and can be used as encoder in an attentional encoder-decoder model. We integrate lattice posterior scores into this architecture by extending the TreeLSTM's child-sum and forget gates and introducing a bias term into the attention mechanism. We experiment with speech translation lattices and report consistent improvements over baselines that translate either the 1-best hypothesis or the lattice without posterior scores.

* EMNLP 2017 

  Access Paper or Ask Questions

A Universal Deep Room Acoustics Estimator

Sep 29, 2021
Paula Sánchez López, Paul Callens, Milos Cernak

Speech audio quality is subject to degradation caused by an acoustic environment and isotropic ambient and point noises. The environment can lead to decreased speech intelligibility and loss of focus and attention by the listener. Basic acoustic parameters that characterize the environment well are (i) signal-to-noise ratio (SNR), (ii) speech transmission index, (iii) reverberation time, (iv) clarity, and (v) direct-to-reverberant ratio. Except for the SNR, these parameters are usually derived from the Room Impulse Response (RIR) measurements; however, such measurements are often not available. This work presents a universal room acoustic estimator design based on convolutional recurrent neural networks that estimate the acoustic environment measurement blindly and jointly. Our results indicate that the proposed system is robust to non-stationary signal variations and outperforms current state-of-the-art methods.

* WASPAA 2021 
* Room acoustics, Convolutional Recurrent Neural Network, RT60, C50, DRR, STI, SNR 

  Access Paper or Ask Questions