"speech": models, code, and papers

EIHW-MTG: Second DiCOVA Challenge System Report

Oct 18, 2021
Adria Mallol-Ragolta, Helena Cuesta, Emilia Gómez, Björn W. Schuller

This work presents an outer product-based approach to fuse the embedded representations generated from the spectrograms of cough, breath, and speech samples for the automatic detection of COVID-19. To extract deep learnt representations from the spectrograms, we compare the performance of a CNN trained from scratch and a ResNet18 architecture fine-tuned for the task at hand. Furthermore, we investigate whether the patients' sex and the use of contextual attention mechanisms are beneficial. Our experiments use the dataset released as part of the Second Diagnosing COVID-19 using Acoustics (DiCOVA) Challenge. The results suggest the suitability of fusing breath and speech information to detect COVID-19. An Area Under the Curve (AUC) of 84.06% is obtained on the test partition when using a CNN trained from scratch with contextual attention mechanisms. When using the ResNet18 architecture for feature extraction, the baseline model scores the highest performance with an AUC of 84.26%.
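
As a rough illustration of what outer product-based fusion can look like, the sketch below flattens the outer product of two embedding vectors and chains the operation for more than two inputs. The function names and toy values are assumptions for illustration only; the abstract does not give the embedding sizes or any further processing the authors apply.

```python
from functools import reduce

def outer_fuse(a, b):
    """Flattened outer product: one entry per pairwise product a[i] * b[j]."""
    return [x * y for x in a for y in b]

def fuse_embeddings(embeddings):
    """Fuse any number of embedding vectors by chaining outer products."""
    return reduce(outer_fuse, embeddings)

# Toy embeddings (hypothetical values, not taken from the paper).
breath = [0.5, -1.0]
speech = [2.0, 0.0, 1.0]

fused = fuse_embeddings([breath, speech])
print(len(fused), fused)
```

The fused vector grows multiplicatively with the input sizes (here 2 x 3 = 6 entries), which is why this kind of fusion is usually applied to compact embeddings.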

Adapting Pretrained Transformer to Lattices for Spoken Language Understanding

Nov 02, 2020
Chao-Wei Huang, Yun-Nung Chen

Lattices are compact representations that encode multiple hypotheses, such as speech recognition results or different word segmentations. It is shown that encoding lattices, as opposed to the 1-best results generated by an automatic speech recognizer (ASR), boosts the performance of spoken language understanding (SLU). Recently, pretrained language models with the transformer architecture have achieved state-of-the-art results on natural language understanding, but their ability to encode lattices has not been explored. Therefore, this paper aims at adapting pretrained transformers to lattice inputs in order to perform understanding tasks specifically for spoken language. Our experiments on the benchmark ATIS dataset show that fine-tuning pretrained transformers with lattice inputs yields clear improvement over fine-tuning with 1-best results. Further evaluation demonstrates the effectiveness of our methods under different acoustic conditions. Our code is available at

* ASRU 2019 

Using Holographically Compressed Embeddings in Question Answering

Jul 14, 2020
Salvador E. Barbosa

Word vector representations are central to deep learning natural language processing models. Many forms of these vectors, known as embeddings, exist, including word2vec and GloVe. Embeddings are trained on large corpora and learn the word's usage in context, capturing the semantic relationship between words. However, the semantics from such training are at the level of distinct words (known as word types), and can be ambiguous when, for example, a word type can be either a noun or a verb. In question answering, parts-of-speech and named entity types are important, but encoding these attributes in neural models expands the size of the input. This research employs holographic compression of pre-trained embeddings, to represent a token, its part-of-speech, and named entity type, in the same dimension as representing only the token. The implementation, in a modified question answering recurrent deep learning network, shows that semantic relationships are preserved, and yields strong performance.

* 12 pages, 6 figures, 1 table, 9th International Conference on Advanced Information Technologies and Applications (ICAITA 2020), July 11~12, 2020, Toronto, Canada, Advanced Natural Language Processing Sub-Conference (AdNLP 2020) 
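
One common way to realize this kind of compression is with holographic reduced representations, where vectors are bound by circular convolution and approximately recovered by circular correlation. The sketch below assumes that scheme; the abstract does not say exactly how the paper combines the token, part-of-speech, and entity vectors, and the role vectors, tag sets, and dimension used here are hypothetical.

```python
import math
import random

def bind(a, b):
    """Circular convolution: binds two vectors into one of the same length."""
    n = len(a)
    return [sum(a[j] * b[(i - j) % n] for j in range(n)) for i in range(n)]

def unbind(role, trace):
    """Circular correlation: approximately recovers what was bound to `role`."""
    n = len(role)
    return [sum(role[j] * trace[(i + j) % n] for j in range(n)) for i in range(n)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

rng = random.Random(0)
DIM = 512  # hypothetical embedding size

def random_vector():
    # Elements drawn from N(0, 1/DIM), so vectors have roughly unit length.
    return [rng.gauss(0.0, 1.0 / math.sqrt(DIM)) for _ in range(DIM)]

token = random_vector()  # stand-in for a pre-trained word embedding
pos_vectors = {"NOUN": random_vector(), "VERB": random_vector()}
ner_vectors = {"PERSON": random_vector(), "O": random_vector()}
role_pos, role_ner = random_vector(), random_vector()

# Superimpose the token with its role-bound attributes; the size stays DIM.
bound_pos = bind(role_pos, pos_vectors["NOUN"])
bound_ner = bind(role_ner, ner_vectors["PERSON"])
compressed = [t + p + e for t, p, e in zip(token, bound_pos, bound_ner)]

# Decode each attribute: unbind, then pick the closest known tag vector.
noisy_pos = unbind(role_pos, compressed)
decoded_pos = max(pos_vectors, key=lambda tag: dot(noisy_pos, pos_vectors[tag]))
noisy_ner = unbind(role_ner, compressed)
decoded_ner = max(ner_vectors, key=lambda tag: dot(noisy_ner, ner_vectors[tag]))
print(len(compressed), decoded_pos, decoded_ner)
```

Decoding is noisy by construction, so the recovered vector is compared against a small set of known tag vectors rather than read off directly.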

Cumulative Adaptation for BLSTM Acoustic Models

Jun 14, 2019
Markus Kitza, Pavel Golik, Ralf Schlüter, Hermann Ney

This paper addresses the robust speech recognition problem as an adaptation task. Specifically, we investigate the cumulative application of adaptation methods. A bidirectional Long Short-Term Memory (BLSTM) based neural network, capable of learning temporal relationships and translation invariant representations, is used for robust acoustic modelling. Further, i-vectors were used as an input to the neural network to perform instantaneous speaker and environment adaptation, providing 8% relative improvement in word error rate on the NIST Hub5 2000 evaluation test set. By enhancing the first-pass i-vector based adaptation with a second-pass adaptation using speaker and environment dependent transformations within the network, a further relative improvement of 5% in word error rate was achieved. We have reevaluated the features used to estimate i-vectors and their normalization to achieve the best performance in a modern large scale automatic speech recognition system.

* Submitted to Interspeech 2019 

Music and Vocal Separation Using Multi-Band Modulation Based Features

Jun 10, 2014
Sunil Kumar Kopparapu, Meghna Pandharipande, G Sita

The potential use of non-linear speech features has not been investigated for music analysis although other commonly used speech features like Mel Frequency Cepstral Coefficients (MFCC) and pitch have been used extensively. In this paper, we assume an audio signal to be a sum of modulated sinusoids and then use the energy separation algorithm to decompose the audio into amplitude and frequency modulation components using the non-linear Teager-Kaiser energy operator. We first identify the distribution of these non-linear features for music only and voice only segments in the audio signal in different Mel spaced frequency bands and show that they have the ability to discriminate. The proposed method based on Kullback-Leibler divergence measure is evaluated using a set of Indian classical songs from three different artists. Experimental results show that the discrimination ability is evident in certain low and mid frequency bands (200 - 1500 Hz).

* 5 pages, 5 figures, 2010 IEEE Symposium on Industrial Electronics Applications (ISIEA) 
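
For readers unfamiliar with the Teager-Kaiser energy operator, a minimal sketch of its usual discrete form, psi[x(n)] = x(n)^2 - x(n-1) x(n+1), is given below. This covers only the operator, not the full energy separation algorithm or the paper's Mel-band processing, and the test signal parameters are arbitrary choices for illustration.

```python
import math

def teager_kaiser(x):
    """Discrete Teager-Kaiser energy for the interior samples of x."""
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

# Toy signal: a pure sinusoid A * cos(omega * n) with hypothetical parameters.
amplitude, omega = 2.0, 0.3
signal = [amplitude * math.cos(omega * n) for n in range(50)]

energy = teager_kaiser(signal)

# For a pure sinusoid the operator is constant: A^2 * sin(omega)^2.
expected = amplitude ** 2 * math.sin(omega) ** 2
print(energy[0], expected)
```

Because the output depends on both amplitude and frequency, the operator can serve as the starting point for separating amplitude and frequency modulation components.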

Similarity-Based Estimation of Word Cooccurrence Probabilities

May 02, 1994
Ido Dagan, Fernando Pereira, Lillian Lee

In many applications of natural language processing it is necessary to determine the likelihood of a given word combination. For example, a speech recognizer may need to determine which of the two word combinations "eat a peach" and "eat a beach" is more likely. Statistical NLP methods determine the likelihood of a word combination according to its frequency in a training corpus. However, the nature of language is such that many word combinations are infrequent and do not occur in a given corpus. In this work we propose a method for estimating the probability of such previously unseen word combinations using available information on "most similar" words. We describe a probabilistic word association model based on distributional word similarity, and apply it to improving probability estimates for unseen word bigrams in a variant of Katz's back-off model. The similarity-based method yields a 20% perplexity improvement in the prediction of unseen bigrams and statistically significant reductions in speech-recognition error.

* 13 pages, to appear in proceedings of ACL-94 
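
The idea of borrowing probability mass from similar words can be sketched as a similarity-weighted average of the conditional probabilities of a word's neighbours. The code below is a simplified illustration under that assumption; the counts, neighbour set, and similarity weights are made up, and the back-off integration described in the abstract is omitted.

```python
def conditional_prob(bigram_counts, w1, w2):
    """Maximum-likelihood estimate of P(w2 | w1) from raw bigram counts."""
    total = sum(c for (first, _), c in bigram_counts.items() if first == w1)
    return bigram_counts.get((w1, w2), 0) / total if total else 0.0

def similarity_estimate(bigram_counts, neighbours, w2):
    """Average P(w2 | w1') over similar words w1', weighted by similarity."""
    norm = sum(neighbours.values())
    return sum(
        weight / norm * conditional_prob(bigram_counts, w1_sim, w2)
        for w1_sim, weight in neighbours.items()
    )

# Hypothetical corpus counts: "eat peach" is never observed.
counts = {
    ("eat", "apple"): 3, ("eat", "bread"): 1,
    ("devour", "peach"): 2, ("devour", "apple"): 2,
    ("consume", "peach"): 1, ("consume", "bread"): 3,
}
# Hypothetical similarity weights for the words most similar to "eat".
similar_to_eat = {"devour": 3.0, "consume": 1.0}

mle_peach = conditional_prob(counts, "eat", "peach")
sim_peach = similarity_estimate(counts, similar_to_eat, "peach")
sim_beach = similarity_estimate(counts, similar_to_eat, "beach")
print(mle_peach, sim_peach, sim_beach)
```

In this toy setup the unseen pair "eat peach" receives a non-zero estimate because similar verbs have been seen with "peach", while "eat beach" stays at zero.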

Automatic Spoken Language Identification using a Time-Delay Neural Network

May 19, 2022
Benjamin Kepecs, Homayoon Beigi

Closed-set spoken language identification is the task of recognizing the language being spoken in a recorded audio clip from a set of known languages. In this study, a language identification system was built and trained to distinguish between Arabic, Spanish, French, and Turkish based on nothing more than recorded speech. A pre-existing multilingual dataset was used to train a series of acoustic models based on the Tedlium TDNN model to perform automatic speech recognition. The system was provided with a custom multilingual language model and a specialized pronunciation lexicon with language names prepended to phones. The trained model was used to generate phone alignments to test data from all four languages, and languages were predicted based on a voting scheme choosing the most common language prepend in an utterance. Accuracy was measured by comparing predicted languages to known languages, and was determined to be very high in identifying Spanish and Arabic, and somewhat lower in identifying Turkish and French.

* 6 pages, 6 figures, Technical Report Recognition Technologies, Inc 
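
The voting step described above can be sketched in a few lines. The label format (language name, underscore, phone) and the example alignment are assumptions for illustration; the abstract does not specify how the language prefix is encoded in the lexicon.

```python
from collections import Counter

def predict_language(aligned_phones, separator="_"):
    """Return the language prefix that occurs most often in an utterance."""
    votes = Counter(phone.split(separator, 1)[0] for phone in aligned_phones)
    return votes.most_common(1)[0][0]

# Hypothetical phone alignment for one utterance.
alignment = ["spanish_o", "spanish_l", "french_a", "spanish_a", "turkish_k"]
predicted = predict_language(alignment)
print(predicted)
```

With `Counter.most_common`, ties are resolved in favour of the prefix encountered first, so a real system would likely need an explicit tie-breaking rule.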

A Comparison and Combination of Unsupervised Blind Source Separation Techniques

Jun 10, 2021
Christoph Boeddeker, Frederik Rautenberg, Reinhold Haeb-Umbach

Unsupervised blind source separation methods do not require a training phase and thus cannot suffer from a train-test mismatch, which is a common concern in neural network based source separation. The unsupervised techniques can be categorized in two classes, those building upon the sparsity of speech in the Short-Time Fourier transform domain and those exploiting non-Gaussianity or non-stationarity of the source signals. In this contribution, spatial mixture models which fall in the first category and independent vector analysis (IVA) as a representative of the second category are compared w.r.t. their separation performance and the performance of a downstream speech recognizer on a reverberant dataset of reasonable size. Furthermore, we introduce a serial concatenation of the two, where the result of the mixture model serves as initialization of IVA, which achieves significantly better WER performance than each algorithm individually and even approaches the performance of a much more complex neural network based technique.

* Submitted to ITG 2021 

Clotho: An Audio Captioning Dataset

Oct 21, 2019
Konstantinos Drossos, Samuel Lipping, Tuomas Virtanen

Audio captioning is the novel task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts as an input an audio signal and outputs the textual description (i.e. the caption) of that signal. In this paper we present Clotho, a dataset for audio captioning consisting of 4981 audio samples of 15 to 30 seconds duration and 24 905 captions of eight to 20 words length, and a baseline method to provide initial results. Clotho is built with focus on audio content and caption diversity, and the splits of the data are not hampering the training or evaluation of methods. All sounds are from the Freesound platform, and captions are crowdsourced using Amazon Mechanical Turk and annotators from English speaking countries. Unique words, named entities, and speech transcription are removed with post-processing. Clotho is freely available online (

Predicting the Type and Target of Offensive Posts in Social Media

Apr 16, 2019
Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, Ritesh Kumar

As offensive content has become pervasive in social media, there has been much research in identifying potentially offensive messages. However, previous work on this topic did not consider the problem as a whole, but rather focused on detecting very specific types of offensive content, e.g., hate speech, cyberbullying, or cyber-aggression. In contrast, here we target several different kinds of offensive content. In particular, we model the task hierarchically, identifying the type and the target of offensive messages in social media. For this purpose, we compiled the Offensive Language Identification Dataset (OLID), a new dataset with tweets annotated for offensive content using a fine-grained three-layer annotation scheme, which we make publicly available. We discuss the main similarities and differences between OLID and pre-existing datasets for hate speech identification, aggression detection, and similar tasks. We further experiment with and compare the performance of different machine learning models on OLID.

* Proceedings of the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) 
