Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Thomas Niesler

Feature Trajectory Dynamic Time Warping for Clustering of Speech Segments

Oct 30, 2018
Lerato Lerato, Thomas Niesler

Figure 1 for Feature Trajectory Dynamic Time Warping for Clustering of Speech Segments

Figure 2 for Feature Trajectory Dynamic Time Warping for Clustering of Speech Segments

Figure 3 for Feature Trajectory Dynamic Time Warping for Clustering of Speech Segments

Figure 4 for Feature Trajectory Dynamic Time Warping for Clustering of Speech Segments

Dynamic time warping (DTW) can be used to compute the similarity between two sequences of generally differing length. We propose a modification to DTW that performs individual and independent pairwise alignment of feature trajectories. The modified technique, termed feature trajectory dynamic time warping (FTDTW), is applied as a similarity measure in the agglomerative hierarchical clustering of speech segments. Experiments using MFCC and PLP parametrisations extracted from TIMIT and from the Spoken Arabic Digit Dataset (SADD) show consistent and statistically significant improvements in the quality of the resulting clusters in terms of F-measure and normalised mutual information (NMI).

* 10 pages

Via

Access Paper or Ask Questions

Building a Unified Code-Switching ASR System for South African Languages

Jul 28, 2018
Emre Yılmaz, Astik Biswas, Ewald van der Westhuizen, Febe de Wet, Thomas Niesler

Figure 1 for Building a Unified Code-Switching ASR System for South African Languages

Figure 2 for Building a Unified Code-Switching ASR System for South African Languages

Figure 3 for Building a Unified Code-Switching ASR System for South African Languages

Figure 4 for Building a Unified Code-Switching ASR System for South African Languages

We present our first efforts towards building a single multilingual automatic speech recognition (ASR) system that can process code-switching (CS) speech in five languages spoken within the same population. This contrasts with related prior work which focuses on the recognition of CS speech in bilingual scenarios. Recently, we have compiled a small five-language corpus of South African soap opera speech which contains examples of CS between 5 languages occurring in various contexts such as using English as the matrix language and switching to other indigenous languages. The ASR system presented in this work is trained on 4 corpora containing English-isiZulu, English-isiXhosa, English-Setswana and English-Sesotho CS speech. The interpolation of multiple language models trained on these language pairs enables the ASR system to hypothesize mixed word sequences from these 5 languages. We evaluate various state-of-the-art acoustic models trained on this 5-lingual training data and report ASR accuracy and language recognition performance on the development and test sets of the South African multilingual soap opera corpus.

* Acccepted for publication at Interspeech 2018

Via

Access Paper or Ask Questions

Automatic Speech Recognition for Humanitarian Applications in Somali

Jul 23, 2018
Raghav Menon, Astik Biswas, Armin Saeb, John Quinn, Thomas Niesler

Figure 1 for Automatic Speech Recognition for Humanitarian Applications in Somali

Figure 2 for Automatic Speech Recognition for Humanitarian Applications in Somali

Figure 3 for Automatic Speech Recognition for Humanitarian Applications in Somali

Figure 4 for Automatic Speech Recognition for Humanitarian Applications in Somali

We present our first efforts in building an automatic speech recognition system for Somali, an under-resourced language, using 1.57 hrs of annotated speech for acoustic model training. The system is part of an ongoing effort by the United Nations (UN) to implement keyword spotting systems supporting humanitarian relief programmes in parts of Africa where languages are severely under-resourced. We evaluate several types of acoustic model, including recent neural architectures. Language model data augmentation using a combination of recurrent neural networks (RNN) and long short-term memory neural networks (LSTMs) as well as the perturbation of acoustic data are also considered. We find that both types of data augmentation are beneficial to performance, with our best system using a combination of convolutional neural networks (CNNs), time-delay neural networks (TDNNs) and bi-directional long short term memory (BLSTMs) to achieve a word error rate of 53.75%.

* 5 pages, 3 figures, 5 tables accepted at SLTU 2018

Via

Access Paper or Ask Questions

ASR-free CNN-DTW keyword spotting using multilingual bottleneck features for almost zero-resource languages

Jul 23, 2018
Raghav Menon, Herman Kamper, Emre Yilmaz, John Quinn, Thomas Niesler

Figure 1 for ASR-free CNN-DTW keyword spotting using multilingual bottleneck features for almost zero-resource languages

Figure 2 for ASR-free CNN-DTW keyword spotting using multilingual bottleneck features for almost zero-resource languages

Figure 3 for ASR-free CNN-DTW keyword spotting using multilingual bottleneck features for almost zero-resource languages

Figure 4 for ASR-free CNN-DTW keyword spotting using multilingual bottleneck features for almost zero-resource languages

We consider multilingual bottleneck features (BNFs) for nearly zero-resource keyword spotting. This forms part of a United Nations effort using keyword spotting to support humanitarian relief programmes in parts of Africa where languages are severely under-resourced. We use 1920 isolated keywords (40 types, 34 minutes) as exemplars for dynamic time warping (DTW) template matching, which is performed on a much larger body of untranscribed speech. These DTW costs are used as targets for a convolutional neural network (CNN) keyword spotter, giving a much faster system than direct DTW. Here we consider how available data from well-resourced languages can improve this CNN-DTW approach. We show that multilingual BNFs trained on ten languages improve the area under the ROC curve of a CNN-DTW system by 10.9% absolute relative to the MFCC baseline. By combining low-resource DTW-based supervision with information from well-resourced languages, CNN-DTW is a competitive option for low-resource keyword spotting.

* 5 pages, 3 figures, 3 tables, 1 equation accepted at SLTU 2018

Via

Access Paper or Ask Questions

Fast ASR-free and almost zero-resource keyword spotting using DTW and CNNs for humanitarian monitoring

Jun 25, 2018
Raghav Menon, Herman Kamper, John Quinn, Thomas Niesler

Figure 1 for Fast ASR-free and almost zero-resource keyword spotting using DTW and CNNs for humanitarian monitoring

Figure 2 for Fast ASR-free and almost zero-resource keyword spotting using DTW and CNNs for humanitarian monitoring

Figure 3 for Fast ASR-free and almost zero-resource keyword spotting using DTW and CNNs for humanitarian monitoring

Figure 4 for Fast ASR-free and almost zero-resource keyword spotting using DTW and CNNs for humanitarian monitoring

We use dynamic time warping (DTW) as supervision for training a convolutional neural network (CNN) based keyword spotting system using a small set of spoken isolated keywords. The aim is to allow rapid deployment of a keyword spotting system in a new language to support urgent United Nations (UN) relief programmes in parts of Africa where languages are extremely under-resourced and the development of annotated speech resources is infeasible. First, we use 1920 recorded keywords (40 keyword types, 34 minutes of speech) as exemplars in a DTW-based template matching system and apply it to untranscribed broadcast speech. Then, we use the resulting DTW scores as targets to train a CNN on the same unlabelled speech. In this way we use just 34 minutes of labelled speech, but leverage a large amount of unlabelled data for training. While the resulting CNN keyword spotter cannot match the performance of the DTW-based system, it substantially outperforms a CNN classifier trained only on the keywords, improving the area under the ROC curve from 0.54 to 0.64. Because our CNN system is several orders of magnitude faster at runtime than the DTW system, it represents the most viable keyword spotter on this extremely limited dataset.

* 5 pages, 4 figures, 3 tables, accepted at Interspeech 2018

Via

Access Paper or Ask Questions