"speech": models, code, and papers

Dynamic curriculum learning via data parameters for noise robust keyword spotting

Feb 18, 2021
Takuya Higuchi, Shreyas Saxena, Mehrez Souden, Tien Dung Tran, Masood Delfarah, Chandra Dhir

We propose dynamic curriculum learning via data parameters for noise-robust keyword spotting. Data parameter learning has recently been introduced for image processing, where weight parameters, so-called data parameters, for target classes and instances are introduced and optimized along with model parameters. The data parameters scale logits and control importance over classes and instances during training, which enables automatic curriculum learning without additional annotations for training data. Similarly, in this paper, we propose using this curriculum learning approach for acoustic modeling, and train an acoustic model on clean and noisy utterances with the data parameters. The proposed approach automatically learns the difficulty of the classes and instances, e.g., due to a low speech-to-noise ratio (SNR), in the gradient descent optimization and performs curriculum learning. This curriculum learning leads to an overall improvement in the accuracy of the acoustic model. We evaluate the effectiveness of the proposed approach on a keyword spotting task. Experimental results show a 7.7% relative reduction in false reject ratio with the data parameters compared to a baseline model which is simply trained on the multiconditioned dataset.

* Accepted at ICASSP 2021 
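
The abstract above says that data parameters scale the logits and thereby control the importance of classes and instances. A minimal sketch of that idea follows. It assumes the class and instance parameters are summed into a single scale that divides the logits before softmax cross-entropy; that combination rule, the function names, and the toy numbers are illustrative assumptions and not details stated in the abstract. The update of the data parameters themselves is omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_cross_entropy(logits, target, class_param, instance_param):
    """Cross-entropy on logits divided by a data-parameter scale.

    Assumption: the scale is class_param + instance_param (hypothetical rule).
    """
    sigma = class_param + instance_param
    probs = softmax([z / sigma for z in logits])
    return -math.log(probs[target])


def grad_wrt_logits(logits, target, class_param, instance_param):
    """Gradient of the scaled loss with respect to the raw logits.

    For softmax cross-entropy on z / sigma this is (p - onehot) / sigma,
    so a larger scale shrinks the gradient that the sample contributes.
    """
    sigma = class_param + instance_param
    probs = softmax([z / sigma for z in logits])
    return [(p - (1.0 if k == target else 0.0)) / sigma
            for k, p in enumerate(probs)]


if __name__ == "__main__":
    logits = [2.0, 0.5, -1.0]   # toy model outputs
    target = 2                  # a misclassified (hard) sample
    print(grad_wrt_logits(logits, target, 0.5, 0.5))  # scale 1.0
    print(grad_wrt_logits(logits, target, 2.0, 2.0))  # scale 4.0, smaller gradient
```

Under these assumptions, a sample with a large scale contributes a smaller gradient to the model, which is one way a learned scale can down-weight hard or noisy samples during training.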

A survey of joint intent detection and slot-filling models in natural language understanding

Jan 22, 2021
H. Weld, X. Huang, S. Long, J. Poon, S. C. Han

Intent classification and slot filling are two critical tasks for natural language understanding. Traditionally the two tasks have been deemed to proceed independently. However, more recently, joint models for intent classification and slot filling have achieved state-of-the-art performance, and have proved that there exists a strong relationship between the two tasks. This article is a compilation of past work in natural language understanding, especially joint intent classification and slot filling. We observe three milestones in this research so far: intent detection to identify the speaker's intention, slot filling to label each word token in the speech/text, and finally, joint intent classification and slot filling tasks. In this article, we describe trends, approaches, issues, data sets, and evaluation metrics in intent classification and slot filling. We also discuss representative performance values, describe shared tasks, and provide pointers to future work, as given in prior works. To interpret the state-of-the-art trends, we provide multiple tables that describe and summarise past research along different dimensions, including the types of features, base approaches, and dataset domain used.

* 33 pages 

Investigating Bi-LSTM and CRF with POS Tag Embedding for Indonesian Named Entity Tagger

Sep 11, 2020
Devin Hoesen, Ayu Purwarianti

Research on Indonesian named entity (NE) taggers has been conducted for years. However, most of it did not use deep learning and instead employed traditional machine learning algorithms such as association rules, support vector machines, random forests, naïve Bayes, etc. In those studies, word lists serving as gazetteers or clue words were provided to enhance the accuracy. Here, we attempt to employ deep learning in our Indonesian NE tagger. We use long short-term memory (LSTM) as the topology since it is the state of the art for NE tagging. By using LSTM, we do not need a word list in order to enhance the accuracy. There are two main things that we investigate. The first is the output layer of the network: Softmax vs conditional random field (CRF). The second is the use of a part-of-speech (POS) tag embedding input layer. Using 8400 sentences as the training data and 97 sentences as the evaluation data, we find that using POS tag embedding as additional input improves the performance of our Indonesian NE tagger. As for the comparison between Softmax and CRF, we find that both architectures have a weakness in classifying an NE tag.

* 2018 International Conference on Asian Language Processing (IALP) 

Evaluation of Neural Architectures Trained with Square Loss vs Cross-Entropy in Classification Tasks

Jun 12, 2020
Like Hui, Mikhail Belkin

Modern neural architectures for classification tasks are trained using the cross-entropy loss, which is believed to be empirically superior to the square loss. In this work we provide evidence indicating that this belief may not be well-founded. We explore several major neural architectures and a range of standard benchmark datasets for NLP, automatic speech recognition (ASR) and computer vision tasks to show that these architectures, with the same hyper-parameter settings as reported in the literature, perform comparably or better when trained with the square loss, even after equalizing computational resources. Indeed, we observe that the square loss produces better results in the dominant majority of NLP and ASR experiments. Cross-entropy appears to have a slight edge on computer vision tasks. We argue that there is little compelling empirical or theoretical evidence indicating a clear-cut advantage to the cross-entropy loss. Indeed, in our experiments, performance on nearly all non-vision tasks can be improved, sometimes significantly, by switching to the square loss. We posit that training using the square loss for classification needs to be a part of best practices of modern deep learning on equal footing with cross-entropy.

* 13 pages, 1 figure with 3 subfigures, 15 tables 
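
To make the comparison in the abstract above concrete, the two losses can be computed for a single example as in the sketch below. It assumes the square loss is taken between the raw network outputs and a one-hot target and averaged over classes; the abstract does not spell out this detail, so treat that choice, the function names, and the toy values as illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_loss(logits, target):
    """Negative log-probability of the target class after softmax."""
    return -math.log(softmax(logits)[target])


def square_loss(logits, target):
    """Mean squared error between raw outputs and the one-hot target.

    Assumption: no softmax is applied before the squared error.
    """
    return sum((z - (1.0 if k == target else 0.0)) ** 2
               for k, z in enumerate(logits)) / len(logits)


if __name__ == "__main__":
    outputs = [0.9, 0.2, -0.1]  # toy network outputs for three classes
    print("cross-entropy:", cross_entropy_loss(outputs, 0))
    print("square loss:  ", square_loss(outputs, 0))
```

Both losses are minimized by pushing the output for the target class up relative to the others. The square loss additionally penalizes outputs that overshoot the one-hot target, whereas cross-entropy keeps rewarding larger margins.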

Can Multilingual Language Models Transfer to an Unseen Dialect? A Case Study on North African Arabizi

May 01, 2020
Benjamin Muller, Benoit Sagot, Djamé Seddah

Building natural language processing systems for non-standardized and low-resource languages is a difficult challenge. The recent success of large-scale multilingual pretrained language models provides new modeling tools to tackle this. In this work, we study the ability of multilingual language models to process an unseen dialect. We take user-generated North-African Arabic as our case study, a resource-poor dialectal variety of Arabic with frequent code-mixing with French and written in Arabizi, a non-standardized transliteration of Arabic to Latin script. Focusing on two tasks, part-of-speech tagging and dependency parsing, we show in zero-shot and unsupervised adaptation scenarios that multilingual language models are able to transfer to such an unseen dialect, specifically in two extreme cases: (i) across scripts, using Modern Standard Arabic as a source language, and (ii) from a distantly related language, unseen during pretraining, namely Maltese. Our results constitute the first successful transfer experiments on this dialect, thus paving the way for the development of an NLP ecosystem for resource-scarce, non-standardized and highly variable vernacular languages.

Adversarial Machine Learning in Network Intrusion Detection Systems

Apr 23, 2020
Elie Alhajjar, Paul Maxwell, Nathaniel D. Bastian

Adversarial examples are inputs to a machine learning system intentionally crafted by an attacker to fool the model into producing an incorrect output. These examples have achieved a great deal of success in several domains such as image recognition, speech recognition and spam detection. In this paper, we study the nature of the adversarial problem in Network Intrusion Detection Systems (NIDS). We focus on the attack perspective, which includes techniques to generate adversarial examples capable of evading a variety of machine learning models. More specifically, we explore the use of evolutionary computation (particle swarm optimization and genetic algorithm) and deep learning (generative adversarial networks) as tools for adversarial example generation. To assess the performance of these algorithms in evading a NIDS, we apply them to two publicly available data sets, namely the NSL-KDD and UNSW-NB15, and we contrast them to a baseline perturbation method: Monte Carlo simulation. The results show that our adversarial example generation techniques cause high misclassification rates in eleven different machine learning models, along with a voting classifier. Our work highlights the vulnerability of machine learning based NIDS in the face of adversarial perturbation.

* 25 pages, 6 figures, 4 tables 

Improving Efficiency in Large-Scale Decentralized Distributed Training

Feb 04, 2020
Wei Zhang, Xiaodong Cui, Abdullah Kayi, Mingrui Liu, Ulrich Finkler, Brian Kingsbury, George Saon, Youssef Mroueh, Alper Buyuktosunoglu, Payel Das, David Kung, Michael Picheny

Decentralized Parallel SGD (D-PSGD) and its asynchronous variant Asynchronous Decentralized Parallel SGD (AD-PSGD) are a family of distributed learning algorithms that have been demonstrated to perform well for large-scale deep learning tasks. One drawback of (A)D-PSGD is that the spectral gap of the mixing matrix decreases when the number of learners in the system increases, which hampers convergence. In this paper, we investigate techniques to accelerate (A)D-PSGD-based training by improving the spectral gap while minimizing the communication cost. We demonstrate the effectiveness of our proposed techniques by running experiments on the 2000-hour Switchboard speech recognition task and the ImageNet computer vision task. On an IBM P9 supercomputer, our system is able to train an LSTM acoustic model in 2.28 hours with 7.5% WER on the Hub5-2000 Switchboard (SWB) test set and 13.3% WER on the CallHome (CH) test set using 64 V100 GPUs and in 1.98 hours with 7.7% WER on SWB and 13.3% WER on CH using 128 V100 GPUs, the fastest training time reported to date.

* 45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP'2020) Oral 
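
The drawback named in the abstract above, a spectral gap that shrinks as learners are added, can be illustrated with a small sketch. It assumes a ring topology in which every learner averages its own model with those of its two neighbours using equal weights of 1/3; the abstract does not specify a topology, so the ring and the function names are assumptions for illustration only. The mixing matrix of such a ring is circulant, so its eigenvalues have the closed form used below.

```python
import math


def ring_mixing_eigenvalues(n):
    """Eigenvalues of the n x n ring mixing matrix with weights 1/3.

    The matrix is symmetric and circulant, so its eigenvalues are
    (1 + 2 * cos(2 * pi * k / n)) / 3 for k = 0, ..., n - 1.
    Requires n >= 3 so each learner has two distinct neighbours.
    """
    if n < 3:
        raise ValueError("the ring sketch needs at least 3 learners")
    return [(1.0 + 2.0 * math.cos(2.0 * math.pi * k / n)) / 3.0
            for k in range(n)]


def spectral_gap(n):
    """One minus the second-largest eigenvalue magnitude."""
    eigenvalues = ring_mixing_eigenvalues(n)
    # k = 0 gives the eigenvalue 1 (the consensus direction); skip it.
    second_largest = max(abs(v) for v in eigenvalues[1:])
    return 1.0 - second_largest


if __name__ == "__main__":
    for learners in (4, 8, 16, 64, 128):
        print(learners, spectral_gap(learners))
```

Under these assumptions the gap falls quickly as the ring grows, which means more mixing rounds are needed before the learners' models agree.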

MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis

Oct 28, 2019
Kundan Kumar, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brebisson, Yoshua Bengio, Aaron Courville

Previous works (Donahue et al., 2018a; Engel et al., 2019a) have found that generating coherent raw audio waveforms with GANs is challenging. In this paper, we show that it is possible to train GANs reliably to generate high quality coherent waveforms by introducing a set of architectural changes and simple training techniques. A subjective evaluation metric (Mean Opinion Score, or MOS) shows the effectiveness of the proposed approach for high quality mel-spectrogram inversion. To establish the generality of the proposed techniques, we show qualitative results of our model in speech synthesis, music domain translation and unconditional music synthesis. We evaluate the various components of the model through ablation studies and suggest a set of guidelines to design general purpose discriminators and generators for conditional sequence synthesis tasks. Our model is non-autoregressive, fully convolutional, with significantly fewer parameters than competing models and generalizes to unseen speakers for mel-spectrogram inversion. Our PyTorch implementation runs more than 100x faster than real time on a GTX 1080Ti GPU and more than 2x faster than real time on CPU, without any hardware-specific optimization tricks.

Random Directional Attack for Fooling Deep Neural Networks

Aug 06, 2019
Wenjian Luo, Chenwang Wu, Nan Zhou, Li Ni

Deep neural networks (DNNs) have been widely used in many fields such as image processing and speech recognition; however, they are vulnerable to adversarial examples, and this is a security issue worthy of attention. Because the training process of DNNs reduces the loss by updating the weights along the gradient descent direction, many gradient-based methods attempt to destroy the DNN model by adding perturbations in the gradient direction. Unfortunately, as the model is nonlinear in most cases, the addition of perturbations in the gradient direction does not necessarily increase loss. Thus, we propose a random directed attack (RDA) for generating adversarial examples in this paper. Rather than restricting the attack to the gradient direction, RDA searches for the attack direction based on hill climbing and uses multiple strategies to avoid local optima that cause attack failure. Compared with state-of-the-art gradient-based methods, the attack performance of RDA is very competitive. Moreover, RDA can attack without any internal knowledge of the model, and its performance under black-box attack is similar to that of the white-box attack in most cases, which is difficult to achieve using existing gradient-based attack methods.

* 13 pages
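
The abstract above describes searching for an attack direction by hill climbing with only black-box access to the model. The sketch below shows plain random-direction hill climbing against a toy model. It is not the authors' RDA: the strategies for escaping local optima are omitted, and the toy logistic model, the L-infinity budget `eps`, the step size, and all names are hypothetical choices made for illustration.

```python
import math
import random

# Hypothetical fixed toy model; the attack below only queries its loss.
WEIGHTS = [1.5, -2.0, 0.5]
BIAS = 0.1


def model_loss(x, label):
    """Black-box query: logistic loss of the toy linear model on (x, label)."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS
    p = 1.0 / (1.0 + math.exp(-z))
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # avoid log(0)
    return -math.log(p) if label == 1 else -math.log(1.0 - p)


def random_direction_attack(x, label, eps=0.3, step=0.05, iters=200, seed=0):
    """Hill climbing over random sign directions within an L-infinity ball.

    Each iteration moves every coordinate by +step or -step at random,
    clips the result to [x - eps, x + eps], and keeps the candidate only
    if the queried loss increases. No gradients are used.
    """
    rng = random.Random(seed)
    best = list(x)
    best_loss = model_loss(best, label)
    for _ in range(iters):
        candidate = []
        for original, current in zip(x, best):
            moved = current + step * rng.choice((-1.0, 1.0))
            candidate.append(min(max(moved, original - eps), original + eps))
        candidate_loss = model_loss(candidate, label)
        if candidate_loss > best_loss:
            best, best_loss = candidate, candidate_loss
    return best, best_loss


if __name__ == "__main__":
    clean = [1.0, -0.5, 0.2]
    print("clean loss:", model_loss(clean, 1))
    adversarial, loss = random_direction_attack(clean, 1)
    print("adversarial input:", adversarial)
    print("adversarial loss:", loss)
```

Because a candidate is accepted only when the loss rises, the returned loss is never lower than the clean loss, and the clipping keeps the perturbation inside the chosen budget.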
