"speech": models, code, and papers

Analysis of Dominant Classes in Universal Adversarial Perturbations

Dec 28, 2020
Jon Vadillo, Roberto Santana, Jose A. Lozano

The reasons why Deep Neural Networks are susceptible to being fooled by adversarial examples remain an open discussion. Indeed, many different strategies can be employed to efficiently generate adversarial attacks, some of them relying on different theoretical justifications. Among these strategies, universal (input-agnostic) perturbations are of particular interest, due to their capability to fool a network independently of the input to which the perturbation is applied. In this work, we investigate an intriguing phenomenon of universal perturbations, which has been reported previously in the literature, yet without a proven justification: universal perturbations change the predicted classes for most inputs into one particular (dominant) class, even if this behavior is not specified during the creation of the perturbation. In order to justify the cause of this phenomenon, we propose a number of hypotheses and experimentally test them using a speech command classification problem in the audio domain as a testbed. Our analyses reveal interesting properties of universal perturbations, suggest new methods to generate such attacks, and provide an explanation of dominant classes from both a geometric and a data-feature perspective.

* 20 pages, 10 figures, 4 tables 
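
The dominant-class effect described in the abstract above can be measured with a simple count. The following is a minimal sketch, not the authors' code: the function name `dominant_class` and the toy speech-command labels are assumptions made for illustration. It takes a model's predictions before and after a universal perturbation is applied and reports which class absorbs most of the changed predictions.

```python
from collections import Counter

def dominant_class(clean_preds, perturbed_preds):
    """Return (label, share) of the most frequent class among fooled inputs.

    An input counts as fooled when its prediction changes after the
    perturbation is added. Hypothetical helper, for illustration only.
    """
    fooled = [p for c, p in zip(clean_preds, perturbed_preds) if c != p]
    if not fooled:
        return None, 0.0
    label, count = Counter(fooled).most_common(1)[0]
    return label, count / len(fooled)

# Toy predictions (made-up labels, not data from the paper).
clean = ["yes", "no", "up", "down", "left", "right"]
perturbed = ["stop", "stop", "up", "stop", "go", "stop"]

label, share = dominant_class(clean, perturbed)
print(label, share)
```

In this toy data, five of the six inputs change their prediction, and four of those five move to the same class, which is the kind of concentration the abstract refers to as a dominant class.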


Incorporating Domain Knowledge To Improve Topic Segmentation Of Long MOOC Lecture Videos

Dec 08, 2020
Ananda Das, Partha Pratim Das

Topic segmentation plays an important role in reducing the search space of the topics taught in a lecture video, especially when the video metadata lacks topic-wise segmentation information. This segmentation information eases the user's effort in searching, locating, and browsing a topic inside a lecture video. In this work we propose an algorithm that combines a state-of-the-art language model and a domain knowledge graph to automatically detect the different coherent topics present in a long lecture video. We use the language model on the speech-to-text transcription to capture the implicit meaning of the whole video, while the knowledge graph provides the domain-specific dependencies between the different concepts of the subject. By also leveraging the domain knowledge, we can capture the way the instructor binds and connects different concepts while teaching, which helps us achieve better segmentation accuracy. We tested our approach on NPTEL lecture videos, and a holistic evaluation shows that it outperforms the other methods described in the literature.


The design and implementation of Language Learning Chatbot with XAI using Ontology and Transfer Learning

Sep 29, 2020
Nuobei Shi, Qin Zeng, Raymond Lee

In this paper, we propose a transfer learning-based English language learning chatbot whose output, generated by GPT-2, can be explained by a corresponding ontology graph rooted in the fine-tuning dataset. We design three levels for systematic English learning: a phonetics level for speech recognition and pronunciation correction, a semantic level for domain-specific conversation, and the simulation of free-style conversation in English, the highest level of language chatbot communication as a free-style conversation agent. As an academic contribution, we implement the ontology graph to explain the performance of free-style conversation, following the concept of XAI (Explainable Artificial Intelligence) to visualize the connections of the neural network in bionics and to explain the output sentences of the language model. From an implementation perspective, our language learning agent integrates a WeChat mini-program as the front end and a fine-tuned GPT-2 transfer learning model as the back end, with responses interpreted through the ontology graph.

* Dhinaharan Nagamalai et al. (Eds): CSEIT, WiMoNe, NCS, CIoT, CMLA, DMSE, NLPD - 2020 pp. 305-323, 2020. CS & IT - CSCP 2020 
* 19 pages, 20 figures, published paper in International Conference on NLP & Big Data (NLPD 2020) 


Speaker Diarization as a Fully Online Learning Problem in MiniVox

Jun 08, 2020
Baihan Lin, Xinxin Zhang

We proposed a novel AI framework to conduct real-time multi-speaker diarization and recognition without prior registration and pretraining in a fully online learning setting. Our contributions are two-fold. First, we proposed a new benchmark to evaluate the rarely studied fully online speaker diarization problem. We built upon existing datasets of real-world utterances to automatically curate MiniVox, an experimental environment which generates infinite configurations of continuous multi-speaker speech streams. Secondly, we considered the practical problem of online learning with episodically revealed rewards and introduced a solution based on semi-supervised and self-supervised learning methods. Lastly, we provided a workable web-based recognition system which interactively handles the cold-start problem of adding new users by transferring representations of old arms to new ones with an extendable contextual bandit. We demonstrated that our proposed method obtained robust performance in the online MiniVox framework.


Wake Word Detection with Alignment-Free Lattice-Free MMI

May 25, 2020
Yiming Wang, Hang Lv, Daniel Povey, Lei Xie, Sanjeev Khudanpur

Always-on spoken language interfaces, e.g. personal digital assistants, rely on a wake word to start processing spoken input. We present novel methods to train a hybrid DNN/HMM wake word detection system from partially labeled training data, and to use it in on-line applications: (i) we remove the prerequisite of frame-level alignments in the LF-MMI training algorithm, permitting the use of un-transcribed training examples that are annotated only for the presence/absence of the wake word; (ii) we show that the classical keyword/filler model must be supplemented with an explicit non-speech (silence) model for good performance; (iii) we present an FST-based decoder to perform online detection. We evaluate our methods on two real data sets, showing a 50%-90% reduction in false rejection rates at pre-specified false alarm rates over the best previously published figures, and re-validate them on a third (large) data set.

* Submitted to Interspeech 2020. 5 pages, 3 figures 


Vector-Quantized Autoregressive Predictive Coding

May 17, 2020
Yu-An Chung, Hao Tang, James Glass

Autoregressive Predictive Coding (APC), as a self-supervised objective, has enjoyed success in learning representations from large amounts of unlabeled data, and the learned representations are rich for many downstream tasks. However, the connection between low self-supervised loss and strong performance in downstream tasks remains unclear. In this work, we propose Vector-Quantized Autoregressive Predictive Coding (VQ-APC), a novel model that produces quantized representations, allowing us to explicitly control the amount of information encoded in the representations. By studying a sequence of increasingly limited models, we reveal the constituents of the learned representations. In particular, we confirm the presence of information with probing tasks, while showing the absence of information with mutual information, uncovering the model's preference in preserving speech information as its capacity becomes constrained. We find that there exists a point where phonetic and speaker information are amplified to maximize a self-supervised objective. As a byproduct, the learned codes for a particular model capacity correspond well to English phones.
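
To illustrate what "quantized representations" means in the abstract above, here is a minimal sketch of plain nearest-neighbor vector quantization. It is an assumption for illustration and not necessarily the mechanism used in VQ-APC: the function `quantize` and the toy codebook are hypothetical. The point is only that a continuous vector is replaced by one of a fixed number of codes, so the codebook size bounds how much information a representation can carry.

```python
def quantize(vector, codebook):
    """Map a vector to the index and value of its nearest codebook entry.

    Uses squared Euclidean distance. Hypothetical helper, for illustration.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))
    return index, codebook[index]

# A toy codebook with three 2-D codes (made-up values).
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

index, code = quantize((0.9, 0.8), codebook)
print(index, code)
```

With three codes, each quantized frame carries at most log2(3) bits; shrinking the codebook is one way to obtain the "sequence of increasingly limited models" the abstract mentions.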


A Framework for the Computational Linguistic Analysis of Dehumanization

Mar 06, 2020
Julia Mendelsohn, Yulia Tsvetkov, Dan Jurafsky

Dehumanization is a pernicious psychological process that often leads to extreme intergroup bias, hate speech, and violence aimed at targeted social groups. Despite these serious consequences and the wealth of available data, dehumanization has not yet been computationally studied on a large scale. Drawing upon social psychology research, we create a computational linguistic framework for analyzing dehumanizing language by identifying linguistic correlates of salient components of dehumanization. We then apply this framework to analyze discussions of LGBTQ people in the New York Times from 1986 to 2015. Overall, we find increasingly humanizing descriptions of LGBTQ people over time. However, we find that the label homosexual has emerged to be much more strongly associated with dehumanizing attitudes than other labels, such as gay. Our proposed techniques highlight processes of linguistic variation and change in discourses surrounding marginalized groups. Furthermore, the ability to analyze dehumanizing language at a large scale has implications for automatically detecting and understanding media bias as well as abusive language online.

* 30 pages, 8 figures (Appendix is 3 pages, 2 figures). Submitted to Frontiers in Artificial Intelligence (Language and Computation) 


Phrase-Level Class based Language Model for Mandarin Smart Speaker Query Recognition

Sep 02, 2019
Yiheng Huang, Liqiang He, Lei Han, Guangsen Wang, Dan Su

The success of speech assistants requires precise recognition of a number of entities in particular contexts. A common solution is to train a class-based n-gram language model and then expand the classes into specific words or phrases. However, when the class has a huge list, e.g., more than 20 million songs, a full expansion will cause a memory explosion. Worse still, the list items in the class need to be updated frequently, which requires a dynamic model updating technique. In this work, we propose to train pruned language models for the word classes to replace the slots in the root n-gram. We further propose to use a novel technique, named Difference Language Model (DLM), to correct the bias from the pruned language models. Once the decoding graph is built, we only need to recalculate the DLM when the entities in word classes are updated. Results show that the proposed method consistently and significantly outperforms the conventional approaches on all datasets, especially for large lists, which the conventional approaches cannot handle.

* 5 pages, 3 figures and 3 tables 
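
The conventional class-based n-gram that the abstract above starts from can be sketched as follows. This is a minimal illustration of the standard factorization P(word | history) = P(class | history) × P(word | class), not the paper's pruned models or its Difference Language Model; the `<SONG>` slot, the probabilities, and the function `word_prob` are all made up for the example.

```python
# Root bigram over words and class slots (toy probabilities).
root_bigram = {
    ("play", "<SONG>"): 0.6,
    ("play", "music"): 0.4,
}

# Per-class distribution over member entities (toy probabilities).
class_members = {
    "<SONG>": {"yesterday": 0.7, "imagine": 0.3},
}

def word_prob(history, word):
    """P(word | history) under a class-based bigram.

    Sums the direct word probability and, for every class slot,
    P(class | history) * P(word | class). Hypothetical helper.
    """
    p = root_bigram.get((history, word), 0.0)
    for cls, members in class_members.items():
        p += root_bigram.get((history, cls), 0.0) * members.get(word, 0.0)
    return p

p_song = word_prob("play", "yesterday")
p_word = word_prob("play", "music")
print(p_song, p_word)
```

Keeping the member list in a separate table is what makes updates cheap in principle; the memory problem described in the abstract arises when such a table with millions of entries is fully expanded into the decoding graph.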


From Text to Sound: A Preliminary Study on Retrieving Sound Effects to Radio Stories

Aug 20, 2019
Songwei Ge, Curtis Xuan, Ruihua Song, Chao Zou, Wei Liu, Jin Zhou

Sound effects play an essential role in producing high-quality radio stories but are labor-intensive to add. In this paper, we address the problem of automatically adding sound effects to radio stories with a retrieval-based model. However, directly implementing a tag-based retrieval model leads to many false positives due to the ambiguity of story contents. To solve this problem, we introduce a retrieval-based framework hybridized with a semantic inference model, which helps to achieve robust retrieval results. Our model relies on carefully designed features extracted from the context of candidate triggers. We collect two story dubbing datasets through crowdsourcing to analyze the setting of adding sound effects and to train and test our proposed methods. We further discuss the importance of each feature and introduce several heuristic rules for the trade-off between precision and recall. Together with text-to-speech technology, our results reveal a promising automatic pipeline for producing high-quality radio stories.

* In the Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019) 
