"speech": models, code, and papers

On the long-term learning ability of LSTM LMs

Jun 16, 2021
Wim Boes, Robbe Van Rompaey, Lyan Verwimp, Joris Pelemans, Hugo Van hamme, Patrick Wambacq

We inspect the long-term learning ability of Long Short-Term Memory language models (LSTM LMs) by evaluating a contextual extension based on the Continuous Bag-of-Words (CBOW) model for both sentence- and discourse-level LSTM LMs and by analyzing its performance. We evaluate on text and speech. Sentence-level models using the long-term contextual module perform comparably to vanilla discourse-level LSTM LMs. On the other hand, the extension does not provide gains for discourse-level models. These findings indicate that discourse-level LSTM LMs already rely on contextual information to perform long-term learning.

* ESANN 2020 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (2020) 625-630 


ADVISER: A Toolkit for Developing Multi-modal, Multi-domain and Socially-engaged Conversational Agents

May 04, 2020
Chia-Yu Li, Daniel Ortega, Dirk Väth, Florian Lux, Lindsey Vanderlyn, Maximilian Schmidt, Michael Neumann, Moritz Völkel, Pavel Denisov, Sabrina Jenne, Zorica Kacarevic, Ngoc Thang Vu

We present ADVISER - an open-source, multi-domain dialog system toolkit that enables the development of multi-modal (incorporating speech, text and vision), socially-engaged (e.g. emotion recognition, engagement level prediction and backchanneling) conversational agents. The final Python-based implementation of our toolkit is flexible, easy to use, and easy to extend not only for technically experienced users, such as machine learning researchers, but also for less technically experienced users, such as linguists or cognitive scientists, thereby providing a flexible platform for collaborative research. Link to open-source code:

* All authors contributed equally. Accepted to be presented at ACL - System demonstrations - 2020 


AlignNet: A Unifying Approach to Audio-Visual Alignment

Feb 12, 2020
Jianren Wang, Zhaoyuan Fang, Hang Zhao

We present AlignNet, a model that synchronizes videos with reference audios under non-uniform and irregular misalignments. AlignNet learns the end-to-end dense correspondence between each frame of a video and an audio. Our method is designed according to simple and well-established principles: attention, pyramidal processing, warping, and affinity function. Together with the model, we release a dancing dataset Dance50 for training and evaluation. Qualitative, quantitative and subjective evaluation results on dance-music alignment and speech-lip alignment demonstrate that our method far outperforms the state-of-the-art methods. Project video and code are available at

* WACV2020


Compressive Transformers for Long-Range Sequence Modelling

Nov 13, 2019
Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Timothy P. Lillicrap

We present the Compressive Transformer, an attentive sequence model which compresses past memories for long-range sequence learning. We find the Compressive Transformer obtains state-of-the-art language modelling results in the WikiText-103 and Enwik8 benchmarks, achieving 17.1 ppl and 0.97 bpc respectively. We also find it can model high-frequency speech effectively and can be used as a memory mechanism for RL, demonstrated on an object matching task. To promote the domain of long-range sequence learning, we propose a new open-vocabulary language modelling benchmark derived from books, PG-19.

* 19 pages, 6 figures, 10 tables 


Recurrent Neural Networks with Stochastic Layers for Acoustic Novelty Detection

Feb 13, 2019
Duong Nguyen, Oliver S. Kirsebom, Fábio Frazão, Ronan Fablet, Stan Matwin

In this paper, we adapt Recurrent Neural Networks with Stochastic Layers, which are the state of the art for generating text, music and speech, to the problem of acoustic novelty detection. By integrating uncertainty into the hidden states, this type of network is able to learn the distribution of complex sequences. Because the learned distribution can be calculated explicitly in terms of probability, we can evaluate how likely an observation is and then flag low-probability events as novel. The model is robust, highly unsupervised, end-to-end and requires minimal preprocessing, feature engineering or hyperparameter tuning. An experiment on a benchmark dataset shows that our model outperforms the state-of-the-art acoustic novelty detectors.

* Accepted to ICASSP 2019 
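
The detection rule described in the abstract above (score each observation by its probability under a learned distribution, then flag low-probability events) can be sketched as follows. This is only an illustration under stated assumptions: a one-dimensional Gaussian stands in for the paper's recurrent model, and the names `gaussian_log_prob`, `detect_novel` and the threshold value are hypothetical, not taken from the paper.

```python
import math


def gaussian_log_prob(x, mean, std):
    """Log-density of x under a 1-D Gaussian (a stand-in for the learned model)."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def detect_novel(log_probs, threshold):
    """Flag every observation whose log-probability falls below the threshold."""
    return [lp < threshold for lp in log_probs]


# Toy data: two ordinary observations and one far from the mean.
observations = [0.1, -0.2, 5.0]
scores = [gaussian_log_prob(x, mean=0.0, std=1.0) for x in observations]

# The threshold is an assumption; in practice it might be tuned on held-out data.
flags = detect_novel(scores, threshold=-5.0)
```

Only the last observation is flagged here, because its log-density lies far below the chosen threshold.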


Hierarchical Neural Network Architecture In Keyword Spotting

Nov 06, 2018
Yixiao Qu, Sihao Xue, Zhenyi Ying, Hang Zhou, Jue Sun

Keyword Spotting (KWS) provides the start signal for ASR, so it is essential to ensure a high recall rate. However, its real-time nature requires low computational complexity. This tension motivates the search for a model that is small yet performs well in multiple environments. To address it, we implement the Hierarchical Neural Network (HNN), which has proved effective in many speech recognition problems. HNN outperforms traditional DNN and CNN models even though its model size and computational complexity are slightly lower. Also, its simple topology makes it easy to deploy on any device.

* To be submitted in part to IEEE ICASSP 2019 


Generalizing Word Embeddings using Bag of Subwords

Sep 12, 2018
Jinman Zhao, Sidharth Mudgal, Yingyu Liang

We approach the problem of generalizing pre-trained word embeddings beyond fixed-size vocabularies without using additional contextual information. We propose a subword-level word vector generation model that views words as bags of character $n$-grams. The model is simple, fast to train and provides good vectors for rare or unseen words. Experiments show that our model achieves state-of-the-art performance on the English word similarity task and on joint prediction of part-of-speech tags and morphosyntactic attributes in 23 languages, suggesting that our model captures the relationship between words' textual representations and their embeddings.

* Accepted to EMNLP 2018 
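
The idea of viewing a word as a bag of character n-grams, as described in the abstract above, can be sketched roughly like this. The boundary markers `<` and `>`, the n-gram length range, the averaging step and the toy vector table are all assumptions made for illustration; the abstract does not specify them, and the function names are hypothetical.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return all character n-grams of the word, with assumed boundary markers."""
    marked = f"<{word}>"
    return [
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    ]


def embed_word(word, ngram_vectors, dim, n_min=3, n_max=6):
    """Average the vectors of the word's known n-grams (zero vector if none are known)."""
    grams = [g for g in char_ngrams(word, n_min, n_max) if g in ngram_vectors]
    if not grams:
        return [0.0] * dim
    return [sum(ngram_vectors[g][k] for g in grams) / len(grams) for k in range(dim)]


# Toy n-gram table (hypothetical values); a real table would be learned.
toy_vectors = {"<ca": [1.0, 0.0], "at>": [0.0, 1.0]}
vector = embed_word("cat", toy_vectors, dim=2, n_min=3, n_max=3)
```

Because the vector is built from subword pieces, a word that never appeared in the training vocabulary still receives a representation as long as some of its n-grams are known.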


Classroom Video Assessment and Retrieval via Multiple Instance Learning

Mar 25, 2014
Qifeng Qiao, Peter A. Beling

We propose a multiple instance learning approach to content-based retrieval of classroom video for the purpose of supporting human assessment of the learning environment. The key element of our approach is a mapping between the semantic concepts of the assessment system and features of the video that can be measured using techniques from the fields of computer vision and speech analysis. We report on a formative experiment in content-based video retrieval involving trained experts in the Classroom Assessment Scoring System, a widely used framework for assessment and improvement of learning environments. The results of this experiment suggest that our approach has potential application to productivity enhancement in assessment and to broader retrieval tasks.

* The 14th International Conference on Artificial Intelligence in Education 2011 


Improving neural networks by preventing co-adaptation of feature detectors

Jul 03, 2012
Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, Ruslan R. Salakhutdinov

When a large feedforward neural network is trained on a small training set, it typically performs poorly on held-out test data. This "overfitting" is greatly reduced by randomly omitting half of the feature detectors on each training case. This prevents complex co-adaptations in which a feature detector is only helpful in the context of several other specific feature detectors. Instead, each neuron learns to detect a feature that is generally helpful for producing the correct answer given the combinatorially large variety of internal contexts in which it must operate. Random "dropout" gives big improvements on many benchmark tasks and sets new records for speech and object recognition.
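
The random omission of feature detectors described in the abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `dropout` is hypothetical, and the test-time scaling by the keep probability is an assumption that the abstract itself does not state.

```python
import random


def dropout(activations, drop_prob=0.5, training=True, rng=None):
    """Zero each activation with probability drop_prob while training.

    At test time (an assumption here), every unit is kept and its output is
    scaled by the keep probability so the expected magnitude matches training.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    rng = rng or random.Random()
    if training:
        return [0.0 if rng.random() < drop_prob else a for a in activations]
    return [a * (1.0 - drop_prob) for a in activations]


hidden = [1.0, 2.0, 3.0, 4.0]
train_out = dropout(hidden, rng=random.Random(0))  # some units zeroed at random
test_out = dropout(hidden, training=False)         # all units kept, scaled by 0.5
```

Since a different random subset of units is dropped on every training case, no unit can rely on specific other units being present, which is the co-adaptation the abstract says dropout prevents.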


A Multimodal Biometric System Using Linear Discriminant Analysis For Improved Performance

Jan 18, 2012
Aamir Khan, Muhammad Farhan, Aasim Khurshid, Adeel Akram

Essentially, a biometric system is a pattern recognition system that recognizes a user by determining the authenticity of a specific anatomical or behavioral characteristic possessed by the user. With the ever-increasing integration of computers and the Internet into daily life, it has become necessary to protect sensitive and personal data. This paper proposes a multimodal biometric system that incorporates more than one biometric trait to attain higher security and to handle failure-to-enroll situations for some users. It investigates a multimodal biometric identity system that uses Linear Discriminant Analysis as the backbone of both facial and speech recognition, and implements such a system in real time using SignalWAVE.

* IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 6, No 2, 2011, 122-127 
