
"speech": models, code, and papers

AlignNet: A Unifying Approach to Audio-Visual Alignment

Feb 12, 2020
Jianren Wang, Zhaoyuan Fang, Hang Zhao

We present AlignNet, a model that synchronizes videos with reference audio under non-uniform and irregular misalignment. AlignNet learns an end-to-end dense correspondence between each frame of a video and the audio. Our method is designed according to simple and well-established principles: attention, pyramidal processing, warping, and an affinity function. Together with the model, we release Dance50, a dancing dataset for training and evaluation. Qualitative, quantitative, and subjective evaluations on dance-music alignment and speech-lip alignment demonstrate that our method far outperforms the state-of-the-art methods.

* WACV2020. Project video and code are available at 


Compressive Transformers for Long-Range Sequence Modelling

Nov 13, 2019
Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Timothy P. Lillicrap

We present the Compressive Transformer, an attentive sequence model which compresses past memories for long-range sequence learning. We find the Compressive Transformer obtains state-of-the-art language modelling results in the WikiText-103 and Enwik8 benchmarks, achieving 17.1 ppl and 0.97 bpc respectively. We also find it can model high-frequency speech effectively and can be used as a memory mechanism for RL, demonstrated on an object matching task. To promote the domain of long-range sequence learning, we propose a new open-vocabulary language modelling benchmark derived from books, PG-19.
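The core idea, compressing the oldest memories instead of discarding them, can be sketched as follows. Mean-pooling is one of the simpler compression functions the paper considers; the memory layout and compression rate here are illustrative rather than the paper's exact configuration.

```python
import numpy as np

def compress_oldest(memory, n_oldest, rate=3):
    """Mean-pool the n_oldest slots of a FIFO memory in groups of `rate`,
    returning (compressed_memory, remaining_memory).

    Each group of `rate` old slots collapses into one compressed slot, so
    attention over the distant past costs `rate` times less.
    """
    old, rest = memory[:n_oldest], memory[n_oldest:]
    n_groups = n_oldest // rate
    groups = old[:n_groups * rate].reshape(n_groups, rate, -1)
    return groups.mean(axis=1), rest
```

With a 6-slot memory of 2-dimensional activations and `rate=3`, the six oldest slots compress into two.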

* 19 pages, 6 figures, 10 tables 


Recurrent Neural Networks with Stochastic Layers for Acoustic Novelty Detection

Feb 13, 2019
Duong Nguyen, Oliver S. Kirsebom, Fábio Frazão, Ronan Fablet, Stan Matwin

In this paper, we adapt Recurrent Neural Networks with Stochastic Layers, the state of the art for generating text, music, and speech, to the problem of acoustic novelty detection. By integrating uncertainty into the hidden states, this type of network can learn the distribution of complex sequences. Because the learned distribution can be evaluated explicitly as a probability, we can measure how likely an observation is and then flag low-probability events as novel. The model is robust, unsupervised, and end-to-end, and requires minimal preprocessing, feature engineering, or hyperparameter tuning. An experiment on a benchmark dataset shows that our model outperforms the state-of-the-art acoustic novelty detectors.
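The detection rule described above reduces to thresholding the model's likelihood. A minimal sketch, assuming the trained stochastic RNN already supplies per-frame log-likelihoods (the model itself is not reproduced here), with the threshold set from a low quantile of the training data:

```python
import numpy as np

def novelty_threshold(train_loglik, quantile=0.01):
    """Set the decision threshold from log-likelihoods of normal training
    data: anything less likely than the bottom quantile will be flagged."""
    return np.quantile(train_loglik, quantile)

def detect_novel(loglik, threshold):
    """Flag observations the learned model considers too unlikely."""
    return loglik < threshold
```

The quantile is a hypothetical knob, not a value from the paper.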

* Accepted to ICASSP 2019 


Hierarchical Neural Network Architecture In Keyword Spotting

Nov 06, 2018
Yixiao Qu, Sihao Xue, Zhenyi Ying, Hang Zhou, Jue Sun

Keyword Spotting (KWS) provides the start signal for an ASR system, so a high recall rate is essential. At the same time, it must run in real time, which demands low computational complexity. This tension motivates the search for a model that is small enough for real-time use yet performs well across environments. To resolve it, we implement a Hierarchical Neural Network (HNN), which has proven effective in many speech recognition problems. The HNN outperforms traditional DNNs and CNNs even though its model size and computational cost are slightly lower, and its simple topology makes it easy to deploy on any device.
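The abstract does not detail the HNN topology, so the following is only a generic illustration of the recall-versus-compute trade-off it describes: a small, permissively thresholded model gates which frames reach a larger one. The function names and thresholds are hypothetical.

```python
def cascade_kws(frames, cheap_score, expensive_score, gate=0.3, accept=0.8):
    """Two-stage keyword detection: a cheap first-stage score with a
    permissive threshold decides which frames reach the expensive model.

    Average compute stays low because most frames never hit the second
    stage, while the permissive gate protects recall.
    """
    hits = []
    for i, frame in enumerate(frames):
        if cheap_score(frame) >= gate and expensive_score(frame) >= accept:
            hits.append(i)
    return hits
```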

* To be submitted in part to IEEE ICASSP 2019 


Generalizing Word Embeddings using Bag of Subwords

Sep 12, 2018
Jinman Zhao, Sidharth Mudgal, Yingyu Liang

We approach the problem of generalizing pre-trained word embeddings beyond fixed-size vocabularies without using additional contextual information. We propose a subword-level word vector generation model that views words as bags of character $n$-grams. The model is simple, fast to train, and provides good vectors for rare or unseen words. Experiments show that our model achieves state-of-the-art performance on an English word similarity task and on joint prediction of part-of-speech tags and morphosyntactic attributes in 23 languages, suggesting the model's ability to capture the relationship between words' textual representations and their embeddings.
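The bag-of-character-n-grams view is easy to make concrete. The sketch below uses fastText-style '<' and '>' boundary markers and an n-gram range of 3 to 6; the paper's exact settings may differ, and averaging is just one simple way to aggregate the n-gram vectors into a word vector.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """All character n-grams of a word, with '<' and '>' marking
    the word boundaries so prefixes and suffixes are distinguishable."""
    w = "<" + word + ">"
    return [w[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

def subword_vector(word, ngram_vecs, dim):
    """Embed a word (possibly unseen) as the average of the vectors of
    its known n-grams; unknown n-grams are simply skipped."""
    grams = [g for g in char_ngrams(word) if g in ngram_vecs]
    if not grams:
        return [0.0] * dim
    return [sum(ngram_vecs[g][k] for g in grams) / len(grams)
            for k in range(dim)]
```

Because the vector is assembled from n-grams, a word absent from the training vocabulary still gets a meaningful embedding as long as some of its n-grams were seen.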

* Accepted to EMNLP 2018 


Classroom Video Assessment and Retrieval via Multiple Instance Learning

Mar 25, 2014
Qifeng Qiao, Peter A. Beling

We propose a multiple instance learning approach to content-based retrieval of classroom video for the purpose of supporting human assessment of the learning environment. The key element of our approach is a mapping between the semantic concepts of the assessment system and features of the video that can be measured using techniques from computer vision and speech analysis. We report on a formative experiment in content-based video retrieval involving trained experts in the Classroom Assessment Scoring System, a widely used framework for the assessment and improvement of learning environments. The results suggest that our approach has potential application to productivity enhancement in assessment and to broader retrieval tasks.
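Under the standard multiple instance learning assumption, a bag is positive if any instance in it is positive. A minimal retrieval sketch in that spirit, where each video is a bag of per-segment relevance scores; the bag names and scores are hypothetical, and the paper's actual learner is not reproduced here.

```python
def bag_score(instance_scores):
    """Standard MIL scoring: a bag (video) is as relevant as its best
    instance (segment)."""
    return max(instance_scores)

def retrieve(bags, threshold=0.5):
    """Return names of videos whose best segment clears the threshold."""
    return [name for name, scores in bags.items()
            if bag_score(scores) >= threshold]
```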

* The 14th International Conference on Artificial Intelligence in Education 2011 


Improving neural networks by preventing co-adaptation of feature detectors

Jul 03, 2012
Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, Ruslan R. Salakhutdinov

When a large feedforward neural network is trained on a small training set, it typically performs poorly on held-out test data. This "overfitting" is greatly reduced by randomly omitting half of the feature detectors on each training case. This prevents complex co-adaptations in which a feature detector is only helpful in the context of several other specific feature detectors. Instead, each neuron learns to detect a feature that is generally helpful for producing the correct answer given the combinatorially large variety of internal contexts in which it must operate. Random "dropout" gives big improvements on many benchmark tasks and sets new records for speech and object recognition.
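The procedure described above is the now-standard dropout. A minimal sketch using the "inverted" formulation, which scales the surviving activations at training time so the test-time forward pass needs no correction (the paper's original presentation instead halves the outgoing weights at test time):

```python
import numpy as np

def dropout(activations, p_drop=0.5, rng=None, train=True):
    """Inverted dropout: randomly zero each unit with probability p_drop.

    Scaling survivors by 1/(1 - p_drop) keeps the expected activation
    unchanged, so at test time the full network is used as-is.
    """
    if not train or p_drop == 0.0:
        return activations
    rng = rng if rng is not None else np.random.default_rng()
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)
```

With `p_drop=0.5`, each surviving unit is doubled, so on average the layer's output magnitude matches the test-time pass.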


A Multimodal Biometric System Using Linear Discriminant Analysis For Improved Performance

Jan 18, 2012
Aamir Khan, Muhammad Farhan, Aasim Khurshid, Adeel Akram

A biometric system is essentially a pattern recognition system that recognizes a user by determining the authenticity of a specific anatomical or behavioral characteristic the user possesses. With the ever-increasing integration of computers and the Internet into daily life, it has become necessary to protect sensitive and personal data. This paper proposes a multimodal biometric system that incorporates more than one biometric trait to attain higher security and to handle failure-to-enroll situations for some users. It investigates a multimodal biometric identity system that uses Linear Discriminant Analysis as the backbone of both facial and speech recognition, and implements such a system in real time using SignalWAVE.
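The two-class Fisher discriminant at the heart of LDA can be sketched directly. The fusion of face and speech features and the SignalWAVE integration are omitted, and the small regularization term is an implementation convenience, not something from the paper.

```python
import numpy as np

def fisher_direction(X0, X1):
    """Fisher discriminant for two classes: project onto
    w proportional to Sw^-1 (m1 - m0), the direction that best separates
    the class means relative to the within-class scatter Sw."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    S0 = (X0 - m0).T @ (X0 - m0)          # within-class scatter, class 0
    S1 = (X1 - m1).T @ (X1 - m1)          # within-class scatter, class 1
    Sw = S0 + S1 + 1e-6 * np.eye(X0.shape[1])  # regularize for stability
    w = np.linalg.solve(Sw, m1 - m0)
    return w / np.linalg.norm(w)
```

Projecting feature vectors onto `w` reduces each sample to a single discriminant score, which a threshold then turns into an accept/reject decision.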

* IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 6, No 2, 2011, 122-127 


A Computational Memory and Processing Model for Prosody

Apr 24, 1999
Janet E. Cahn

This paper links prosody to the information in a text and how it is processed by the speaker. It describes the operation and output of LOQ, a text-to-speech implementation that includes a model of limited attention and working memory. Attentional limitations are key. Varying the attentional parameter in the simulations varies in turn what counts as given and new in a text, and therefore, the intonational contours with which it is uttered. Currently, the system produces prosody in three different styles: child-like, adult expressive, and knowledgeable. This prosody also exhibits differences within each style -- no two simulations are alike. The limited resource approach captures some of the stylistic and individual variety found in natural prosody.

* 4 pages, 5 figures 


Word Sense Disambiguation using Optimised Combinations of Knowledge Sources

Jun 22, 1998
Yorick Wilks, Mark Stevenson

Word sense disambiguation algorithms, with few exceptions, have made use of only one lexical knowledge source. We describe a system which performs unrestricted word sense disambiguation (on all content words in free text) by combining different knowledge sources: semantic preferences, dictionary definitions and subject/domain codes along with part-of-speech tags. The usefulness of these sources is optimised by means of a learning algorithm. We also describe the creation of a new sense tagged corpus by combining existing resources. Tested accuracy of our approach on this corpus exceeds 92%, demonstrating the viability of all-word disambiguation rather than restricting oneself to a small sample.
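A weighted vote is one simple way to picture combining the knowledge sources listed above; the paper instead optimises the combination with a learning algorithm, so the fixed weights and source names below are purely illustrative.

```python
from collections import defaultdict

def combine_sources(votes, weights):
    """Weighted vote over candidate senses proposed by several knowledge
    sources (e.g. semantic preferences, dictionary definitions, domain
    codes). Each source names one sense; the heaviest total wins."""
    scores = defaultdict(float)
    for source, sense in votes.items():
        scores[sense] += weights.get(source, 1.0)
    return max(scores, key=scores.get)
```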

* 7 pages, uses colacl.sty. To appear in the Proceedings of COLING-ACL '98 
