"speech": models, code, and papers

Raw Waveform-based Audio Classification Using Sample-level CNN Architectures

Dec 04, 2017
Jongpil Lee, Taejun Kim, Jiyoung Park, Juhan Nam

Music, speech, and acoustic scene sound are often handled separately in the audio domain because of their different signal characteristics. However, as the image domain advances rapidly with versatile image classification models, it is necessary to study extensible classification models in the audio domain as well. In this study, we approach this problem using two types of sample-level deep convolutional neural networks that take raw waveforms as input and use filters with small granularity. One is a basic model that consists of convolution and pooling layers. The other is an improved model that additionally has residual connections, squeeze-and-excitation modules and multi-level concatenation. We show that the sample-level models reach state-of-the-art performance levels for the three different categories of sound. Also, we visualize the filters along layers and compare the characteristics of learned filters.
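
As a rough illustration of the "basic model" described above, the following sketch stacks convolution and pooling stages directly on raw samples. The filter length of 3, the pooling size of 3, the hand-picked kernel, and the toy sine input are all assumptions made for this example; the abstract only says that the filters have small granularity.

```python
import math

def conv1d(signal, kernel, stride=1):
    """Valid 1-D cross-correlation of a list of samples with a kernel."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(0, len(signal) - k + 1, stride)
    ]

def relu(xs):
    return [max(0.0, x) for x in xs]

def max_pool1d(xs, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(xs[i:i + size]) for i in range(0, len(xs) - size + 1, size)]

def basic_block(signal, kernel, pool=3):
    """One convolution + pooling stage with a small (3-sample) filter."""
    padded = [0.0] + list(signal) + [0.0]  # "same" padding for a length-3 kernel
    return max_pool1d(relu(conv1d(padded, kernel)), pool)

# A 243-sample (3**5) toy waveform: a sine wave, not real audio.
wave = [math.sin(2 * math.pi * 5 * n / 243) for n in range(243)]
kernel = [-0.5, 1.0, -0.5]  # hand-picked filter; a trained model would learn these

x = wave
lengths = [len(x)]
for _ in range(5):
    x = basic_block(x, kernel)
    lengths.append(len(x))
print(lengths)  # [243, 81, 27, 9, 3, 1]
```

Each stage sees only three samples at a time, yet after five stages a single output value summarizes the whole 243-sample input. The residual connections, squeeze-and-excitation modules and multi-level concatenation of the improved model are not shown.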

* NIPS, Machine Learning for Audio Signal Processing Workshop (ML4Audio), 2017 

Learning Hard Alignments with Variational Inference

Nov 01, 2017
Dieterich Lawson, Chung-Cheng Chiu, George Tucker, Colin Raffel, Kevin Swersky, Navdeep Jaitly

There has recently been significant interest in hard attention models for tasks such as object recognition, visual captioning and speech recognition. Hard attention can offer benefits over soft attention such as decreased computational cost, but training hard attention models can be difficult because of the discrete latent variables they introduce. Previous work used REINFORCE and Q-learning to approach these issues, but those methods can provide high-variance gradient estimates and be slow to train. In this paper, we tackle the problem of learning hard attention for a sequential task using variational inference methods, specifically the recently introduced VIMCO and NVIL. Furthermore, we propose a novel baseline that adapts VIMCO to this setting. We demonstrate our method on a phoneme recognition task in clean and noisy environments and show that our method outperforms REINFORCE, with the difference being greater for a more complicated task.
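
To make the "high-variance gradient estimates" concrete, here is a minimal sketch of the score-function (REINFORCE) estimator for a single binary hard decision, with and without a baseline. The reward function, the sample count, and the choice of baseline are hypothetical. The baseline used here is the exact expected reward, which is only available in a toy setting, and the sketch does not implement VIMCO, NVIL, or the baseline proposed in the paper.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def reward(z):
    """Hypothetical task reward for a hard decision z in {0, 1}."""
    return 10.0 + z

def score_function_grads(theta, n_samples, baseline, rng):
    """Single-sample estimates of d/dtheta E[reward(z)], z ~ Bernoulli(sigmoid(theta))."""
    p = sigmoid(theta)
    grads = []
    for _ in range(n_samples):
        z = 1 if rng.random() < p else 0
        dlogp = z - p  # derivative of log Bernoulli(z; sigmoid(theta)) w.r.t. theta
        grads.append((reward(z) - baseline) * dlogp)
    return grads

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

theta = 0.0
p = sigmoid(theta)
true_grad = p * (1.0 - p)  # d/dtheta of E[reward] = 10 + sigmoid(theta)

plain = score_function_grads(theta, 20000, baseline=0.0, rng=random.Random(0))
with_baseline = score_function_grads(theta, 20000, baseline=10.0 + p, rng=random.Random(0))

print("true gradient:", true_grad)
print("mean estimate:", mean(plain), mean(with_baseline))
print("variance:     ", variance(plain), variance(with_baseline))
```

Both estimators target the same gradient, but subtracting a baseline close to the expected reward removes most of the variance. Constructing good baselines when the expected reward is unknown is the part that methods such as those in the abstract address.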

A Simple Text Analytics Model To Assist Literary Criticism: comparative approach and example on James Joyce against Shakespeare and the Bible

Oct 24, 2017
Renato Fabbri, Luis Henrique Garcia

Literary analysis, criticism, or studies is a highly valued field with dedicated journals and researchers, and it remains mostly within the scope of the humanities. Text analytics is the computer-aided process of deriving information from texts. In this article we describe a simple and generic model for performing literary analysis using text analytics. The method relies on statistical measures of: 1) token and sentence sizes and 2) WordNet synset features. These measures are then used in Principal Component Analysis, where the texts to be analyzed are observed against Shakespeare and the Bible, regarded as reference literature. The model is validated by analyzing selected works from James Joyce (1882-1941), one of the most important writers of the 20th century. We discuss the consistency of this approach, the reasons why we did not use other techniques (e.g. part-of-speech tagging) and the ways in which the analysis model might be adapted and enhanced.
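
The first group of measures (token and sentence sizes) can be sketched as follows. The regular-expression tokenizer, the sentence splitter, the feature names, and the toy input string are assumptions made for this example, not the authors' scripts. The WordNet synset features and the Principal Component Analysis step are omitted.

```python
import re
import statistics

def size_features(text):
    """Mean and standard deviation of token sizes (characters) and sentence sizes (tokens)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens_per_sentence = [re.findall(r"[A-Za-z']+", s) for s in sentences]
    tokens_per_sentence = [toks for toks in tokens_per_sentence if toks]
    token_sizes = [len(tok) for toks in tokens_per_sentence for tok in toks]
    sentence_sizes = [len(toks) for toks in tokens_per_sentence]
    return {
        "token_mean": statistics.mean(token_sizes),
        "token_std": statistics.pstdev(token_sizes),
        "sentence_mean": statistics.mean(sentence_sizes),
        "sentence_std": statistics.pstdev(sentence_sizes),
    }

# A toy two-sentence string, used only to exercise the function.
sample = "To be, or not to be. That is the question."
features = size_features(sample)
print(features)
```

Computing one such feature vector per text yields the kind of table that a dimensionality-reduction step could then compare across authors.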

* Anais do XX ENMC - Encontro Nacional de Modelagem Computacional e VIII ECTM - Encontro de Ciências e Tecnologia de Materiais, Nova Friburgo, RJ, October 16-19, 2017 
* Scripts and corpus in 

Keynote: Small Neural Nets Are Beautiful: Enabling Embedded Systems with Small Deep-Neural-Network Architectures

Oct 07, 2017
Forrest Iandola, Kurt Keutzer

Over the last five years, Deep Neural Nets have offered more accurate solutions to many problems in speech recognition and computer vision, and these solutions have surpassed a threshold of acceptability for many applications. As a result, Deep Neural Networks have supplanted other approaches to solving problems in these areas, and enabled many new applications. While the design of Deep Neural Nets is still something of an art form, in our work we have found basic principles of design space exploration used to develop embedded microprocessor architectures to be highly applicable to the design of Deep Neural Net architectures. In particular, we have used these design principles to create a novel Deep Neural Net called SqueezeNet that requires as little as 480 KB of storage for its model parameters. We have further integrated all these experiences to develop something of a playbook for creating small Deep Neural Nets for embedded systems.

* Keynote at Embedded Systems Week (ESWEEK) 2017 

Integrated Sequence Tagging for Medieval Latin Using Deep Representation Learning

Aug 03, 2017
Mike Kestemont, Jeroen De Gussem

In this paper we consider two sequence tagging tasks for medieval Latin: part-of-speech tagging and lemmatization. These are both basic, yet foundational preprocessing steps in applications such as text re-use detection. Nevertheless, they are generally complicated by the considerable orthographic variation which is typical of medieval Latin. In Digital Classics, these tasks are traditionally solved in a (i) cascaded and (ii) lexicon-dependent fashion. For example, a lexicon is used to generate all the potential lemma-tag pairs for a token, and next, a context-aware PoS-tagger is used to select the most appropriate tag-lemma pair. Apart from the problems with out-of-lexicon items, error percolation is a major downside of such approaches. In this paper we explore the possibility of elegantly solving these tasks using a single, integrated approach. For this, we make use of a layered neural network architecture from the field of deep representation learning.

* Journal of Data Mining & Digital Humanities, Special Issue on Computer-Aided Processing of Intertextuality in Ancient Languages, Towards a Digital Ecosystem: NLP. Corpus infrastructure. Methods for Retrieving Texts and Computing Text Similarities (August 6, 2017) jdmdh:3835 

Deep Learning for Time-Series Analysis

Jan 07, 2017
John Cristian Borges Gamboa

In many real-world applications, e.g., speech recognition or sleep stage classification, data are captured over the course of time, constituting a Time-Series. Time-Series often contain temporal dependencies that cause two otherwise identical points of time to belong to different classes or predict different behavior. This characteristic generally increases the difficulty of analysing them. Existing techniques often depended on hand-crafted features that were expensive to create and required expert knowledge of the field. With the advent of Deep Learning, new models for unsupervised learning of features for Time-Series analysis and forecasting have been developed. Such new developments are the topic of this paper: a review of the main Deep Learning techniques is presented, and some applications to Time-Series analysis are summarized. The results make it clear that Deep Learning has a lot to contribute to the field.

* Written as part of the Seminar on Collaborative Intelligence in the TU Kaiserslautern. January 2016 

STC Anti-spoofing Systems for the ASVspoof 2015 Challenge

Jul 29, 2015
Sergey Novoselov, Alexandr Kozlov, Galina Lavrentyeva, Konstantin Simonchik, Vadim Shchemelinin

This paper presents the Speech Technology Center (STC) systems submitted to the Automatic Speaker Verification Spoofing and Countermeasures (ASVspoof) Challenge 2015. In this work we investigate different acoustic feature spaces to determine reliable and robust countermeasures against spoofing attacks. In addition to the commonly used front-end MFCC features, we explored features derived from the phase spectrum and features based on applying the multiresolution wavelet transform. Similar to state-of-the-art ASV systems, we used the standard TV-JFA approach for probability modelling in spoofing detection systems. Experiments performed on the development and evaluation datasets of the Challenge demonstrate that the use of phase-related and wavelet-based features contributes substantially to the efficiency of the resulting STC systems. In our research we also focused on the comparison of the linear (SVM) and nonlinear (DBN) classifiers.

* 5 pages, 8 figures, 3 tables 

Overview of the NLPCC 2015 Shared Task: Chinese Word Segmentation and POS Tagging for Micro-blog Texts

Jun 30, 2015
Xipeng Qiu, Peng Qian, Liusong Yin, Shiyu Wu, Xuanjing Huang

In this paper, we give an overview of the shared task at the 4th CCF Conference on Natural Language Processing & Chinese Computing (NLPCC 2015): Chinese word segmentation and part-of-speech (POS) tagging for micro-blog texts. Unlike the widely used newswire datasets, the dataset of this shared task consists of relatively informal micro-texts. The shared task has two sub-tasks: (1) individual Chinese word segmentation and (2) joint Chinese word segmentation and POS tagging. Each sub-task has three tracks to distinguish systems that use different resources. We first introduce the dataset and task, then characterize the different approaches of the participating systems, report the test results, and provide an overview analysis of these results. An online system is available for open registration and evaluation at

Long Short-Term Memory Over Tree Structures

Mar 16, 2015
Xiaodan Zhu, Parinaz Sobhani, Hongyu Guo

The chain-structured long short-term memory (LSTM) has been shown to be effective in a wide range of problems such as speech recognition and machine translation. In this paper, we propose to extend it to tree structures, in which a memory cell can reflect the history memories of multiple child cells or multiple descendant cells in a recursive process. We call the model S-LSTM, which provides a principled way of considering long-distance interaction over hierarchies, e.g., language or image parse structures. We leverage the models for semantic composition to understand the meaning of text, a fundamental problem in natural language understanding, and show that it outperforms a state-of-the-art recursive model by replacing its composition layers with the S-LSTM memory blocks. We also show that utilizing the given structures is helpful in achieving a performance better than that without considering the structures.
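
The idea of a memory cell that merges the memories of several child cells can be sketched for a binary tree as below. This is a simplified illustration, not the paper's S-LSTM equations: the scalar hidden size, the gate parameterization, the weight values, the leaf states, and the names `compose` and `encode` are all assumptions made for this example.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def compose(left, right, w):
    """Merge two child (h, c) states into a parent (h, c) state."""
    (h_l, c_l), (h_r, c_r) = left, right

    def pre(name):
        # Pre-activation of one gate from both children's hidden states.
        w_left, w_right, bias = w[name]
        return w_left * h_l + w_right * h_r + bias

    i = sigmoid(pre("input"))
    f_l = sigmoid(pre("forget_left"))   # how much of the left child's memory to keep
    f_r = sigmoid(pre("forget_right"))  # how much of the right child's memory to keep
    o = sigmoid(pre("output"))
    u = math.tanh(pre("candidate"))
    c = f_l * c_l + f_r * c_r + i * u   # parent memory mixes both child memories
    h = o * math.tanh(c)
    return h, c

def encode(tree, leaves, w):
    """Recursively encode a binary tree given as nested tuples of leaf names."""
    if isinstance(tree, str):
        return leaves[tree]
    left, right = tree
    return compose(encode(left, leaves, w), encode(right, leaves, w), w)

# Hypothetical weights (left weight, right weight, bias) and leaf states (h, c).
weights = {
    "input": (0.5, -0.3, 0.1),
    "forget_left": (0.8, 0.2, 0.0),
    "forget_right": (-0.4, 0.6, 0.0),
    "output": (0.3, 0.3, 0.2),
    "candidate": (1.0, -1.0, 0.0),
}
leaves = {"a": (0.2, 0.4), "b": (-0.5, -0.1), "c": (0.7, 0.9)}

left_branching = encode((("a", "b"), "c"), leaves, weights)
right_branching = encode(("a", ("b", "c")), leaves, weights)
print(left_branching, right_branching)
```

The same three leaves produce different root states under the two bracketings, which is the sense in which the given structure informs the representation.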

* On February 6th, 2015, this work was submitted to the International Conference on Machine Learning (ICML) 

Reduplicated MWE (RMWE) helps in improving the CRF based Manipuri POS Tagger

Mar 22, 2012
Kishorjit Nongmeikapam, Lairenlakpam Nonglenjaoba, Yumnam Nirmal, Sivaji Bandyopadhyay

This paper gives a detailed overview of the modified feature selection in CRF (Conditional Random Field) based Manipuri POS (Part of Speech) tagging. Feature selection is so important in CRF that the better the features, the better the outputs. This work is an experiment aimed at making the previous work more efficient. Multiple new features are tried in the CRF, which is then run again with the Reduplicated Multiword Expression (RMWE) as an additional feature. The CRF is run with RMWE because Manipuri is rich in RMWEs, and identifying them becomes a necessity for improving POS tagging results. The new CRF system shows a Recall of 78.22%, Precision of 73.15% and F-measure of 75.60%. Identifying RMWEs and using them as a feature improves this to a Recall of 80.20%, Precision of 74.31% and F-measure of 77.14%.
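
The reported F-measures can be checked against the reported precision and recall, assuming the balanced F-measure (harmonic mean of precision and recall) is the one used. The function name `f_measure` is introduced here for illustration.

```python
def f_measure(precision, recall):
    """Balanced F-measure (harmonic mean), in the same units as the inputs."""
    return 2 * precision * recall / (precision + recall)

baseline = f_measure(precision=73.15, recall=78.22)
with_rmwe = f_measure(precision=74.31, recall=80.20)
print(f"{baseline:.2f} {with_rmwe:.2f}")  # 75.60 77.14
```

Under that assumption both figures reproduce the values quoted above, so adding the RMWE feature corresponds to a gain of about 1.5 points in F-measure.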

* 15 pages, 4 tables, 2 figures. arXiv admin note: text overlap with arXiv:1111.2399 