Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"speech": models, code, and papers

Incremental Adaptation Strategies for Neural Network Language Models

Jul 07, 2015
Aram Ter-Sarkisov, Holger Schwenk, Loic Barrault, Fethi Bougares

It is today acknowledged that neural network language models outperform backoff language models in applications like speech recognition or statistical machine translation. However, training these models on large amounts of data can take several days. We present efficient techniques to adapt a neural network language model to new data. Instead of training a completely new model or relying on mixture approaches, we propose two new methods: continued training on resampled data or insertion of adaptation layers. We present experimental results in an CAT environment where the post-edits of professional translators are used to improve an SMT system. Both methods are very fast and achieve significant improvements without overfitting the small adaptation data.

* accepted as workshop paper at ACL-IJCNLP 2015 

  Access Paper or Ask Questions

An Empirical Comparison of Parsing Methods for Stanford Dependencies

Apr 16, 2014
Lingpeng Kong, Noah A. Smith

Stanford typed dependencies are a widely desired representation of natural language sentences, but parsing is one of the major computational bottlenecks in text analysis systems. In light of the evolving definition of the Stanford dependencies and developments in statistical dependency parsing algorithms, this paper revisits the question of Cer et al. (2010): what is the tradeoff between accuracy and speed in obtaining Stanford dependencies in particular? We also explore the effects of input representations on this tradeoff: part-of-speech tags, the novel use of an alternative dependency representation as input, and distributional representaions of words. We find that direct dependency parsing is a more viable solution than it was found to be in the past. An accompanying software release can be found at:

* 13 pages, 2 figures 

  Access Paper or Ask Questions

A Novel Windowing Technique for Efficient Computation of MFCC for Speaker Recognition

Jun 12, 2012
Md. Sahidullah, Goutam Saha

In this paper, we propose a novel family of windowing technique to compute Mel Frequency Cepstral Coefficient (MFCC) for automatic speaker recognition from speech. The proposed method is based on fundamental property of discrete time Fourier transform (DTFT) related to differentiation in frequency domain. Classical windowing scheme such as Hamming window is modified to obtain derivatives of discrete time Fourier transform coefficients. It has been mathematically shown that the slope and phase of power spectrum are inherently incorporated in newly computed cepstrum. Speaker recognition systems based on our proposed family of window functions are shown to attain substantial and consistent performance improvement over baseline single tapered Hamming window as well as recently proposed multitaper windowing technique.

  Access Paper or Ask Questions

Apportioning Development Effort in a Probabilistic LR Parsing System through Evaluation

Apr 12, 1996
John Carroll, Ted Briscoe

We describe an implemented system for robust domain-independent syntactic parsing of English, using a unification-based grammar of part-of-speech and punctuation labels coupled with a probabilistic LR parser. We present evaluations of the system's performance along several different dimensions; these enable us to assess the contribution that each individual part is making to the success of the system as a whole, and thus prioritise the effort to be devoted to its further enhancement. Currently, the system is able to parse around 80% of sentences in a substantial corpus of general text containing a number of distinct genres. On a random sample of 250 such sentences the system has a mean crossing bracket rate of 0.71 and recall and precision of 83% and 84% respectively when evaluated against manually-disambiguated analyses.

* Conference on Empirical Methods in Natural Language Processing (EMNLP-96), 92-100 
* 10 pages, 1 Postscript figure. To Appear in Proceedings of the Conference on Empirical Methods in Natural Language Processing, University of Pennsylvania, May 1996 

  Access Paper or Ask Questions

Prepositional Phrase Attachment through a Backed-Off Model

Jun 22, 1995
Michael Collins, James Brooks

Recent work has considered corpus-based or statistical approaches to the problem of prepositional phrase attachment ambiguity. Typically, ambiguous verb phrases of the form {v np1 p np2} are resolved through a model which considers values of the four head words (v, n1, p and n2). This paper shows that the problem is analogous to n-gram language models in speech recognition, and that one of the most common methods for language modeling, the backed-off estimate, is applicable. Results on Wall Street Journal data of 84.5% accuracy are obtained using this method. A surprising result is the importance of low-count events - ignoring events which occur less than 5 times in training data reduces performance to 81.6%.

* To appear in Proceedings of the Third Workshop on Very Large Corpora, 12 pages, LaTeX 

  Access Paper or Ask Questions

Memory-Efficient Training of RNN-Transducer with Sampled Softmax

Mar 31, 2022
Jaesong Lee, Lukas Lee, Shinji Watanabe

RNN-Transducer has been one of promising architectures for end-to-end automatic speech recognition. Although RNN-Transducer has many advantages including its strong accuracy and streaming-friendly property, its high memory consumption during training has been a critical problem for development. In this work, we propose to apply sampled softmax to RNN-Transducer, which requires only a small subset of vocabulary during training thus saves its memory consumption. We further extend sampled softmax to optimize memory consumption for a minibatch, and employ distributions of auxiliary CTC losses for sampling vocabulary to improve model accuracy. We present experimental results on LibriSpeech, AISHELL-1, and CSJ-APS, where sampled softmax greatly reduces memory consumption and still maintains the accuracy of the baseline model.

* Submitted to INTERSPEECH 2022 

  Access Paper or Ask Questions

FLUTE: A Scalable, Extensible Framework for High-Performance Federated Learning Simulations

Mar 25, 2022
Dimitrios Dimitriadis, Mirian Hipolito Garcia, Daniel Madrigal Diaz, Andre Manoel, Robert Sim

In this paper we introduce "Federated Learning Utilities and Tools for Experimentation" (FLUTE), a high-performance open source platform for federated learning research and offline simulations. The goal of FLUTE is to enable rapid prototyping and simulation of new federated learning algorithms at scale, including novel optimization, privacy, and communications strategies. We describe the architecture of FLUTE, enabling arbitrary federated modeling schemes to be realized, we compare the platform with other state-of-the-art platforms, and we describe available features of FLUTE for experimentation in core areas of active research, such as optimization, privacy and scalability. We demonstrate the effectiveness of the platform with a series of experiments for text prediction and speech recognition, including the addition of differential privacy, quantization, scaling and a variety of optimization and federation approaches.

* 13 Pages, 5 Figures, 8 Tables 

  Access Paper or Ask Questions

Artificial Intelligence for Suicide Assessment using Audiovisual Cues: A Review

Jan 22, 2022
Sahraoui Dhelim, Liming Chen, Huansheng Ning, Chris Nugent

Death by suicide is the seventh of the leading death cause worldwide. The recent advancement in Artificial Intelligence (AI), specifically AI application in image and voice processing, has created a promising opportunity to revolutionize suicide risk assessment. Subsequently, we have witnessed fast-growing literature of researches that applies AI to extract audiovisual non-verbal cues for mental illness assessment. However, the majority of the recent works focus on depression, despite the evident difference between depression signs and suicidal behavior non-verbal cues. In this paper, we review the recent works that study suicide ideation and suicide behavior detection through audiovisual feature analysis, mainly suicidal voice/speech acoustic features analysis and suicidal visual cues.

* Manuscirpt submitted to Arificial Intelligence Reviews 

  Access Paper or Ask Questions

Applications of Recurrent Neural Network for Biometric Authentication & Anomaly Detection

Sep 13, 2021
Joseph M. Ackerson, Dave Rushit, Seliya Jim

Recurrent Neural Networks are powerful machine learning frameworks that allow for data to be saved and referenced in a temporal sequence. This opens many new possibilities in fields such as handwriting analysis and speech recognition. This paper seeks to explore current research being conducted on RNNs in four very important areas, being biometric authentication, expression recognition, anomaly detection, and applications to aircraft. This paper reviews the methodologies, purpose, results, and the benefits and drawbacks of each proposed method below. These various methodologies all focus on how they can leverage distinct RNN architectures such as the popular Long Short-Term Memory (LSTM) RNN or a Deep-Residual RNN. This paper also examines which frameworks work best in certain situations, and the advantages and disadvantages of each pro-posed model.

  Access Paper or Ask Questions

A Survey on Audio Synthesis and Audio-Visual Multimodal Processing

Aug 01, 2021
Zhaofeng Shi

With the development of deep learning and artificial intelligence, audio synthesis has a pivotal role in the area of machine learning and shows strong applicability in the industry. Meanwhile, significant efforts have been dedicated by researchers to handle multimodal tasks at present such as audio-visual multimodal processing. In this paper, we conduct a survey on audio synthesis and audio-visual multimodal processing, which helps understand current research and future trends. This review focuses on text to speech(TTS), music generation and some tasks that combine visual and acoustic information. The corresponding technical methods are comprehensively classified and introduced, and their future development trends are prospected. This survey can provide some guidance for researchers who are interested in the areas like audio synthesis and audio-visual multimodal processing.

  Access Paper or Ask Questions