
"speech": models, code, and papers

Age Group Classification with Speech and Metadata Multimodality Fusion

Mar 02, 2018
Denys Katerenchuk

Children comprise a significant proportion of TV viewers, and it is worthwhile to customize the viewing experience for them. However, identifying which audience members are children can be a challenging task. Identifying gender and age from audio commands is a well-studied problem, but achieving good accuracy remains difficult when utterances are typically only a couple of seconds long. We present initial studies of a novel method that combines utterances with user metadata. In particular, we develop an ensemble of different machine learning techniques on different subsets of data to improve child detection. Our initial results show a 9.2% absolute improvement over the baseline, leading to state-of-the-art performance.

* Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, 2017 
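
The abstract does not detail the fusion scheme, but the idea of combining classifiers trained on different data subsets can be sketched as a simple majority vote; the model names, labels, and predictions below are illustrative, not the paper's actual system:

```python
# Hypothetical sketch of ensemble fusion for child detection: classifiers
# trained on different data subsets each vote, and the majority label wins.
# The paper's actual ensemble may combine models differently.
from collections import Counter

def majority_vote(predictions):
    """predictions: list of per-classifier labels, e.g. ['child', 'adult', 'child']."""
    return Counter(predictions).most_common(1)[0][0]

# Each classifier sees a different view of the data (invented examples):
audio_pred = "child"      # e.g. an acoustic model on the short utterance
metadata_pred = "child"   # e.g. a model on user metadata
text_pred = "adult"       # e.g. a model on the ASR transcript

print(majority_vote([audio_pred, metadata_pred, text_pred]))  # child
```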

Lexical representation explains cortical entrainment during speech comprehension

Jan 10, 2018
Stefan Frank, Jinbiao Yang

Results from a recent neuroimaging study on spoken sentence comprehension have been interpreted as evidence for cortical entrainment to hierarchical syntactic structure. We present a simple computational model that predicts the power spectra from this study, even though the model's linguistic knowledge is restricted to the lexical level, and word-level representations are not combined into higher-level units (phrases or sentences). Hence, the cortical entrainment results can also be explained from the lexical properties of the stimuli, without recourse to hierarchical syntax.

* Submitted for publication 

Lip Localization and Viseme Classification for Visual Speech Recognition

Jan 19, 2013
Salah Werda, Walid Mahdi, Abdelmajid Ben Hamadou

The need for an automatic lip-reading system is ever increasing. In fact, extraction and reliable analysis of facial movements today make up an important part of many multimedia systems, such as videoconferencing, low-bandwidth communication, and lip-reading systems. In addition, visual information is imperative among people with special needs. We can imagine, for example, a dependent person commanding a machine with an easy lip movement or a simple syllable pronunciation. Moreover, people with hearing problems compensate for their special needs by lip-reading as well as listening to the person with whom they are talking.

* International Journal of Computing and Information Sciences ISSN: 1708-0460 (print) - 1708-0479 (online) Volume 5, Number 1, December 2007 
* 14 pages 

Speech Recognition: Increasing Efficiency of Support Vector Machines

Apr 19, 2012
Aamir Khan, Muhammad Farhan, Asar Ali

With the advancement of communication and security technologies, robustness of embedded biometric systems has become crucial. This paper presents the realization of such technologies, which demand reliable and error-free biometric identity verification systems. High-dimensional patterns are not permitted due to eigen-decomposition in high-dimensional feature space and degeneration of scattering matrices with small sample sizes. Generalization, dimensionality reduction, and margin maximization are controlled by minimizing weight vectors. Results show good pattern recognition by the multimodal biometric system proposed in this paper. This paper investigates a biometric identity system using Support Vector Machines (SVMs) and Linear Discriminant Analysis (LDA) with MFCCs, and implements such a system in real time using SignalWAVE.

* International Journal of Computer Applications 35(7):17-21, December 2011 
* 5 pages, 11 figures. arXiv admin note: text overlap with arXiv:1201.3720 and arXiv:1204.1177 
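
As a rough illustration of the pipeline the abstract describes (MFCC features, LDA dimensionality reduction, linear SVM decision), here is a toy sketch; the feature values, weights, and dimensions are invented, and a real system would learn them from data:

```python
# Illustrative verification pipeline: MFCC vector -> LDA projection ->
# linear SVM decision. All numbers below are made up for demonstration.

def lda_project(x, w_lda):
    # Project the feature vector onto a 1-D discriminant axis
    # (dimensionality reduction step).
    return sum(xi * wi for xi, wi in zip(x, w_lda))

def svm_decide(z, w, b):
    # Linear SVM decision on the projected feature: sign of w*z + b.
    return "accept" if w * z + b > 0 else "reject"

mfcc = [12.1, -3.4, 5.0, 0.7]        # toy 4-dim MFCC vector for one frame
w_lda = [0.2, -0.1, 0.3, 0.05]       # toy LDA direction
score = lda_project(mfcc, w_lda)
print(svm_decide(score, w=1.0, b=-2.0))  # accept
```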

Disambiguation of Super Parts of Speech (or Supertags): Almost Parsing

Oct 26, 1994
Aravind K. Joshi, B. Srinivas

In a lexicalized grammar formalism such as Lexicalized Tree-Adjoining Grammar (LTAG), each lexical item is associated with at least one elementary structure (supertag) that localizes syntactic and semantic dependencies. Thus a parser for a lexicalized grammar must search a large set of supertags to choose the right ones to combine for the parse of the sentence. We present techniques for disambiguating supertags using local information such as lexical preference and local lexical dependencies. The similarity between LTAG and Dependency grammars is exploited in the dependency model of supertag disambiguation. The performance results for various models of supertag disambiguation such as unigram, trigram and dependency-based models are presented.

* Proceedings of the 15th International Conference on Computational Linguistics (COLING 94), Kyoto, Japan, August 1994 
* ps file. 8 pages 
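
The unigram model the abstract mentions amounts to picking each word's most frequent supertag, i.e. its lexical preference; a toy sketch with invented counts and tag names:

```python
# Toy unigram supertag disambiguation: each word keeps its most frequent
# supertag (lexical preference), yielding an "almost parsed" sequence.
# Counts and supertag labels here are invented for illustration.
supertag_counts = {
    "the":  {"Det": 90},
    "dog":  {"NP_head": 70, "N_mod": 10},
    "runs": {"S\\NP": 60, "N": 5},
}

def unigram_supertag(sentence):
    return [max(supertag_counts[w], key=supertag_counts[w].get) for w in sentence]

print(unigram_supertag(["the", "dog", "runs"]))  # ['Det', 'NP_head', 'S\\NP']
```

The trigram and dependency-based models in the paper refine this by also conditioning on neighbouring supertags and local lexical dependencies.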

Word Discovery in Visually Grounded, Self-Supervised Speech Models

Mar 28, 2022
Puyuan Peng, David Harwath

We present a method for visually-grounded spoken term discovery. After training either a HuBERT or wav2vec2.0 model to associate spoken captions with natural images, we show that powerful word segmentation and clustering capability emerges within the model's self-attention heads. Our experiments reveal that this ability is not present to nearly the same extent in the base HuBERT and wav2vec2.0 models, suggesting that the visual grounding task is a crucial component of the word discovery capability we observe. We also evaluate our method on the Buckeye word segmentation and ZeroSpeech spoken term discovery tasks, where we outperform all currently published methods on several metrics.

* submitted to Interspeech 2022 
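
The paper's segmentation procedure operates on the model's self-attention heads; as a much-simplified, hedged sketch of the general idea, one can threshold per-frame salience scores and merge contiguous above-threshold runs into word segments (the scores below are invented):

```python
# Hedged sketch: frames whose (hypothetical) salience scores exceed a
# threshold form word segments; contiguous runs are merged. The paper's
# actual procedure over HuBERT/wav2vec2.0 attention heads is more involved.
def segments_from_scores(scores, threshold=0.5):
    segs, start = [], None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i                      # a segment begins
        elif s < threshold and start is not None:
            segs.append((start, i))        # a segment ends
            start = None
    if start is not None:                  # close a segment at the end
        segs.append((start, len(scores)))
    return segs

print(segments_from_scores([0.1, 0.8, 0.9, 0.2, 0.7, 0.7, 0.1]))  # [(1, 3), (4, 6)]
```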

Mixed Precision Quantization of Transformer Language Models for Speech Recognition

Nov 29, 2021
Junhao Xu, Shoukang Hu, Jianwei Yu, Xunying Liu, Helen Meng

State-of-the-art neural language models represented by Transformers are becoming increasingly complex and expensive for practical applications. Low-bit deep neural network quantization techniques provide a powerful solution to dramatically reduce their model size. Current low-bit quantization methods are based on uniform precision and fail to account for the varying performance sensitivity of different parts of the system to quantization errors. To this end, novel mixed precision DNN quantization methods are proposed in this paper. The optimal local precision settings are automatically learned using two techniques. The first is based on a quantization sensitivity metric in the form of the Hessian trace weighted quantization perturbation. The second is based on mixed precision Transformer architecture search. The alternating direction method of multipliers (ADMM) is used to efficiently train mixed precision quantized DNN systems. Experiments conducted on Penn Treebank (PTB) and on an LF-MMI TDNN system trained on the Switchboard corpus suggest that the proposed mixed precision Transformer quantization techniques achieve model size compression ratios of up to 16 times over the full precision baseline with no recognition performance degradation. When used to compress a larger full precision Transformer LM with more layers, overall word error rate (WER) reductions of up to 1.7% absolute (18% relative) were obtained.

* arXiv admin note: substantial text overlap with arXiv:2112.11438, arXiv:2111.14479 
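
The paper learns per-layer precision automatically; a hand-rolled sketch of the underlying intuition is to give layers with higher Hessian-trace sensitivity more bits. The sensitivities, layer names, and grouping heuristic below are invented for illustration:

```python
# Illustrative mixed-precision assignment: layers with higher (hypothetical)
# Hessian-trace sensitivity get more quantization bits. The paper instead
# learns these settings via a sensitivity metric and architecture search.
def assign_bits(sensitivities, choices=(2, 4, 8)):
    order = sorted(sensitivities, key=sensitivities.get)  # least sensitive first
    bits = {}
    # Split layers into roughly equal groups: least sensitive -> fewest bits.
    group = max(1, len(order) // len(choices))
    for i, layer in enumerate(order):
        bits[layer] = choices[min(i // group, len(choices) - 1)]
    return bits

sens = {"attn.0": 0.9, "ffn.0": 0.1, "attn.1": 0.5, "ffn.1": 0.05}
print(assign_bits(sens))
```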

Discrete acoustic space for an efficient sampling in neural text-to-speech

Oct 24, 2021
Marek Strelec, Jonas Rohnke, Antonio Bonafonte, Mateusz Łajszczak, Trevor Wood

We present an SVQ-VAE architecture using a split vector quantizer for NTTS, as an enhancement to the well-known VAE and VQ-VAE architectures. Compared to these previous architectures, our proposed model retains the benefits of using an utterance-level bottleneck, while reducing the associated loss of representation power. We train the model on recordings in the highly expressive task-oriented dialogue domain and show that SVQ-VAE achieves a statistically significant improvement in naturalness over the VAE and VQ-VAE models. Furthermore, we demonstrate that the SVQ-VAE acoustic space is predictable from text, reducing the gap between the standard constant vector synthesis and vocoded recordings by 32%.

* Submitted to ICASSP 2022 
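
At a high level, a split vector quantizer partitions the utterance-level embedding into chunks and quantizes each chunk against its own codebook; a toy nearest-neighbour sketch with hand-picked codebooks (in the actual model the codebooks are learned jointly with the network):

```python
# Toy split vector quantizer: the embedding is split into equal chunks,
# each quantized to its nearest codeword in its own small codebook.
# Codebooks and vectors here are invented for illustration.
def quantize_split(vec, codebooks):
    out = []
    step = len(vec) // len(codebooks)
    for k, book in enumerate(codebooks):
        chunk = vec[k * step:(k + 1) * step]
        # Nearest codeword by squared Euclidean distance.
        best = min(book, key=lambda c: sum((a - b) ** 2 for a, b in zip(chunk, c)))
        out.extend(best)
    return out

codebooks = [
    [(0.0, 0.0), (1.0, 1.0)],   # codebook for the first half
    [(0.5, 0.5), (2.0, 2.0)],   # codebook for the second half
]
print(quantize_split([0.9, 1.1, 1.8, 2.2], codebooks))  # [1.0, 1.0, 2.0, 2.0]
```

Splitting the quantizer keeps the utterance-level bottleneck while giving the combined codebook a much larger effective capacity than a single codebook of the same size.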

On Language Model Integration for RNN Transducer based Speech Recognition

Oct 13, 2021
Wei Zhou, Zuoyun Zheng, Ralf Schlüter, Hermann Ney

The mismatch between an external language model (LM) and the implicitly learned internal LM (ILM) of the RNN-Transducer (RNN-T) can limit the performance of LM integration methods such as simple shallow fusion. A Bayesian interpretation suggests removing this sequence prior as an ILM correction. In this work, we study various ILM-correction-based LM integration methods formulated in a common RNN-T framework. We provide a decoding interpretation of two major reasons for the performance improvement with ILM correction, which is further verified experimentally with detailed analysis. We also propose an exact-ILM training framework by extending the proof given for the hybrid autoregressive transducer, which enables a theoretical justification for other ILM approaches. A systematic comparison is conducted for both in-domain and cross-domain evaluation on the LibriSpeech and TED-LIUM Release 2 corpora, respectively. Our proposed exact-ILM training can further improve the best ILM method.

* submitted to ICASSP2022 
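
Shallow fusion with ILM correction typically scores a hypothesis as log p_RNN-T(y|x) + λ1·log p_LM(y) − λ2·log p_ILM(y): the external LM is added and the internal LM subtracted. A minimal sketch with invented probabilities and weights (the paper's exact formulation and tuning may differ):

```python
# Sketch of hypothesis scoring under shallow fusion with ILM correction:
#   score = log p_rnnt(y|x) + lam1 * log p_lm(y) - lam2 * log p_ilm(y)
# All probabilities and weights below are invented; in practice they come
# from the trained models and are tuned on held-out data.
import math

def fused_score(log_p_rnnt, log_p_lm, log_p_ilm, lam1=0.6, lam2=0.3):
    return log_p_rnnt + lam1 * log_p_lm - lam2 * log_p_ilm

# Two competing hypotheses: B is less likely under the acoustic model but
# much more likely under the external LM, so fusion prefers it.
hyp_a = fused_score(math.log(0.30), math.log(0.10), math.log(0.20))
hyp_b = fused_score(math.log(0.25), math.log(0.40), math.log(0.30))
print(max(("A", hyp_a), ("B", hyp_b), key=lambda t: t[1])[0])  # B
```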

Masader: Metadata Sourcing for Arabic Text and Speech Data Resources

Oct 13, 2021
Zaid Alyafeai, Maraim Masoud, Mustafa Ghaleb, Maged S. Al-shaibani

The NLP pipeline has evolved dramatically in the last few years. The first step in the pipeline is to find suitable annotated datasets for the tasks we are trying to solve. Unfortunately, most published datasets lack metadata annotations describing their attributes, and there is no public catalogue indexing all the publicly available datasets for specific regions or languages. This issue becomes even more prominent for low-resource dialectal languages. In this paper we create Masader, the largest public catalogue of Arabic NLP datasets, which consists of 200 datasets annotated with 25 attributes. Furthermore, we develop a metadata annotation strategy that can be extended to other languages. We also highlight some issues with the current status of Arabic NLP datasets and suggest recommendations to address them.
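
A per-dataset metadata record of the kind Masader annotates (25 attributes in the catalogue) might look like the following sketch; only a few fields are shown, and the field names and values are illustrative, not Masader's actual schema:

```python
# Hypothetical metadata record for one Arabic NLP dataset. The real
# catalogue annotates 25 attributes; these fields are invented examples.
from dataclasses import dataclass, asdict

@dataclass
class DatasetMetadata:
    name: str
    dialect: str       # e.g. MSA, Egyptian, Gulf
    tasks: tuple       # NLP tasks the dataset supports
    size: int          # number of examples
    license: str
    public_link: bool  # is the data publicly downloadable?

record = DatasetMetadata(
    name="ExampleArabicCorpus", dialect="MSA",
    tasks=("sentiment",), size=10_000,
    license="CC BY 4.0", public_link=True,
)
print(sorted(asdict(record)))  # the attribute names, alphabetically
```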
