Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"speech": models, code, and papers

Phonetic and Graphemic Systems for Multi-Genre Broadcast Transcription

Feb 01, 2018
Yu Wang, Xie Chen, Mark Gales, Anton Ragni, Jeremy Wong

State-of-the-art English automatic speech recognition systems typically use phonetic rather than graphemic lexicons. Graphemic systems are known to perform less well for English as the mapping from the written form to the spoken form is complicated. However, in recent years the representational power of deep-learning based acoustic models has improved, raising interest in graphemic acoustic models for English, due to the simplicity of generating the lexicon. In this paper, phonetic and graphemic models are compared for an English Multi-Genre Broadcast transcription task. A range of acoustic models based on lattice-free MMI training are constructed using phonetic and graphemic lexicons. For this task, it is found that having a long-span temporal history reduces the difference in performance between the two forms of models. In addition, system combination is examined, using parameter smoothing and hypothesis combination. As the combination approaches become more complicated the difference between the phonetic and graphemic systems further decreases. Finally, for all configurations examined the combination of phonetic and graphemic systems yields consistent gains.

* 5 pages, 6 tables, to appear in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018) 

  Access Paper or Ask Questions

Gaussian Quadrature for Kernel Features

Jan 31, 2018
Tri Dao, Christopher De Sa, Christopher Ré

Kernel methods have recently attracted resurgent interest, showing performance competitive with deep neural networks in tasks such as speech recognition. The random Fourier features map is a technique commonly used to scale up kernel machines, but employing the randomized feature map means that $O(\epsilon^{-2})$ samples are required to achieve an approximation error of at most $\epsilon$. We investigate some alternative schemes for constructing feature maps that are deterministic, rather than random, by approximating the kernel in the frequency domain using Gaussian quadrature. We show that deterministic feature maps can be constructed, for any $\gamma > 0$, to achieve error $\epsilon$ with $O(e^{e^\gamma} + \epsilon^{-1/\gamma})$ samples as $\epsilon$ goes to 0. Our method works particularly well with sparse ANOVA kernels, which are inspired by the convolutional layer of CNNs. We validate our methods on datasets in different domains, such as MNIST and TIMIT, showing that deterministic features are faster to generate and achieve accuracy comparable to the state-of-the-art kernel methods based on random Fourier features.

* Neural Information Processing Systems (NIPS) 2017 

  Access Paper or Ask Questions

A Sequential Model for Classifying Temporal Relations between Intra-Sentence Events

Jul 23, 2017
Prafulla Kumar Choubey, Ruihong Huang

We present a sequential model for temporal relation classification between intra-sentence events. The key observation is that the overall syntactic structure and compositional meanings of the multi-word context between events are important for distinguishing among fine-grained temporal relations. Specifically, our approach first extracts a sequence of context words that indicates the temporal relation between two events, which well align with the dependency path between two event mentions. The context word sequence, together with a parts-of-speech tag sequence and a dependency relation sequence that are generated corresponding to the word sequence, are then provided as input to bidirectional recurrent neural network (LSTM) models. The neural nets learn compositional syntactic and semantic representations of contexts surrounding the two events and predict the temporal relation between them. Evaluation of the proposed approach on TimeBank corpus shows that sequential modeling is capable of accurately recognizing temporal relations between events, which outperforms a neural net model using various discrete features as input that imitates previous feature based models.

* EMNLP 2017 

  Access Paper or Ask Questions

ASR error management for improving spoken language understanding

May 26, 2017
Edwin Simonnet, Sahar Ghannay, Nathalie Camelin, Yannick Estève, Renato De Mori

This paper addresses the problem of automatic speech recognition (ASR) error detection and their use for improving spoken language understanding (SLU) systems. In this study, the SLU task consists in automatically extracting, from ASR transcriptions , semantic concepts and concept/values pairs in a e.g touristic information system. An approach is proposed for enriching the set of semantic labels with error specific labels and by using a recently proposed neural approach based on word embeddings to compute well calibrated ASR confidence measures. Experimental results are reported showing that it is possible to decrease significantly the Concept/Value Error Rate with a state of the art system, outperforming previously published results performance on the same experimental data. It also shown that combining an SLU approach based on conditional random fields with a neural encoder/decoder attention based architecture , it is possible to effectively identifying confidence islands and uncertain semantic output segments useful for deciding appropriate error handling actions by the dialogue manager strategy .

* Interspeech 2017, Aug 2017, Stockholm, Sweden. 2017 

  Access Paper or Ask Questions

Hierarchical Attention Model for Improved Machine Comprehension of Spoken Content

Jan 01, 2017
Wei Fang, Jui-Yang Hsu, Hung-yi Lee, Lin-Shan Lee

Multimedia or spoken content presents more attractive information than plain text content, but the former is more difficult to display on a screen and be selected by a user. As a result, accessing large collections of the former is much more difficult and time-consuming than the latter for humans. It's therefore highly attractive to develop machines which can automatically understand spoken content and summarize the key information for humans to browse over. In this endeavor, a new task of machine comprehension of spoken content was proposed recently. The initial goal was defined as the listening comprehension test of TOEFL, a challenging academic English examination for English learners whose native languages are not English. An Attention-based Multi-hop Recurrent Neural Network (AMRNN) architecture was also proposed for this task, which considered only the sequential relationship within the speech utterances. In this paper, we propose a new Hierarchical Attention Model (HAM), which constructs multi-hopped attention mechanism over tree-structured rather than sequential representations for the utterances. Improved comprehension performance robust with respect to ASR errors were obtained.

* Copyright 2016 IEEE. Published in the 2016 IEEE Workshop on Spoken Language Technology (SLT 2016) 

  Access Paper or Ask Questions

Incorporating Belief Function in SVM for Phoneme Recognition

Jul 22, 2015
Rimah Amami, Dorra Ben Ayed, Nouerddine Ellouze

The Support Vector Machine (SVM) method has been widely used in numerous classification tasks. The main idea of this algorithm is based on the principle of the margin maximization to find an hyperplane which separates the data into two different classes.In this paper, SVM is applied to phoneme recognition task. However, in many real-world problems, each phoneme in the data set for recognition problems may differ in the degree of significance due to noise, inaccuracies, or abnormal characteristics; All those problems can lead to the inaccuracies in the prediction phase. Unfortunately, the standard formulation of SVM does not take into account all those problems and, in particular, the variation in the speech input. This paper presents a new formulation of SVM (B-SVM) that attributes to each phoneme a confidence degree computed based on its geometric position in the space. Then, this degree is used in order to strengthen the class membership of the tested phoneme. Hence, we introduce a reformulation of the standard SVM that incorporates the degree of belief. Experimental performance on TIMIT database shows the effectiveness of the proposed method B-SVM on a phoneme recognition problem.

* Lecture Notes in Computer Science Volume 8480, 2014, pp 191-199 
* 9th International Conference, Hybrid Artificial Intelligence Systems, Salamanca, Spain, June 11-13, 2014 

  Access Paper or Ask Questions

Visual Passwords Using Automatic Lip Reading

Sep 02, 2014
Ahmad Basheer Hassanat

This paper presents a visual passwords system to increase security. The system depends mainly on recognizing the speaker using the visual speech signal alone. The proposed scheme works in two stages: setting the visual password stage and the verification stage. At the setting stage the visual passwords system request the user to utter a selected password, a video recording of the user face is captured, and processed by a special words-based VSR system which extracts a sequence of feature vectors. In the verification stage, the same procedure is executed, the features will be sent to be compared with the stored visual password. The proposed scheme has been evaluated using a video database of 20 different speakers (10 females and 10 males), and 15 more males in another video database with different experiment sets. The evaluation has proved the system feasibility, with average error rate in the range of 7.63% to 20.51% at the worst tested scenario, and therefore, has potential to be a practical approach with the support of other conventional authentication methods such as the use of usernames and passwords.

* International Journal of Sciences: Basic and Applied Research (IJSBAR) (2014) Volume 13, No 1, pp 218-231 

  Access Paper or Ask Questions

A Multi-Scale Time-Frequency Spectrogram Discriminator for GAN-based Non-Autoregressive TTS

Mar 22, 2022
Haohan Guo, Hui Lu, Xixin Wu, Helen Meng

The generative adversarial network (GAN) has shown its outstanding capability in improving Non-Autoregressive TTS (NAR-TTS) by adversarially training it with an extra model that discriminates between the real and the generated speech. To maximize the benefits of GAN, it is crucial to find a powerful discriminator that can capture rich distinguishable information. In this paper, we propose a multi-scale time-frequency spectrogram discriminator to help NAR-TTS generate high-fidelity Mel-spectrograms. It treats the spectrogram as a 2D image to exploit the correlation among different components in the time-frequency domain. And a U-Net-based model structure is employed to discriminate at different scales to capture both coarse-grained and fine-grained information. We conduct subjective tests to evaluate the proposed approach. Both multi-scale and time-frequency discriminating bring significant improvement in the naturalness and fidelity. When combining the neural vocoder, it is shown more effective and concise than fine-tuning the vocoder. Finally, we visualize the discriminating maps to compare their difference to verify the effectiveness of multi-scale discriminating.

* Submitted to INTERSPEECH 2022 

  Access Paper or Ask Questions

The USTC-Ximalaya system for the ICASSP 2022 multi-channel multi-party meeting transcription (M2MeT) challenge

Feb 10, 2022
Maokui He, Xiang Lv, Weilin Zhou, JingJing Yin, Xiaoqi Zhang, Yuxuan Wang, Shutong Niu, Yuhang Cao, Heng Lu, Jun Du, Chin-Hui Lee

We propose two improvements to target-speaker voice activity detection (TS-VAD), the core component in our proposed speaker diarization system that was submitted to the 2022 Multi-Channel Multi-Party Meeting Transcription (M2MeT) challenge. These techniques are designed to handle multi-speaker conversations in real-world meeting scenarios with high speaker-overlap ratios and under heavy reverberant and noisy condition. First, for data preparation and augmentation in training TS-VAD models, speech data containing both real meetings and simulated indoor conversations are used. Second, in refining results obtained after TS-VAD based decoding, we perform a series of post-processing steps to improve the VAD results needed to reduce diarization error rates (DERs). Tested on the ALIMEETING corpus, the newly released Mandarin meeting dataset used in M2MeT, we demonstrate that our proposed system can decrease the DER by up to 66.55/60.59% relatively when compared with classical clustering based diarization on the Eval/Test set.

  Access Paper or Ask Questions