Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"speech": models, code, and papers

The Grammar of Sense: Is word-sense tagging much more than part-of-speech tagging?

Jul 26, 1996
Yorick Wilks, Mark Stevenson

This squib claims that Large-scale Automatic Sense Tagging of text (LAST) can be done at a high-level of accuracy and with far less complexity and computational effort than has been believed until now. Moreover, it can be done for all open class words, and not just carefully selected opposed pairs as in some recent work. We describe two experiments: one exploring the amount of information relevant to sense disambiguation which is contained in the part-of-speech field of entries in Longman Dictionary of Contemporary English (LDOCE). Another, more practical, experiment attempts sense disambiguation of all open class words in a text assigning LDOCE homographs as sense tags using only part-of-speech information. We report that 92% of open class words can be successfully tagged in this way. We plan to extend this work and to implement an improved large-scale tagger, a description of which is included here.

* 8 pages, LaTeX 

  Access Paper or Ask Questions

UniST: Unified End-to-end Model for Streaming and Non-streaming Speech Translation

Sep 15, 2021
Qianqian Dong, Yaoming Zhu, Mingxuan Wang, Lei Li

This paper presents a unified end-to-end frame-work for both streaming and non-streamingspeech translation. While the training recipes for non-streaming speech translation have been mature, the recipes for streaming speechtranslation are yet to be built. In this work, wefocus on developing a unified model (UniST) which supports streaming and non-streaming ST from the perspective of fundamental components, including training objective, attention mechanism and decoding policy. Experiments on the most popular speech-to-text translation benchmark dataset, MuST-C, show that UniST achieves significant improvement for non-streaming ST, and a better-learned trade-off for BLEU score and latency metrics for streaming ST, compared with end-to-end baselines and the cascaded models. We will make our codes and evaluation tools publicly available.

  Access Paper or Ask Questions

Adversarial Training of End-to-end Speech Recognition Using a Criticizing Language Model

Nov 02, 2018
Alexander H. Liu, Hung-yi Lee, Lin-shan Lee

In this paper we proposed a novel Adversarial Training (AT) approach for end-to-end speech recognition using a Criticizing Language Model (CLM). In this way the CLM and the automatic speech recognition (ASR) model can challenge and learn from each other iteratively to improve the performance. Since the CLM only takes the text as input, huge quantities of unpaired text data can be utilized in this approach within end-to-end training. Moreover, AT can be applied to any end-to-end ASR model using any deep-learning-based language modeling frameworks, and compatible with any existing end-to-end decoding method. Initial results with an example experimental setup demonstrated the proposed approach is able to gain consistent improvements efficiently from auxiliary text data under different scenarios.

* under review ICASSP 2019 

  Access Paper or Ask Questions

A Bayesian Network View on Acoustic Model-Based Techniques for Robust Speech Recognition

Sep 22, 2014
Roland Maas, Christian Huemmer, Armin Sehr, Walter Kellermann

This article provides a unifying Bayesian network view on various approaches for acoustic model adaptation, missing feature, and uncertainty decoding that are well-known in the literature of robust automatic speech recognition. The representatives of these classes can often be deduced from a Bayesian network that extends the conventional hidden Markov models used in speech recognition. These extensions, in turn, can in many cases be motivated from an underlying observation model that relates clean and distorted feature vectors. By converting the observation models into a Bayesian network representation, we formulate the corresponding compensation rules leading to a unified view on known derivations as well as to new formulations for certain approaches. The generic Bayesian perspective provided in this contribution thus highlights structural differences and similarities between the analyzed approaches.

  Access Paper or Ask Questions

Robust Speech Representation Learning via Flow-based Embedding Regularization

Dec 07, 2021
Woo Hyun Kang, Jahangir Alam, Abderrahim Fathan

Over the recent years, various deep learning-based methods were proposed for extracting a fixed-dimensional embedding vector from speech signals. Although the deep learning-based embedding extraction methods have shown good performance in numerous tasks including speaker verification, language identification and anti-spoofing, their performance is limited when it comes to mismatched conditions due to the variability within them unrelated to the main task. In order to alleviate this problem, we propose a novel training strategy that regularizes the embedding network to have minimum information about the nuisance attributes. To achieve this, our proposed method directly incorporates the information bottleneck scheme into the training process, where the mutual information is estimated using the main task classifier and an auxiliary normalizing flow network. The proposed method was evaluated on different speech processing tasks and showed improvement over the standard training strategy in all experimentation.

  Access Paper or Ask Questions

Fine-grained style control in Transformer-based Text-to-speech Synthesis

Oct 12, 2021
Li-Wei Chen, Alexander Rudnicky

In this paper, we present a novel architecture to realize fine-grained style control on the transformer-based text-to-speech synthesis (TransformerTTS). Specifically, we model the speaking style by extracting a time sequence of local style tokens (LST) from the reference speech. The existing content encoder in TransformerTTS is then replaced by our designed cross-attention blocks for fusion and alignment between content and style. As the fusion is performed along with the skip connection, our cross-attention block provides a good inductive bias to gradually infuse the phoneme representation with a given style. Additionally, we prevent the style embedding from encoding linguistic content by randomly truncating LST during training and using wav2vec 2.0 features. Experiments show that with fine-grained style control, our system performs better in terms of naturalness, intelligibility, and style transferability. Our code and samples are publicly available.

* Submitted to ICASSP 2022 

  Access Paper or Ask Questions

Is "moby dick" a Whale or a Bird? Named Entities and Terminology in Speech Translation

Sep 15, 2021
Marco Gaido, Susana Rodríguez, Matteo Negri, Luisa Bentivogli, Marco Turchi

Automatic translation systems are known to struggle with rare words. Among these, named entities (NEs) and domain-specific terms are crucial, since errors in their translation can lead to severe meaning distortions. Despite their importance, previous speech translation (ST) studies have neglected them, also due to the dearth of publicly available resources tailored to their specific evaluation. To fill this gap, we i) present the first systematic analysis of the behavior of state-of-the-art ST systems in translating NEs and terminology, and ii) release NEuRoparl-ST, a novel benchmark built from European Parliament speeches annotated with NEs and terminology. Our experiments on the three language directions covered by our benchmark (en->es/fr/it) show that ST systems correctly translate 75-80% of terms and 65-70% of NEs, with very low performance (37-40%) on person names.

* Accepted at EMNLP2021 

  Access Paper or Ask Questions

A new subband non linear prediction coding algorithm for narrowband speech signal: The nADPCMB MLT coding scheme

Mar 24, 2022
Guido D'Alessandro, Marcos Faundez Zanuy, Francesco Piazza

This paper focuses on a newly developed transparent nADPCMB MLT speech coding algorithm. Our coder first decomposes the narrowband speech signal in subbands, a non linear ADPCM scheme is then performed in each subband. The signal subband decomposition is piloted by the equivalent Modulated Lapped Transform (MLT) filter bank. The novelty of this algorithm is the non linear approach, based on neural networks, to subband prediction coding. We have evaluated the performance of the nADPCMB MLT coding algorithm with a session of formal listening based on the five grade impairment scale standardized within ITU - T Recommendation P.800.

* 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2002, pp. I-1025-I-1028 
* 4 pages, published in 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing Orlando, FL, USA 

  Access Paper or Ask Questions

Similarity and Content-based Phonetic Self Attention for Speech Recognition

Mar 28, 2022
Kyuhong Shim, Wonyong Sung

Transformer-based speech recognition models have achieved great success due to the self-attention (SA) mechanism that utilizes every frame in the feature extraction process. Especially, SA heads in lower layers capture various phonetic characteristics by the query-key dot product, which is designed to compute the pairwise relationship between frames. In this paper, we propose a variant of SA to extract more representative phonetic features. The proposed phonetic self-attention (phSA) is composed of two different types of phonetic attention; one is similarity-based and the other is content-based. In short, similarity-based attention utilizes the correlation between frames while content-based attention only considers each frame without being affected by others. We identify which parts of the original dot product are related to two different attention patterns and improve each part by simple modifications. Our experiments on phoneme classification and speech recognition show that replacing SA with phSA for lower layers improves the recognition performance without increasing the latency and the parameter size.

* Submitted to INTERSPEECH 2022 

  Access Paper or Ask Questions

Self-Supervised Learning from Contrastive Mixtures for Personalized Speech Enhancement

Nov 06, 2020
Aswin Sivaraman, Minje Kim

This work explores how self-supervised learning can be universally used to discover speaker-specific features towards enabling personalized speech enhancement models. We specifically address the few-shot learning scenario where access to cleaning recordings of a test-time speaker is limited to a few seconds, but noisy recordings of the speaker are abundant. We develop a simple contrastive learning procedure which treats the abundant noisy data as makeshift training targets through pairwise noise injection: the model is pretrained to maximize agreement between pairs of differently deformed identical utterances and to minimize agreement between pairs of similarly deformed nonidentical utterances. Our experiments compare the proposed pretraining approach with two baseline alternatives: speaker-agnostic fully-supervised pretraining, and speaker-specific self-supervised pretraining without contrastive loss terms. Of all three approaches, the proposed method using contrastive mixtures is found to be most robust to model compression (using 85% fewer parameters) and reduced clean speech (requiring only 3 seconds).

* 4 pages, 4 figures, submitted for NeurIPS SAS Workshop 2020 and ICASSP 2021 

  Access Paper or Ask Questions