
"speech": models, code, and papers

Status of the XTAG System

Nov 03, 1994
Christy Doran, Dania Egedi, Beth Ann Hockey, B. Srinivas

XTAG is an ongoing project to develop a wide-coverage grammar for English, based on the Feature-based Lexicalized Tree Adjoining Grammar (FB-LTAG) formalism. The XTAG system integrates a morphological analyzer, an N-best part-of-speech tagger, an Earley-style parser and an X-window interface, along with a wide-coverage grammar for English developed using the system. This system serves as a linguist's workbench for developing FB-LTAG specifications. This paper describes the various components of the XTAG system and recent improvements to them. It also presents the recent performance of the wide-coverage grammar on various corpora and compares it against the performance of other wide-coverage and domain-specific grammars.

* Proceedings of TAG+3, 1994 
* uuencoded compressed ps file. 4 pages 


Defining maximum acceptable latency of AI-enhanced CAI tools

Jan 08, 2022
Claudio Fantinuoli, Maddalena Montecchio

Recent years have seen an increasing number of studies on the design of computer-assisted interpreting tools with integrated automatic speech processing and their use by trainees and professional interpreters. This paper discusses the role of system latency in such tools and presents the results of an experiment designed to investigate the maximum system latency that is cognitively acceptable for interpreters working in the simultaneous modality. The results show that interpreters can cope with a system latency of 3 seconds without any major impact on the rendition of the original text, in terms of both accuracy and fluency. This value is above the typical latency of available AI-based CAI tools and paves the way for experimenting with larger context-based language models and higher latencies.

* Accepted at techLing2021 


Impact of Target Word and Context on End-to-End Metonymy Detection

Dec 06, 2021
Kevin Alex Mathews, Michael Strube

Metonymy is a figure of speech in which an entity is referred to by another related entity. The task of metonymy detection aims to distinguish metonymic tokens from literal ones. Until now, metonymy detection methods have attempted to disambiguate only a single noun phrase in a sentence, typically location names or organization names. In this paper, we disambiguate every word in a sentence by reformulating metonymy detection as a sequence labeling task. We also investigate the impact of the target word and of context on metonymy detection. We show that the target word is less useful for detecting metonymy in our dataset. On the other hand, entity types associated with domain-specific words in their context are easier to resolve. This shows that context words are much more relevant for detecting metonymy.


Comparing Machine Learning-Centered Approaches for Forecasting Language Patterns During Frustration in Early Childhood

Oct 29, 2021
Arnav Bhakta, Yeunjoo Kim, Pamela Cole

When faced with self-regulation challenges, children have been known to use their language to inhibit their emotions and behaviors. Yet, to date, there has been a critical lack of evidence regarding what patterns of speech children use during these moments of frustration. In this paper, eXtreme Gradient Boosting, Random Forest, Long Short-Term Memory Recurrent Neural Networks, and Elastic Net Regression have all been used to forecast these language patterns in children. Based on a comparative analysis of these methods, the study reveals that when dealing with high-dimensional and dense data with very irregular and abnormal distributions, as is the case with self-regulation patterns in children, decision tree-based algorithms outperform traditional regression and neural network methods.

* 9 pages, 6 figures, UNDER REVIEW, UNPUBLISHED 


Simple and Effective Zero-shot Cross-lingual Phoneme Recognition

Sep 23, 2021
Qiantong Xu, Alexei Baevski, Michael Auli

Recent progress in self-training, self-supervised pretraining and unsupervised learning has enabled well-performing speech recognition systems without any labeled data. However, in many cases labeled data is available for related languages but is not utilized by these methods. This paper extends previous work on zero-shot cross-lingual transfer learning by fine-tuning a multilingually pretrained wav2vec 2.0 model to transcribe unseen languages. This is done by mapping phonemes of the training languages to the target language using articulatory features. Experiments show that this simple method significantly outperforms prior work which introduced task-specific architectures and used only part of a monolingually pretrained model.
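The phoneme-mapping step can be sketched as a nearest-neighbour lookup over articulatory feature sets: each target-language phoneme is mapped to the training-language phoneme sharing the most features with it. The feature inventory below is a hypothetical toy subset for illustration, not the one used in the paper:

```python
# Toy articulatory feature sets (hypothetical, for illustration only).
FEATURES = {
    "p": {"bilabial", "plosive", "voiceless"},
    "b": {"bilabial", "plosive", "voiced"},
    "t": {"alveolar", "plosive", "voiceless"},
    "d": {"alveolar", "plosive", "voiced"},
}

def map_phoneme(target: str, training_phonemes: list) -> str:
    # Pick the training-language phoneme that shares the most
    # articulatory features with the unseen target phoneme.
    return max(training_phonemes,
               key=lambda p: len(FEATURES[target] & FEATURES[p]))
```

For example, if the training languages lack /b/, it maps to /d/ rather than /t/ here, since /d/ shares both manner and voicing with /b/.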


Coarse-To-Fine And Cross-Lingual ASR Transfer

Sep 02, 2021
Peter Polák, Ondřej Bojar

End-to-end neural automatic speech recognition systems have recently achieved state-of-the-art results, but they require large datasets and extensive computing resources. Transfer learning has been proposed to overcome these difficulties, even across languages, e.g., German ASR trained from an English model. We experiment with much less related languages, reusing an English model for Czech ASR. To simplify the transfer, we propose to use an intermediate alphabet, Czech without accents, and document that it is a highly effective strategy. The technique is also useful on Czech data alone, in the style of coarse-to-fine training. We achieve substantial reductions in training time as well as word error rate (WER).
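The intermediate "Czech without accents" alphabet amounts to stripping diacritics from the target text, which Unicode decomposition handles directly; a minimal sketch (not the authors' exact preprocessing):

```python
import unicodedata

def strip_accents(text: str) -> str:
    # Decompose characters (NFD) so accented letters split into a base
    # letter plus combining marks, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("Příliš žluťoučký kůň"))  # → Prilis zlutoucky kun
```

Training targets in this reduced alphabet overlap far more with English graphemes, which is what makes the coarse stage transferable.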

* Accepted to ITAT WAFNL 


Spoken Term Detection Methods for Sparse Transcription in Very Low-resource Settings

Jun 11, 2021
Éric Le Ferrand, Steven Bird, Laurent Besacier

We investigate the efficiency of two very different spoken term detection approaches for transcription when the available data is insufficient to train a robust ASR system. This work is grounded in a very low-resource language documentation scenario where only a few minutes of recording have been transcribed for a given language so far. Experiments on two oral languages show that a pretrained universal phone recognizer, fine-tuned with only a few minutes of target language speech, can be used for spoken term detection with better overall performance than a dynamic time warping approach. In addition, we show that representing phoneme recognition ambiguity in a graph structure can further boost recall while maintaining high precision in the low-resource spoken term detection task.
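The dynamic time warping baseline scores a query against a stretch of speech by finding the lowest-cost monotonic alignment between the two feature sequences. A minimal sketch on 1-D sequences (real systems align multi-dimensional acoustic feature frames):

```python
def dtw_distance(a, b):
    # d[i][j] = cost of the best monotonic alignment of a[:i] with b[:j],
    # using absolute difference as the local frame distance.
    INF = float("inf")
    d = [[INF] * (len(b) + 1) for _ in range(len(a) + 1)]
    d[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # skip a frame of a
                                 d[i][j - 1],      # skip a frame of b
                                 d[i - 1][j - 1])  # match frames
    return d[len(a)][len(b)]
```

A query matches wherever this alignment cost against a sliding window falls below a threshold, which is why DTW needs no training data at all.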


Semantic-WER: A Unified Metric for the Evaluation of ASR Transcript for End Usability

Jun 03, 2021
Somnath Roy

Recent advances in supervised, semi-supervised and self-supervised deep learning algorithms have shown significant improvement in the performance of automatic speech recognition (ASR) systems. The state-of-the-art systems have achieved a word error rate (WER) of less than 5%. However, in the past, researchers have argued that the WER metric is unsuitable for evaluating ASR systems for downstream tasks such as spoken language understanding (SLU) and information retrieval. The reason is that the WER works at the surface level and does not incorporate any syntactic or semantic knowledge. The current work proposes Semantic-WER (SWER), a metric to evaluate ASR transcripts for downstream applications in general. The SWER can be easily customized for any downstream task.
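The surface-level WER the paper argues against is the word-level edit distance between reference and hypothesis, normalized by reference length; a minimal sketch (the semantic weighting that SWER adds on top is not specified in this abstract):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```

Note that every error counts equally here, regardless of whether the mistaken word carries any meaning for a downstream task, which is exactly the limitation SWER targets.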


Optimal Size-Performance Tradeoffs: Weighing PoS Tagger Models

Apr 16, 2021
Magnus Jacobsen, Mikkel H. Sørensen, Leon Derczynski

Improvements in machine learning-based NLP performance are often presented alongside bigger models and more complex code. This presents a trade-off: better scores come at the cost of larger tools, and bigger models tend to require more time during training and inference. We present multiple methods for measuring the size of a model, and for comparing this with the model's performance. In a case study on part-of-speech tagging, we then apply these techniques to taggers for eight languages and present a novel analysis identifying which taggers are size-performance optimal. Results indicate that some classical taggers place on the size-performance skyline across languages. Further, although the deep models have the highest performance for multiple scores, it is often not the most complex of these that reaches peak performance.
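The size-performance skyline can be computed as the Pareto-optimal subset of (size, score) points: a tagger is on the skyline if no other tagger is at least as small and at least as accurate, with a strict improvement in one dimension. A minimal sketch with illustrative numbers (not the paper's measurements):

```python
def size_performance_skyline(models: dict) -> dict:
    # models maps name -> (size_mb, accuracy). Keep the models that no
    # other model dominates (smaller-or-equal AND more-or-equally accurate,
    # strictly better in at least one of the two).
    optimal = {}
    for name, (size, acc) in models.items():
        dominated = any(
            s <= size and a >= acc and (s < size or a > acc)
            for other, (s, a) in models.items()
            if other != name
        )
        if not dominated:
            optimal[name] = (size, acc)
    return optimal

# Hypothetical taggers with made-up sizes/accuracies, for illustration only.
taggers = {
    "hmm": (5, 0.93),
    "crf": (40, 0.95),
    "bilstm": (120, 0.97),
    "bert": (400, 0.975),
    "cnn": (200, 0.94),
}
```

Here "cnn" is dominated by "crf" (smaller and more accurate) and drops off the skyline, while the classical "hmm" stays on it despite the lowest score, mirroring the paper's observation about classical taggers.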


Adapting Speaker Embeddings for Speaker Diarisation

Apr 07, 2021
Youngki Kwon, Jee-weon Jung, Hee-Soo Heo, You Jin Kim, Bong-Jin Lee, Joon Son Chung

The goal of this paper is to adapt speaker embeddings for solving the problem of speaker diarisation. The quality of speaker embeddings is paramount to the performance of speaker diarisation systems. Despite this, prior works in the field have directly used embeddings designed only to be effective on the speaker verification task. In this paper, we propose three techniques to better adapt speaker embeddings for diarisation: dimensionality reduction, attention-based embedding aggregation, and non-speech clustering. A wide range of experiments is performed on various challenging datasets. The results demonstrate that all three techniques contribute positively to the performance of the diarisation system, achieving an average relative improvement of 25.07% in diarisation error rate over the baseline.
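The clustering stage that consumes such embeddings can be sketched as greedy assignment of segment embeddings by cosine similarity; this is a generic illustration with a hypothetical similarity threshold, not the paper's method:

```python
def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sum(x * x for x in u) ** 0.5
    norm_v = sum(y * y for y in v) ** 0.5
    return dot / (norm_u * norm_v)

def cluster_segments(embeddings, threshold=0.8):
    # Greedy online clustering: assign each segment embedding to the most
    # similar existing cluster (represented by its first embedding, a
    # simplification), or start a new cluster when similarity is below
    # the threshold. Returns one speaker label per segment.
    centroids, labels = [], []
    for emb in embeddings:
        if centroids:
            sims = [cosine(emb, c) for c in centroids]
            best = max(range(len(sims)), key=sims.__getitem__)
            if sims[best] >= threshold:
                labels.append(best)
                continue
        centroids.append(emb)
        labels.append(len(centroids) - 1)
    return labels
```

The sketch makes the paper's point concrete: clustering quality is only as good as the embeddings, since segments from the same speaker must land near each other for any threshold to separate speakers cleanly.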

* 5 pages, 2 figures, 3 tables, submitted to Interspeech as a conference paper 
