Florian Metze

AlloVera: A Multilingual Allophone Database

Apr 17, 2020

David R. Mortensen, Xinjian Li, Patrick Littell, Alexis Michaud, Shruti Rijhwani, Antonios Anastasopoulos, Alan W. Black, Florian Metze, Graham Neubig

Abstract: We introduce a new resource, AlloVera, which provides mappings from 218 allophones to phonemes for 14 languages. Phonemes are contrastive phonological units, and allophones are their various concrete realizations, which are predictable from phonological context. While phonemic representations are language specific, phonetic representations (stated in terms of (allo)phones) are much closer to a universal (language-independent) transcription. AlloVera allows the training of speech recognition models that output phonetic transcriptions in the International Phonetic Alphabet (IPA), regardless of the input language. We show that a "universal" allophone model, Allosaurus, built with AlloVera, outperforms "universal" phonemic models and language-specific models on a speech-transcription task. We explore the implications of this technology (and related technologies) for the documentation of endangered and minority languages. We further explore other applications for which AlloVera will be suitable as it grows, including phonological typology.

* 8 pages, LREC 2020
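
The allophone-to-phoneme mappings described in the abstract can be pictured as a per-language lookup table. The sketch below is only an illustration: the table contents are a hand-written toy example (well-known allophones of English and Spanish stops), not data taken from AlloVera, and the names `ALLOPHONE_TO_PHONEME`, `phones_to_phonemes`, and `allophones_of` are hypothetical.

```python
# Toy, hand-written mapping for illustration only; NOT taken from AlloVera.
# Keys are language codes; values map an allophone (phone) to its phoneme.
ALLOPHONE_TO_PHONEME = {
    "eng": {"pʰ": "p", "p": "p", "tʰ": "t", "t": "t", "ɾ": "t"},
    "spa": {"d": "d", "ð": "d", "b": "b", "β": "b"},
}


def phones_to_phonemes(lang, phones):
    """Convert a language-independent phone sequence into the
    language-specific phoneme sequence for `lang`."""
    mapping = ALLOPHONE_TO_PHONEME[lang]
    return [mapping[phone] for phone in phones]


def allophones_of(lang, phoneme):
    """List every phone that realizes `phoneme` in `lang`."""
    mapping = ALLOPHONE_TO_PHONEME[lang]
    return [phone for phone, target in mapping.items() if target == phoneme]


if __name__ == "__main__":
    print(phones_to_phonemes("eng", ["pʰ", "ɾ"]))
    print(allophones_of("spa", "d"))
```

The same phone can map to different phonemes in different languages, which is why the table is keyed by language first.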

ASR Error Correction and Domain Adaptation Using Machine Translation

Mar 13, 2020

Anirudh Mani, Shruti Palaskar, Nimshi Venkat Meripo, Sandeep Konam, Florian Metze

Abstract: Off-the-shelf pre-trained Automatic Speech Recognition (ASR) systems are an increasingly viable service for companies of any size building speech-based products. While these ASR systems are trained on large amounts of data, domain mismatch remains an issue for many parties that want to use the service as-is, leading to suboptimal results on their tasks. We propose a simple technique to perform domain adaptation for ASR error correction via machine translation. The machine translation model is a strong candidate to learn a mapping from out-of-domain ASR errors to in-domain terms in the corresponding reference files. We use two off-the-shelf ASR systems in this work: Google ASR (commercial) and the ASPIRE model (open-source). We observe a 7% absolute improvement in word error rate and a 4-point absolute improvement in BLEU score on Google ASR output with our proposed method. We also evaluate ASR error correction via a downstream task of Speaker Diarization that captures the speaker style, syntax, structure and semantic improvements we obtain via ASR correction.

* Accepted for Oral Presentation at ICASSP 2020
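
For readers unfamiliar with the metric quoted in the abstract, word error rate (WER) is the word-level edit distance between a reference and a hypothesis, divided by the number of reference words. The sketch below is a generic textbook implementation, not the authors' evaluation code; the function name `word_error_rate` is hypothetical, and whitespace tokenization is an assumption.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length.

    Assumes whitespace tokenization; real evaluations usually normalize
    case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("a b c d", "a x c"))
```

An "absolute" improvement, as reported in the abstract, is the plain difference between two such rates, as opposed to a relative improvement, which divides that difference by the baseline rate.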

Universal Phone Recognition with a Multilingual Allophone System

Feb 26, 2020

Xinjian Li, Siddharth Dalmia, Juncheng Li, Matthew Lee, Patrick Littell, Jiali Yao, Antonios Anastasopoulos, David R. Mortensen, Graham Neubig, Alan W Black (+1 more)

Abstract: Multilingual models can improve language processing, particularly in low-resource situations, by sharing parameters across languages. Multilingual acoustic models, however, generally ignore the difference between phonemes (sounds that can support lexical contrasts in a particular language) and their corresponding phones (the sounds that are actually spoken, which are language independent). This can lead to performance degradation when combining a variety of training languages, as identically annotated phonemes can actually correspond to several different underlying phonetic realizations. In this work, we propose a joint model of both language-independent phone and language-dependent phoneme distributions. In multilingual ASR experiments over 11 languages, we find that this model improves testing performance by 2% phoneme error rate absolute in low-resource conditions. Additionally, because we are explicitly modeling language-independent phones, we can build a (nearly-)universal phone recognizer that, when combined with PHOIBLE, a large, manually curated database of phone inventories, can be customized into 2,000 language-dependent recognizers. Experiments on two low-resourced indigenous languages, Inuktitut and Tusom, show that our recognizer achieves phone accuracy improvements of more than 17%, moving a step closer to speech recognition for all languages in the world.

* ICASSP 2020
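
One simple way to picture the customization step mentioned in the abstract is to restrict a universal phone distribution to the phones listed in a language's inventory and renormalize. The sketch below is an assumption about how such a restriction could look, not the paper's actual procedure; `restrict_to_inventory` is a hypothetical name and the inventory is a made-up toy set rather than PHOIBLE data.

```python
def restrict_to_inventory(phone_probs, inventory):
    """Keep only phones in `inventory` and renormalize to sum to 1.

    phone_probs: dict mapping each universal phone to its probability
                 for one audio frame.
    inventory:   set of phones assumed to exist in the target language.
    """
    kept = {ph: p for ph, p in phone_probs.items() if ph in inventory}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("no probability mass falls inside the inventory")
    return {ph: p / total for ph, p in kept.items()}


if __name__ == "__main__":
    # Toy frame-level distribution over four universal phones.
    frame = {"p": 0.4, "pʰ": 0.3, "q": 0.2, "ʔ": 0.1}
    # Hypothetical inventory for illustration; not taken from PHOIBLE.
    print(restrict_to_inventory(frame, {"p", "ʔ"}))
```

Because the restriction needs only an inventory and no audio, the same universal model can in principle be specialized to any language for which an inventory is available.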

Towards Zero-shot Learning for Automatic Phonemic Transcription

Feb 26, 2020

Xinjian Li, Siddharth Dalmia, David R. Mortensen, Juncheng Li, Alan W Black, Florian Metze

Abstract: Automatic phonemic transcription tools are useful for low-resource language documentation. However, due to the lack of training sets, only a tiny fraction of languages have phonemic transcription tools. Fortunately, multilingual acoustic modeling provides a solution given limited audio training data. A more challenging problem is to build phonemic transcribers for languages with zero training data. The difficulty of this task is that phoneme inventories often differ between the training languages and the target language, making it infeasible to recognize unseen phonemes. In this work, we address this problem by adopting the idea of zero-shot learning. Our model is able to recognize unseen phonemes in the target language without any training data. In our model, we decompose phonemes into corresponding articulatory attributes such as vowel and consonant. Instead of predicting phonemes directly, we first predict distributions over articulatory attributes, and then compute phoneme distributions with a customized acoustic model. We evaluate our model by training it using 13 languages and testing it using 7 unseen languages. We find that it achieves 7.7% better phoneme error rate on average over a standard multilingual model.

* AAAI 2020
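
The two-step idea in the abstract (predict articulatory attributes, then derive phoneme scores) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's model: the attribute signatures in `SIGNATURES` are a hand-written toy set, the scoring rule (sum the scores of a phoneme's attributes, then apply a softmax) is an assumed stand-in, and `phoneme_distribution` is a hypothetical name.

```python
import math

# Toy attribute signatures, written by hand for illustration only.
SIGNATURES = {
    "p": {"consonant", "bilabial", "plosive", "voiceless"},
    "b": {"consonant", "bilabial", "plosive", "voiced"},
    "m": {"consonant", "bilabial", "nasal", "voiced"},
    "a": {"vowel", "open", "voiced"},
}


def phoneme_distribution(attribute_scores, inventory):
    """Turn per-attribute scores into a distribution over `inventory`.

    Each phoneme's score is the sum of the scores of its attributes
    (an assumed scoring rule); a softmax then normalizes the scores.
    """
    scores = {
        ph: sum(attribute_scores.get(attr, 0.0) for attr in SIGNATURES[ph])
        for ph in inventory
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {ph: math.exp(s - top) for ph, s in scores.items()}
    total = sum(exps.values())
    return {ph: e / total for ph, e in exps.items()}


if __name__ == "__main__":
    # Scores an attribute predictor might emit for a nasal bilabial sound.
    frame = {"consonant": 2.0, "bilabial": 2.0, "nasal": 3.0, "voiced": 1.0,
             "plosive": -1.0, "voiceless": -2.0, "vowel": -3.0, "open": -3.0}
    dist = phoneme_distribution(frame, ["p", "b", "m", "a"])
    print(max(dist, key=dist.get))
```

The point of the decomposition is that a phoneme never seen in training can still receive a score, as long as each of its attributes was observed on other phonemes.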

Looking Enhances Listening: Recovering Missing Speech Using Images

Feb 13, 2020

Tejas Srinivasan, Ramon Sanabria, Florian Metze

Abstract: Speech is understood better by using visual context; for this reason, there have been many attempts to use images to adapt automatic speech recognition (ASR) systems. Current work, however, has shown that visually adapted ASR models only use images as a regularization signal, while completely ignoring their semantic content. In this paper, we present a set of experiments where we show the utility of the visual modality under noisy conditions. Our results show that multimodal ASR models can recover words which are masked in the input acoustic signal, by grounding their transcriptions in the visual representations. We observe that integrating visual context can result in up to 35% relative improvement in masked word recovery. These results demonstrate that end-to-end multimodal ASR systems can become more robust to noise by leveraging the visual context.

* Accepted to ICASSP 2020

Gun Source and Muzzle Head Detection

Jan 29, 2020

Zhong Zhou, Isak Czeresnia Etinger, Florian Metze, Alexander Hauptmann, Alexander Waibel

Abstract: There is a surging need across the world for protection against gun violence. There are three main areas that we have identified as challenging in research that tries to curb gun violence: temporal location of gunshots, gun type prediction and gun source (shooter) detection. Our task is gun source detection and muzzle head detection, where the muzzle head is the round opening of the firing end of the gun. We would like to locate the muzzle head of the gun in the video visually, and identify who has fired the shot. In our formulation, we turn the problem of muzzle head detection into two sub-problems: human object detection and gun smoke detection. Our assumption is that the muzzle head typically lies between the gun smoke caused by the shot and the shooter. We have interesting results both in bounding the shooter and in detecting the gun smoke. In our experiments, we are successful in detecting the muzzle head by detecting the gun smoke and the shooter.

* EI 2020

On Compositionality in Neural Machine Translation

Dec 14, 2019

Vikas Raunak, Vaibhav Kumar, Florian Metze

Abstract: We investigate two specific manifestations of compositionality in Neural Machine Translation (NMT): (1) Productivity: the ability of the model to extend its predictions beyond the observed length in training data, and (2) Systematicity: the ability of the model to systematically recombine known parts and rules. We evaluate a standard Sequence to Sequence model on tests designed to assess these two properties in NMT. We quantitatively demonstrate that inadequate temporal processing, in the form of poor encoder representations, is a bottleneck for both Productivity and Systematicity. We propose a simple pre-training mechanism which improves model performance on the two properties and leads to a significant improvement in BLEU scores.

* Accepted at Context and Compositionality Workshop, NeurIPS 2019

Adversarial Music: Real World Audio Adversary Against Wake-word Detection System

Dec 06, 2019

Juncheng B. Li, Shuhui Qu, Xinjian Li, Joseph Szurley, J. Zico Kolter, Florian Metze

Abstract: Voice Assistants (VAs) such as Amazon Alexa or Google Assistant rely on wake-word detection to respond to people's commands, which could potentially be vulnerable to audio adversarial examples. In this work, we target our attack on the wake-word detection system, jamming the model with some inconspicuous background music to deactivate the VAs while our audio adversary is present. We implemented an emulated wake-word detection system of Amazon Alexa based on recent publications. We validated our models against the real Alexa in terms of wake-word detection accuracy. Then we computed our audio adversaries with consideration of expectation over transform and we implemented our audio adversary with a differentiable synthesizer. Next, we verified our audio adversaries digitally on hundreds of samples of utterances collected from the real world. Our experiments show that we can effectively reduce the recognition F1 score of our emulated model from 93.4% to 11.0%. Finally, we tested our audio adversary over the air, and verified it works effectively against Alexa, reducing its F1 score from 92.5% to 11.0%. We also verified that non-adversarial music does not disable Alexa as effectively as our music at the same sound level. To the best of our knowledge, this is the first real-world adversarial attack against a commercial-grade VA wake-word detection system. Our code and demo videos can be accessed at https://www.junchengbillyli.com/AdversarialMusic

* NIPS2019_9362, pages 11908-11918, 2019, Curran Associates, Inc., http://papers.nips.cc/paper/9362-adversarial-music-real-world-audio-adversary-against-wake-word-detection-system.pdf
* 9 pages, In Proceedings of NeurIPS 2019 Conference

Enforcing Encoder-Decoder Modularity in Sequence-to-Sequence Models

Nov 09, 2019

Siddharth Dalmia, Abdelrahman Mohamed, Mike Lewis, Florian Metze, Luke Zettlemoyer

Abstract: Inspired by modular software design principles of independence, interchangeability, and clarity of interface, we introduce a method for enforcing encoder-decoder modularity in seq2seq models without sacrificing the overall model quality or its full differentiability. We discretize the encoder output units into a predefined interpretable vocabulary space using the Connectionist Temporal Classification (CTC) loss. Our modular systems achieve near-SOTA performance on the 300h Switchboard benchmark, with WERs of 8.3% and 17.6% on the SWB and CH subsets, using seq2seq models with encoder and decoder modules which are independent and interchangeable.

Multitask Learning For Different Subword Segmentations In Neural Machine Translation

Oct 27, 2019

Tejas Srinivasan, Ramon Sanabria, Florian Metze

Abstract: In Neural Machine Translation (NMT), the usage of subwords and characters as source and target units offers a simple and flexible solution for translation of rare and unseen words. However, selecting the optimal subword segmentation involves a trade-off between expressiveness and flexibility, and is language- and dataset-dependent. We present Block Multitask Learning (BMTL), a novel NMT architecture that predicts multiple targets of different granularities simultaneously, removing the need to search for the optimal segmentation strategy. Our multi-task model exhibits improvements of up to 1.7 BLEU points on each decoder over single-task baseline models with the same number of parameters on datasets from two language pairs of IWSLT15 and one from IWSLT19. The multiple hypotheses generated at different granularities can be combined as a post-processing step to give better translations, which improves over hypothesis combination from baseline models while using substantially fewer parameters.

* Accepted to 16th International Workshop on Spoken Language Translation (IWSLT) 2019
