Florian Metze

The ARIEL-CMU Systems for LoReHLT18

Feb 24, 2019

Aditi Chaudhary, Siddharth Dalmia, Junjie Hu, Xinjian Li, Austin Matthews, Aldrian Obaja Muis, Naoki Otani, Shruti Rijhwani, Zaid Sheikh, Nidhi Vyas (+20 more)

Abstract: This paper describes the ARIEL-CMU submissions to the Low Resource Human Language Technologies (LoReHLT) 2018 evaluations for the tasks Machine Translation (MT), Entity Discovery and Linking (EDL), and detection of Situation Frames in Text and Speech (SF Text and Speech).

Phoneme Level Language Models for Sequence Based Low Resource ASR

Feb 20, 2019

Siddharth Dalmia, Xinjian Li, Alan W Black, Florian Metze

Abstract: Building multilingual and crosslingual models helps bring different languages together in a language-universal space. It allows models to share parameters and transfer knowledge across languages, enabling faster and better adaptation to a new language. These approaches are particularly useful for low resource languages. In this paper, we propose a phoneme-level language model that can be used multilingually and for crosslingual adaptation to a target language. We show that our model performs almost as well as the monolingual models while using six times fewer parameters, and is capable of better adaptation to languages not seen during training in a low resource scenario. We show that these phoneme-level language models can be used to decode sequence-based Connectionist Temporal Classification (CTC) acoustic model outputs to obtain word error rates comparable to Weighted Finite State Transducer (WFST) based decoding in Babel languages. We also show that these phoneme-level language models outperform WFST decoding in various low-resource conditions, such as adapting to a new language and domain mismatch between training and testing data.

* To appear in ICASSP 2019
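
As background for the CTC decoding mentioned in this abstract, the sketch below shows the standard CTC best-path rule: merge consecutive repeated labels, then drop the blank symbol. It is only an illustration of that rule, not the paper's decoder, which additionally scores hypotheses with a phoneme-level language model. The blank symbol name and the example frame labels are assumptions made for this sketch.

```python
from itertools import groupby

BLANK = "<blank>"  # assumed name for the CTC blank symbol


def ctc_best_path(frame_labels, blank=BLANK):
    """Collapse a per-frame label sequence into an output sequence.

    Step 1: merge runs of identical consecutive labels.
    Step 2: remove blank symbols.
    """
    collapsed = [label for label, _ in groupby(frame_labels)]
    return [label for label in collapsed if label != blank]


# Hypothetical per-frame argmax labels for a short utterance.
frames = ["<blank>", "k", "k", "<blank>", "ae", "ae", "<blank>", "t", "t"]
print(ctc_best_path(frames))  # ['k', 'ae', 't']
```

A blank between two identical labels keeps them distinct, which is how CTC can emit the same phoneme twice in a row.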

Learned In Speech Recognition: Contextual Acoustic Word Embeddings

Feb 18, 2019

Shruti Palaskar, Vikas Raunak, Florian Metze

Abstract: End-to-end acoustic-to-word speech recognition models have recently gained popularity because they are easy to train, scale well to large amounts of training data, and do not require a lexicon. In addition, word models may also be easier to integrate with downstream tasks such as spoken language understanding, because inference (search) is much simplified compared to phoneme, character or any other sort of sub-word units. In this paper, we describe methods to construct contextual acoustic word embeddings directly from a supervised sequence-to-sequence acoustic-to-word speech recognition model using the learned attention distribution. On a suite of 16 standard sentence evaluation tasks, our embeddings show competitive performance against a word2vec model trained on the speech transcriptions. In addition, we evaluate these embeddings on a spoken language understanding task, and observe that our embeddings match the performance of text-based embeddings in a pipeline of first performing speech recognition and then constructing word embeddings from transcriptions.

* Accepted at ICASSP 2019, 5 pages, 1 figure, 3 tables
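
One simple way to turn a learned attention distribution into a word embedding is to take the attention-weighted sum of the encoder hidden states for the step at which the word was emitted. The sketch below illustrates that idea only; the abstract does not specify the exact construction, so the function name, the toy vectors, and the pooling choice are all assumptions.

```python
def attention_pooled_embedding(hidden_states, attention_weights):
    """Return the attention-weighted sum of encoder hidden states.

    hidden_states: list of equal-length vectors, one per acoustic frame.
    attention_weights: one weight per frame for a single decoded word
    (assumed to sum to 1, as a softmax attention distribution would).
    """
    if len(hidden_states) != len(attention_weights):
        raise ValueError("need exactly one attention weight per frame")
    pooled = [0.0] * len(hidden_states[0])
    for state, weight in zip(hidden_states, attention_weights):
        for i, value in enumerate(state):
            pooled[i] += weight * value
    return pooled


# Toy example: three frames with 2-dimensional hidden states.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [0.5, 0.25, 0.25]
print(attention_pooled_embedding(states, weights))  # [0.75, 0.5]
```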

Learning from Multiview Correlations in Open-Domain Videos

Nov 21, 2018

Nils Holzenberger, Shruti Palaskar, Pranava Madhyastha, Florian Metze, Raman Arora

Abstract: An increasing number of datasets contain multiple views, such as video, sound and automatic captions. A basic challenge in representation learning is how to leverage multiple views to learn better representations. This is further complicated by the existence of a latent alignment between views, such as between speech and its transcription, and by the multitude of choices for the learning objective. We explore an advanced, correlation-based representation learning method on a 4-way parallel, multimodal dataset, and assess the quality of the learned representations on retrieval-based tasks. We show that the proposed approach produces rich representations that capture most of the information shared across views. Our best models for speech and textual modalities achieve retrieval rates from 70.7% to 96.9% on open-domain, user-generated instructional videos. This shows it is possible to learn reliable representations across disparate, unaligned and noisy modalities, and encourages using the proposed approach on larger datasets.

Multimodal Grounding for Sequence-to-Sequence Speech Recognition

Nov 09, 2018

Ozan Caglayan, Ramon Sanabria, Shruti Palaskar, Loïc Barrault, Florian Metze

Abstract: Humans are capable of processing speech by making use of multiple sensory modalities. For example, the environment where a conversation takes place generally provides semantic and/or acoustic context that helps us to resolve ambiguities or to recall named entities. Motivated by this, there have been many works studying the integration of visual information into the speech recognition pipeline. Specifically, in our previous work, we proposed a multistep visual adaptive training approach which improves the accuracy of an audio-based Automatic Speech Recognition (ASR) system. This approach, however, is not end-to-end as it requires fine-tuning the whole model with an adaptation layer. In this paper, we propose novel end-to-end multimodal ASR systems and compare them to the adaptive approach by using a range of visual representations obtained from state-of-the-art convolutional neural networks. We show that adaptive training is effective for S2S models, leading to an absolute improvement of 1.4% in word error rate (WER). As for the end-to-end systems, although they perform better than the baseline, the improvements are slightly smaller than with adaptive training: an absolute WER reduction of 0.8% in single-best models. Using ensemble decoding, end-to-end models reach a WER of 15%, which is the lowest score among all systems.

* Submitted to ICASSP 2019
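
The results in this abstract are reported as word error rate (WER). As a reminder of how that metric is usually computed, here is a minimal sketch: the word-level edit distance (substitutions, deletions, insertions) between a reference and a hypothesis, divided by the number of reference words. The example sentences are made up, and real scoring tools typically also normalize text before scoring, which this sketch does not do.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word out of four.
print(word_error_rate("turn the pan over", "turn a pan over"))  # 0.25
```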

How2: A Large-scale Dataset for Multimodal Language Understanding

Nov 01, 2018

Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loïc Barrault, Lucia Specia, Florian Metze

Abstract: In this paper, we introduce How2, a multimodal collection of instructional videos with English subtitles and crowdsourced Portuguese translations. We also present integrated sequence-to-sequence baselines for machine translation, automatic speech recognition, spoken language translation, and multimodal summarization. By making available data and code for several multimodal natural language tasks, we hope to stimulate more research on these and similar challenges, to obtain a deeper understanding of multimodality in language processing.

Domain Robust Feature Extraction for Rapid Low Resource ASR Development

Sep 30, 2018

Siddharth Dalmia, Xinjian Li, Florian Metze, Alan W. Black

Abstract: Developing a practical speech recognizer for a low resource language is challenging, not only because of the (potentially unknown) properties of the language, but also because test data may not be from the same domain as the available training data. In this paper, we focus on the latter challenge, i.e. domain mismatch, for systems trained using a sequence-based criterion. We demonstrate the effectiveness of using a pre-trained English recognizer, which is robust to such mismatched conditions, as a domain normalizing feature extractor on a low resource language. In our example, we use Turkish Conversational Speech and Broadcast News data. This enables rapid development of speech recognizers for new languages which can easily adapt to any domain. Testing in various cross-domain scenarios, we achieve relative improvements of around 25% in phoneme error rate, with improvements being around 50% for some domains.

* To appear in SLT 2018
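
This abstract reports relative improvements, which are easy to confuse with absolute ones. The short sketch below shows the difference. The error rates in the example are hypothetical values chosen for illustration; they are not taken from the paper.

```python
def absolute_improvement(baseline_error, new_error):
    """Difference between the two error rates, in the same units."""
    return baseline_error - new_error


def relative_improvement(baseline_error, new_error):
    """Error reduction expressed as a fraction of the baseline error."""
    if baseline_error <= 0:
        raise ValueError("baseline error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical phoneme error rates (percent), not results from the paper.
baseline, adapted = 40.0, 30.0
print(absolute_improvement(baseline, adapted))  # 10.0 (percentage points)
print(relative_improvement(baseline, adapted))  # 0.25 (a 25% relative gain)
```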

Activity Recognition on a Large Scale in Short Videos - Moments in Time Dataset

Sep 13, 2018

Ankit Shah, Harini Kesavamoorthy, Poorva Rane, Pramati Kalwad, Alexander Hauptmann, Florian Metze

Abstract: Moments capture a huge part of our lives. Accurate recognition of these moments is challenging due to their diverse and complex interpretation. Action recognition refers to the act of classifying the desired action/activity present in a given video. In this work, we perform experiments on the Moments in Time dataset to accurately recognize activities occurring in 3-second clips. We use state-of-the-art techniques for visual, auditory and spatio-temporal localization and develop a method to accurately classify the activity in the Moments in Time dataset. Our novel approach of using Visual Based Textual features and fusion techniques performs well, providing an overall 89.23% Top-5 accuracy on the 20 classes, a significant improvement over the baseline TRN model.

* Action recognition submission for Moments in Time Dataset - Improved results over challenge submission
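
The headline number in this abstract is a Top-5 accuracy. A minimal sketch of how top-k accuracy is commonly computed follows: a clip counts as correct when its true class is among the k highest-scoring classes. The scores and labels below are invented for illustration, and the function name is an assumption.

```python
def top_k_accuracy(scores, labels, k=5):
    """Fraction of examples whose true label is among the k best-scored classes.

    scores: one list of per-class scores per example.
    labels: the true class index for each example.
    """
    if len(scores) != len(labels) or not labels:
        raise ValueError("need one label per example and at least one example")
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy example with 3 classes and 2 clips.
scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]
labels = [2, 1]
print(top_k_accuracy(scores, labels, k=2))  # 1.0
print(top_k_accuracy(scores, labels, k=1))  # 0.0
```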

Acoustic-to-Word Recognition with Sequence-to-Sequence Models

Aug 21, 2018

Shruti Palaskar, Florian Metze

Abstract: Acoustic-to-Word recognition provides a straightforward solution to end-to-end speech recognition without needing external decoding, language model re-scoring or lexicon. While character-based models offer a natural solution to the out-of-vocabulary problem, word models can be simpler to decode and may also be able to directly recognize semantically meaningful units. We present effective methods to train Sequence-to-Sequence models for direct word-level recognition (and character-level recognition) and show an absolute improvement of 4.4-5.0% in Word Error Rate on the Switchboard corpus compared to prior work. In addition to these promising results, word-based models are more interpretable than character models, which have to be composed into words using a separate decoding step. We analyze the encoder hidden states and the attention behavior, and show that location-aware attention naturally represents words as a single speech-word-vector, despite spanning multiple frames in the input. We finally show that the Acoustic-to-Word model also learns to segment speech into words with a mean standard deviation of 3 frames as compared with human annotated forced-alignments for the Switchboard corpus.

* 9 pages, 3 figures, Under Review at SLT 2018

Dialog-context aware end-to-end speech recognition

Aug 07, 2018

Suyoun Kim, Florian Metze

Abstract: Existing speech recognition systems are typically built at the sentence level, although it is known that dialog context, e.g. higher-level knowledge that spans across sentences or speakers, can help the processing of long conversations. The recent progress in end-to-end speech recognition systems promises to integrate all available information (e.g. acoustic, language resources) into a single model, which is then jointly optimized. It seems natural that such dialog context information should thus also be integrated into the end-to-end models to further improve recognition accuracy. In this work, we present a dialog-context aware speech recognition model, which explicitly uses context information beyond sentence-level information, in an end-to-end fashion. Our dialog-context model captures a history of sentence-level context so that the whole system can be trained with dialog-context information in an end-to-end manner. We evaluate our proposed approach on the Switchboard conversational speech corpus and show that our system outperforms a comparable sentence-level end-to-end speech recognition system.

* Submitted to SLT
