Abstract: This paper proposes a novel approach to creating a unit set for CTC-based speech recognition systems. Using Byte Pair Encoding, we learn a unit set of arbitrary size on a given training text. In contrast to using characters or words as units, this allows us to find a good trade-off between the size of the unit set and the available training data. We evaluate both crossword units, which may span multiple words, and subword units. By combining this approach with decoding methods that use a separate language model, we achieve state-of-the-art results for grapheme-based CTC systems.
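To make the unit-learning step concrete, the following is a minimal sketch of Byte Pair Encoding over a word list, assuming a word-internal (subword) setup; the function name, the "</w>" end-of-word marker, and the training-file path are illustrative. Crossword units would be learned in the same way, but over whole utterances rather than single words, so that merges can cross word boundaries.

```python
from collections import Counter

def learn_bpe(corpus_words, num_merges):
    """Learn `num_merges` BPE merge operations from a list of training words.

    Repeatedly merges the most frequent adjacent symbol pair, so the unit set
    can be grown to an (almost) arbitrary size.
    """
    vocab = Counter()
    for word in corpus_words:
        vocab[tuple(word) + ("</w>",)] += 1    # start from characters + end marker

    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)       # most frequent adjacent pair
        merges.append(best)

        new_vocab = Counter()
        for symbols, freq in vocab.items():    # apply the merge everywhere
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# hypothetical usage: learn 300 merge operations from a training text
# merges = learn_bpe(open("train.txt").read().split(), num_merges=300)
```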
Abstract: Transcription or subtitling of open-domain videos is still challenging for Automatic Speech Recognition (ASR) due to difficult acoustics, variable signal processing, and the essentially unrestricted domain of the data. In previous work, we have shown that the visual channel -- specifically object and scene features -- can help to adapt the acoustic model (AM) and language model (LM) of a recognizer, and we now extend this work to end-to-end approaches. In the case of a Connectionist Temporal Classification (CTC)-based approach, we retain the separation of AM and LM, while for a sequence-to-sequence (S2S) approach, both information sources are adapted together in a single model. This paper also analyzes the behavior of CTC and S2S models on noisy video data (How-To corpus) and compares it to results on the clean Wall Street Journal (WSJ) corpus, providing insight into the robustness of both approaches.
Abstract: Techniques for multi-lingual and cross-lingual speech recognition can help in low-resource scenarios to bootstrap systems and enable analysis of new languages and domains. End-to-end approaches, in particular sequence-based techniques, are attractive because of their simplicity and elegance. While it is possible to integrate traditional multi-lingual bottleneck feature extractors as front-ends, we show that end-to-end multi-lingual training of sequence models is effective on context-independent models trained using Connectionist Temporal Classification (CTC) loss. Our model improves performance on Babel languages by over 6% absolute in terms of word/phoneme error rate compared to mono-lingual systems built in the same setting. We also show that the trained model can be adapted cross-lingually to an unseen language using just 25% of the target data. Training on multiple languages is important for very low-resource cross-lingual target scenarios, but not for multi-lingual testing scenarios, where it appears beneficial to include large, well-prepared datasets.
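As a rough illustration of such multi-lingual CTC training, the sketch below assumes a shared recurrent encoder with one softmax output layer per language; the class names, layer sizes, and per-language target counts are illustrative, not necessarily the paper's exact architecture. Cross-lingual adaptation to an unseen language would then reuse the shared encoder and fine-tune a newly attached output layer on the limited target-language data.

```python
import torch
import torch.nn as nn

class MultilingualCTC(nn.Module):
    """Shared BLSTM encoder with one language-specific CTC output layer each."""

    def __init__(self, targets_per_lang, feat_dim=40, hidden=320):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=4,
                               bidirectional=True, batch_first=True)
        # one projection per language; +1 output for the CTC blank symbol
        self.heads = nn.ModuleDict({
            lang: nn.Linear(2 * hidden, n_targets + 1)
            for lang, n_targets in targets_per_lang.items()
        })

    def forward(self, feats, lang):
        h, _ = self.encoder(feats)                # (B, T, 2*hidden)
        return self.heads[lang](h).log_softmax(-1)

# illustrative phoneme-inventory sizes, not the actual Babel configurations
model = MultilingualCTC(targets_per_lang={"bengali": 46, "swahili": 38})
feats = torch.randn(2, 200, 40)                   # batch of 2 utterances
log_probs = model(feats, lang="swahili")          # (2, 200, 39), fed to CTC loss
```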
Abstract: Stories are a vital form of communication in human culture; they are employed daily to persuade, to elicit sympathy, or to convey a message. Computational understanding of human narratives, especially of high-level narrative structures, remains limited to date. Multiple literary theories for narrative structures exist, but operationalizing them has remained a challenge. We developed an annotation scheme by consolidating and extending existing narratological theories, including Labov and Waletsky's (1967) functional categorization scheme and Freytag's (1863) pyramid of dramatic tension, and present 360 annotated short stories collected from online sources. In the future, this research will support approaches that enable systems to intelligently sustain complex communication with humans.
Abstract: We summarize the accomplishments of a multi-disciplinary workshop exploring the computational and scientific issues surrounding the discovery of linguistic units (subwords and words) in a language without orthography. We study the replacement of orthographic transcriptions with images and/or translated text in a well-resourced language to support unsupervised discovery of such units from raw speech.
Abstract: Automatic transcriptions of consumer-generated multi-media content such as "Youtube" videos still exhibit high word error rates. Such data typically covers a very broad domain, has been recorded in challenging conditions with cheap hardware and a focus on the visual modality, and may have been post-processed or edited. In this paper, we extend our earlier work on adapting the acoustic model of a DNN-based speech recognition system to also cover an RNN language model, and show how both can be adapted to the objects and scenes that are automatically detected in the video. We work on a corpus of "how-to" videos from the web, and the idea is that an object that can be seen ("car") or a scene that is detected ("kitchen") can be used to condition both models on the "context" of the recording, thereby reducing perplexity and improving transcription. We achieve good improvements in both cases and compare and analyze the respective reductions in word error rate. We expect that our results can be used for any type of speech processing in which "context" information is available, for example in robotics, man-machine interaction, or when indexing large audio-visual archives, and should ultimately help to bring together the "video-to-text" and "speech-to-text" communities.
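As a sketch of the language-model side of this idea, the hypothetical PyTorch module below conditions an RNN LM on a per-video context vector (e.g. object and scene detector posteriors) by concatenating it to every word embedding; all names and dimensions are illustrative, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

class ContextRNNLM(nn.Module):
    """RNN language model conditioned on a visual context vector."""

    def __init__(self, vocab=10000, emb=256, ctx_dim=1000, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.rnn = nn.LSTM(emb + ctx_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, words, context):
        # words: (B, T) word ids; context: (B, ctx_dim) detector posteriors
        x = self.embed(words)
        ctx = context.unsqueeze(1).expand(-1, x.size(1), -1)
        h, _ = self.rnn(torch.cat([x, ctx], dim=-1))
        return self.out(h)   # next-word logits, now conditioned on the video

# e.g. a video where "kitchen" and "pan" fire strongly should lower the
# perplexity of cooking-related word sequences
```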
Abstract: Connectionist Temporal Classification (CTC) has recently attracted a lot of interest, as it offers an elegant approach to building acoustic models (AMs) for speech recognition. The CTC loss function maps an input sequence of observable feature vectors to an output sequence of symbols. Output symbols are conditionally independent of each other under CTC loss, so a language model (LM) can be incorporated conveniently during decoding, retaining the traditional separation of acoustic and linguistic components in ASR. For fixed vocabularies, Weighted Finite State Transducers provide a strong baseline for efficient integration of CTC AMs with n-gram LMs. Character-based neural LMs provide a straightforward solution for open-vocabulary speech recognition and all-neural models, and can be decoded with beam search. Finally, sequence-to-sequence models can be used to translate a sequence of individual sounds into a word string. We compare the performance of these three approaches and analyze their error patterns, providing guidance for future research and development in this important area.
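To make the character-LM variant concrete, the following is a sketch of CTC prefix beam search with shallow fusion of a character-level LM; the `lm(prefix, symbol)` callable, the LM weight `alpha`, and the beam size are illustrative assumptions, not the exact decoder configuration compared in the paper.

```python
import math
from collections import defaultdict

NEG_INF = float("-inf")

def logsumexp(*xs):
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_prefix_beam_search(log_probs, alphabet, blank=0, beam=8, lm=None, alpha=0.5):
    """log_probs: T x V per-frame log posteriors from the CTC acoustic model.
    lm(prefix, s): log probability of character s given the prefix (optional).
    Each beam entry stores (log p ending in blank, log p ending in non-blank)."""
    beams = {(): (0.0, NEG_INF)}
    for t in range(len(log_probs)):
        next_beams = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_b, p_nb) in beams.items():
            for s in range(len(alphabet)):
                p = log_probs[t][s]
                if s == blank:                        # stay on the same prefix
                    nb_b, nb_nb = next_beams[prefix]
                    next_beams[prefix] = (logsumexp(nb_b, p_b + p, p_nb + p), nb_nb)
                    continue
                lm_bonus = alpha * lm(prefix, s) if lm else 0.0
                new_prefix = prefix + (s,)
                nb_b, nb_nb = next_beams[new_prefix]
                if prefix and s == prefix[-1]:
                    # repeated symbol: only the blank path extends the prefix
                    next_beams[new_prefix] = (nb_b, logsumexp(nb_nb, p_b + p + lm_bonus))
                    sb_b, sb_nb = next_beams[prefix]  # collapsed repeat
                    next_beams[prefix] = (sb_b, logsumexp(sb_nb, p_nb + p))
                else:
                    next_beams[new_prefix] = (nb_b, logsumexp(nb_nb, p_b + p + lm_bonus,
                                                              p_nb + p + lm_bonus))
        beams = dict(sorted(next_beams.items(),
                            key=lambda kv: logsumexp(*kv[1]), reverse=True)[:beam])
    best = max(beams.items(), key=lambda kv: logsumexp(*kv[1]))[0]
    return "".join(alphabet[i] for i in best)
```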
Abstract: Environmental sound detection is a challenging application of machine learning because of the noisy nature of the signal and the small amount of (labeled) data that is typically available. This work therefore presents a comparison of several state-of-the-art Deep Learning models on the IEEE Detection and Classification of Acoustic Scenes and Events (DCASE) 2016 challenge task and data, classifying sounds into one of fifteen common indoor and outdoor acoustic scenes, such as bus, cafe, car, city center, forest path, library, and train. In total, 13 hours of stereo audio recordings are available, making this one of the largest datasets available. We perform experiments on six sets of features, including standard Mel-frequency cepstral coefficients (MFCC), Binaural MFCC, log Mel-spectrum and two different large-scale temporal pooling features extracted using OpenSMILE. On these features, we apply five models: Gaussian Mixture Model (GMM), Deep Neural Network (DNN), Recurrent Neural Network (RNN), Convolutional Deep Neural Network (CNN) and i-vector. Using a late-fusion approach, we improve on the baseline accuracy of 72.5% by 15.6% in 4-fold Cross Validation (CV) average accuracy and by 11% in test accuracy, which matches the best result of the DCASE 2016 challenge. With large feature sets, deep neural network models outperform traditional methods and achieve the best performance among all the studied methods. Consistent with other work, the best-performing single model is the non-temporal DNN model, which we take as evidence that sounds in the DCASE challenge do not exhibit strong temporal dynamics.
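A minimal sketch of the late-fusion step, assuming each of the trained models emits posterior probabilities over the fifteen scenes for a recording; the function name and the optional per-model weights are illustrative.

```python
import numpy as np

def late_fusion(posteriors, weights=None):
    """Fuse per-scene posteriors from several classifiers by (weighted)
    averaging and return the index of the winning acoustic scene."""
    posteriors = np.asarray(posteriors)           # (num_models, num_scenes)
    if weights is None:
        weights = np.ones(len(posteriors)) / len(posteriors)
    fused = np.average(posteriors, axis=0, weights=weights)
    return int(np.argmax(fused)), fused

# hypothetical usage with three of the studied models for one recording:
# scene_id, fused = late_fusion([dnn_posteriors, cnn_posteriors, ivector_posteriors])
```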
Abstract: Speech is one of the most effective ways of communication among humans. Even though audio is the most common way of transmitting speech, very important information can be found in other modalities, such as vision. Vision is particularly useful when the acoustic signal is corrupted. Multi-modal speech recognition, however, has not yet found widespread use, mostly because the temporal alignment and fusion of the different information sources is challenging. This paper presents an end-to-end audio-visual speech recognizer (AVSR), based on recurrent neural networks (RNNs) with a connectionist temporal classification (CTC) loss function. CTC creates sparse, "peaky" output activations, and we analyze the differences in the alignments of output targets (phonemes or visemes) between audio-only, video-only, and audio-visual feature representations. We present the first such experiments on the large-vocabulary IBM ViaVoice database, and our models outperform previously published approaches in phone accuracy in both clean and noisy conditions.
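The hypothetical PyTorch sketch below shows the general shape of such a system: per-frame audio and visual features are concatenated, passed through a bidirectional LSTM, and trained with CTC loss; the feature dimensions, network sizes, and target inventory are placeholders rather than the configuration used on ViaVoice.

```python
import torch
import torch.nn as nn

class AVCTCModel(nn.Module):
    """Audio-visual CTC recognizer: fuse features by concatenation per frame."""

    def __init__(self, audio_dim=40, video_dim=64, hidden=320, num_targets=42):
        super().__init__()
        self.rnn = nn.LSTM(audio_dim + video_dim, hidden, num_layers=3,
                           bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, num_targets + 1)   # +1 for CTC blank (id 0)

    def forward(self, audio, video):
        x = torch.cat([audio, video], dim=-1)     # time-aligned feature fusion
        h, _ = self.rnn(x)
        return self.proj(h).log_softmax(-1)       # (B, T, num_targets + 1)

# toy training step on random data (2 utterances, 100 frames, 20 targets each)
model = AVCTCModel()
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
audio = torch.randn(2, 100, 40)
video = torch.randn(2, 100, 64)
targets = torch.randint(1, 43, (2, 20))           # phoneme/viseme ids; 0 = blank
log_probs = model(audio, video).transpose(0, 1)   # CTCLoss expects (T, B, C)
loss = ctc(log_probs, targets,
           torch.full((2,), 100, dtype=torch.long),
           torch.full((2,), 20, dtype=torch.long))
loss.backward()
```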
Abstract: The performance of automatic speech recognition (ASR) has improved tremendously due to the application of deep neural networks (DNNs). Despite this progress, building a new ASR system remains a challenging task, requiring various resources, multiple training stages, and significant expertise. This paper presents our Eesen framework, which drastically simplifies the existing pipeline for building state-of-the-art ASR systems. Acoustic modeling in Eesen involves learning a single recurrent neural network (RNN) that predicts context-independent targets (phonemes or characters). To remove the need for pre-generated frame labels, we adopt the connectionist temporal classification (CTC) objective function to infer the alignments between speech and label sequences. A distinctive feature of Eesen is a generalized decoding approach based on weighted finite-state transducers (WFSTs), which enables the efficient incorporation of lexicons and language models into CTC decoding. Experiments show that, compared with standard hybrid DNN systems, Eesen achieves comparable word error rates (WERs) while significantly speeding up decoding.
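To illustrate how CTC output is collapsed into a label sequence, here is a minimal greedy-decoding sketch (best symbol per frame, merge repeats, drop blanks); Eesen itself performs this search over WFSTs so that the lexicon and language model are applied at the same time, but the collapsing rule is the same.

```python
def ctc_greedy_decode(frame_posteriors, blank=0):
    """Best-path CTC decoding: per-frame argmax, then merge repeated symbols
    and remove blanks to obtain the predicted label sequence."""
    best_path = [max(range(len(frame)), key=frame.__getitem__)
                 for frame in frame_posteriors]
    labels, prev = [], None
    for sym in best_path:
        if sym != prev and sym != blank:
            labels.append(sym)
        prev = sym
    return labels

# e.g. a frame-wise argmax of [1, 1, 0, 1, 2, 2, 0] collapses to [1, 1, 2]
```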