Sanjeev Khudanpur

Wake Word Detection with Streaming Transformers

Feb 08, 2021

Yiming Wang, Hang Lv, Daniel Povey, Lei Xie, Sanjeev Khudanpur

Abstract: Modern wake word detection systems usually rely on neural networks for acoustic modeling. Transformers have recently shown superior performance over LSTM and convolutional networks in various sequence modeling tasks with their better temporal modeling power. However, it is not clear whether this advantage still holds for short-range temporal modeling like wake word detection. Besides, the vanilla Transformer is not directly applicable to the task due to its non-streaming nature and its quadratic time and space complexity. In this paper we explore the performance of several variants of chunk-wise streaming Transformers tailored for wake word detection in a recently proposed LF-MMI system, including looking ahead to the next chunk, gradient stopping, different positional embedding methods, and adding same-layer dependency between chunks. Our experiments on the Mobvoi wake word dataset demonstrate that our proposed Transformer model outperforms the baseline convolution network by 25% on average in false rejection rate at the same false alarm rate with a comparable model size, while still maintaining linear complexity w.r.t. the sequence length.

* Accepted at IEEE ICASSP 2021. 5 pages, 3 figures
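
The chunk-wise attention idea in the abstract can be illustrated with a small attention mask. This is only a sketch under stated assumptions, not the authors' implementation: the function names, the `left_chunks`/`right_chunks` parameters, and the choice of one chunk of left context are all hypothetical. It shows why restricting each frame to a fixed number of neighboring chunks keeps the number of attended pairs linear in the sequence length.

```python
def chunk_attention_mask(seq_len, chunk_size, left_chunks=1, right_chunks=1):
    """Build a boolean mask where mask[i][j] is True if frame i may attend to frame j.

    Hypothetical setup: frame i sees its own chunk, `left_chunks` chunks of
    history, and `right_chunks` chunks of look-ahead.
    """
    mask = []
    for i in range(seq_len):
        chunk_i = i // chunk_size
        row = []
        for j in range(seq_len):
            chunk_j = j // chunk_size
            row.append(chunk_i - left_chunks <= chunk_j <= chunk_i + right_chunks)
        mask.append(row)
    return mask


def attended_pairs(mask):
    """Count the (query, key) pairs that the mask allows."""
    return sum(sum(row) for row in mask)


# Each row has at most (left_chunks + 1 + right_chunks) * chunk_size True
# entries, so the total work grows linearly with seq_len.
mask = chunk_attention_mask(seq_len=6, chunk_size=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Setting `right_chunks=0` in this sketch removes the look-ahead, which corresponds to a variant with lower latency but less future context.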

The Hitachi-JHU DIHARD III System: Competitive End-to-End Neural Diarization and X-Vector Clustering Systems Combined by DOVER-Lap

Feb 02, 2021

Shota Horiguchi, Nelson Yalta, Paola Garcia, Yuki Takashima, Yawen Xue, Desh Raj, Zili Huang, Yusuke Fujita, Shinji Watanabe, Sanjeev Khudanpur

Abstract: This paper provides a detailed description of the Hitachi-JHU system that was submitted to the Third DIHARD Speech Diarization Challenge. The system outputs the ensemble results of five subsystems: two x-vector-based subsystems, two end-to-end neural diarization-based subsystems, and one hybrid subsystem. We refine each system so that all five subsystems become competitive and complementary. After the DOVER-Lap based system combination, it achieved diarization error rates of 11.58% and 14.09% in Track 1 full and core, and 16.94% and 20.01% in Track 2 full and core, respectively. With these results, we won second place in all the tasks of the challenge.

Fine-grained activity recognition for assembly videos

Dec 02, 2020

Jonathan D. Jones, Cathryn Cortesa, Amy Shelton, Barbara Landau, Sanjeev Khudanpur, Gregory D. Hager

Abstract: In this paper we address the task of recognizing assembly actions as a structure (e.g. a piece of furniture or a toy block tower) is built up from a set of primitive objects. Recognizing the full range of assembly actions requires perception at a level of spatial detail that has not been attempted in the action recognition literature to date. We extend the fine-grained activity recognition setting to address the task of assembly action recognition in its full generality by unifying assembly actions and kinematic structures within a single framework. We use this framework to develop a general method for recognizing assembly actions from observation sequences, along with observation features that take advantage of a spatial assembly's special structure. Finally, we evaluate our method empirically on two application-driven data sources: (1) an IKEA furniture-assembly dataset, and (2) a block-building dataset. On the first, our system recognizes assembly actions with an average framewise accuracy of 70% and an average normalized edit distance of 10%. On the second, which requires fine-grained geometric reasoning to distinguish between assemblies, our system attains an average normalized edit distance of 23%, a relative improvement of 69% over prior work.

* 8 pages, 6 figures. Submitted to RA-L/ICRA 2021
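
The normalized edit distance reported in the abstract can be sketched as a Levenshtein distance between a predicted and a reference action sequence, divided by a length. The abstract does not state the normalizer, so dividing by the longer of the two sequences is an assumption here, as are the function names and the toy action labels.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two label sequences (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def normalized_edit_distance(ref, hyp):
    """Edit distance divided by the longer sequence length (assumed normalizer)."""
    longest = max(len(ref), len(hyp))
    return edit_distance(ref, hyp) / longest if longest else 0.0


# Toy action sequences; the labels are made up for illustration.
reference = ["attach_leg", "attach_leg", "flip", "attach_top"]
predicted = ["attach_leg", "flip", "attach_top"]
print(normalized_edit_distance(reference, predicted))
```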

Efficient MDI Adaptation for n-gram Language Models

Aug 05, 2020

Ruizhe Huang, Ke Li, Ashish Arora, Dan Povey, Sanjeev Khudanpur

Abstract: This paper presents an efficient algorithm for n-gram language model adaptation under the minimum discrimination information (MDI) principle, where an out-of-domain language model is adapted to satisfy the constraints of marginal probabilities of the in-domain data. The challenge for MDI language model adaptation is its computational complexity. By taking advantage of the backoff structure of n-gram models and the hierarchical training method originally proposed for maximum entropy (ME) language models, we show that each iteration of MDI adaptation can be computed in time linear in the size of the inputs. The complexity remains the same as for ME models, although MDI is more general than ME. This makes MDI adaptation practical for large corpora and vocabularies. Experimental results confirm the scalability of our algorithm on very large datasets, while MDI adaptation gets slightly worse perplexity but better word error rate results compared to simple linear interpolation.

* To appear in INTERSPEECH 2020. Appendix A of this full version will be filled soon
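
For context, the linear interpolation baseline named in the abstract can be sketched on toy unigram distributions. This is not the MDI algorithm from the paper; real n-gram models condition on a history and use backoff. The function names, the weight `lam`, and the probabilities are all assumptions made for illustration.

```python
import math


def interpolate(p_in, p_out, lam):
    """Mix two unigram distributions: lam * p_in + (1 - lam) * p_out."""
    vocab = set(p_in) | set(p_out)
    return {
        w: lam * p_in.get(w, 0.0) + (1.0 - lam) * p_out.get(w, 0.0)
        for w in vocab
    }


def perplexity(model, words):
    """Perplexity of a unigram model on a word sequence (all words must have p > 0)."""
    log_prob = sum(math.log(model[w]) for w in words)
    return math.exp(-log_prob / len(words))


# Toy in-domain and out-of-domain unigram models (made-up numbers).
in_domain = {"a": 0.5, "b": 0.5}
out_of_domain = {"a": 0.9, "c": 0.1}
mixed = interpolate(in_domain, out_of_domain, lam=0.5)
print(perplexity(mixed, ["a", "b", "a", "c"]))
```

The mixture of two normalized distributions is itself normalized, which is part of why this baseline is so cheap compared with constraint-based adaptation.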

Wake Word Detection with Alignment-Free Lattice-Free MMI

May 25, 2020

Yiming Wang, Hang Lv, Daniel Povey, Lei Xie, Sanjeev Khudanpur

Abstract: Always-on spoken language interfaces, e.g. personal digital assistants, rely on a wake word to start processing spoken input. We present novel methods to train a hybrid DNN/HMM wake word detection system from partially labeled training data, and to use it in on-line applications: (i) we remove the prerequisite of frame-level alignments in the LF-MMI training algorithm, permitting the use of un-transcribed training examples that are annotated only for the presence/absence of the wake word; (ii) we show that the classical keyword/filler model must be supplemented with an explicit non-speech (silence) model for good performance; (iii) we present an FST-based decoder to perform online detection. We evaluate our methods on two real data sets, showing 50% to 90% reduction in false rejection rates at pre-specified false alarm rates over the best previously published figures, and re-validate them on a third (large) data set.

* Submitted to Interspeech 2020. 5 pages, 3 figures
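
The evaluation metric in the abstract, false rejection rate at a pre-specified false alarm rate, can be sketched as a threshold search over detection scores. This is a simplified illustration: it treats the false alarm rate as a fraction of negative examples, whereas wake word work often reports false alarms per hour. The function name, the "detect when score >= threshold" rule, and the scores are assumptions.

```python
def frr_at_far(pos_scores, neg_scores, max_far):
    """Find the lowest threshold whose false alarm rate is <= max_far.

    Returns (threshold, false_alarm_rate, false_rejection_rate).
    A detection fires when score >= threshold.
    """
    candidates = sorted(set(pos_scores) | set(neg_scores)) + [float("inf")]
    for threshold in candidates:
        far = sum(s >= threshold for s in neg_scores) / len(neg_scores)
        if far <= max_far:
            # Raising the threshold further could only reject more positives,
            # so the first threshold that satisfies the constraint is the best.
            frr = sum(s < threshold for s in pos_scores) / len(pos_scores)
            return threshold, far, frr


# Made-up scores for utterances with (pos) and without (neg) the wake word.
pos = [0.9, 0.8, 0.4, 0.95]
neg = [0.1, 0.2, 0.5, 0.3]
print(frr_at_far(pos, neg, max_far=0.0))
```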

PyChain: A Fully Parallelized PyTorch Implementation of LF-MMI for End-to-End ASR

May 20, 2020

Yiwen Shao, Yiming Wang, Daniel Povey, Sanjeev Khudanpur

Abstract: We present PyChain, a fully parallelized PyTorch implementation of end-to-end lattice-free maximum mutual information (LF-MMI) training for the so-called chain models in the Kaldi automatic speech recognition (ASR) toolkit. Unlike other PyTorch and Kaldi based ASR toolkits, PyChain is designed to be as flexible and light-weight as possible so that it can be easily plugged into new ASR projects, or other existing PyTorch-based ASR tools, as exemplified respectively by a new project PyChain-example, and Espresso, an existing end-to-end ASR toolkit. PyChain's efficiency and flexibility are demonstrated through such novel features as full GPU training on numerator/denominator graphs, and support for unequal length sequences. Experiments on the WSJ dataset show that with simple neural networks and commonly used machine learning techniques, PyChain can achieve competitive results that are comparable to Kaldi and better than other end-to-end ASR systems.

* Submitted to Interspeech 2020
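
As background for the numerator/denominator graphs mentioned in the abstract, the LF-MMI objective is commonly written as a difference of two log-likelihoods per utterance. The notation below is an assumption chosen for this note, not taken from the paper: $\mathbf{X}_u$ is the acoustic feature sequence of utterance $u$, $\mathcal{G}^{\text{num}}_u$ is the utterance-specific numerator graph, and $\mathcal{G}^{\text{den}}$ is the shared denominator graph.

```latex
\mathcal{F}_{\text{LF-MMI}}
  = \sum_{u=1}^{U} \Bigl[
      \log p\bigl(\mathbf{X}_u \mid \mathcal{G}^{\text{num}}_u\bigr)
      - \log p\bigl(\mathbf{X}_u \mid \mathcal{G}^{\text{den}}\bigr)
    \Bigr]
```

Both terms are computed with the forward algorithm over the respective graph, which is the computation a GPU implementation would parallelize.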

CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings

May 02, 2020

Shinji Watanabe, Michael Mandel, Jon Barker, Emmanuel Vincent, Ashish Arora, Xuankai Chang, Sanjeev Khudanpur, Vimal Manohar, Daniel Povey, Desh Raj (+11 more)

Abstract: Following the success of the 1st, 2nd, 3rd, 4th and 5th CHiME challenges, we organize the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6). The new challenge revisits the previous CHiME-5 challenge and further considers the problem of distant multi-microphone conversational speech diarization and recognition in everyday home environments. Speech material is the same as the previous CHiME-5 recordings except for accurate array synchronization. The material was elicited using a dinner party scenario with efforts taken to capture data that is representative of natural conversational speech. This paper provides a baseline description of the CHiME-6 challenge for both segmented multispeaker speech recognition (Track 1) and unsegmented multispeaker speech recognition (Track 2). Of note, Track 2 is the first challenge activity in the community to tackle an unsegmented multispeaker speech recognition scenario with a complete set of reproducible open source baselines providing speech enhancement, speaker diarization, and speech recognition modules.

Espresso: A Fast End-to-end Neural Speech Recognition Toolkit

Oct 15, 2019

Yiming Wang, Tongfei Chen, Hainan Xu, Shuoyang Ding, Hang Lv, Yiwen Shao, Nanyun Peng, Lei Xie, Shinji Watanabe, Sanjeev Khudanpur

Abstract: We present Espresso, an open-source, modular, extensible end-to-end neural automatic speech recognition (ASR) toolkit based on the deep learning library PyTorch and the popular neural machine translation toolkit fairseq. Espresso supports distributed training across GPUs and computing nodes, and features various decoding approaches commonly employed in ASR, including look-ahead word-based language model fusion, for which a fast, parallelized decoder is implemented. Espresso achieves state-of-the-art ASR performance on the WSJ, LibriSpeech, and Switchboard data sets among other end-to-end systems without data augmentation, and is 4-11x faster for decoding than similar systems (e.g. ESPnet).

* Accepted to ASRU 2019

Probing the Information Encoded in X-vectors

Sep 30, 2019

Desh Raj, David Snyder, Daniel Povey, Sanjeev Khudanpur

Abstract: Deep neural network based speaker embeddings, such as x-vectors, have been shown to perform well in text-independent speaker recognition/verification tasks. In this paper, we use simple classifiers to investigate the contents encoded by x-vector embeddings. We probe these embeddings for information related to the speaker, channel, transcription (sentence, words, phones), and meta information about the utterance (duration and augmentation type), and compare these with the information encoded by i-vectors across a varying number of dimensions. We also study the effect of data augmentation during extractor training on the information captured by x-vectors. Experiments on the RedDots data set show that x-vectors capture spoken content and channel-related information, while performing well on speaker verification tasks.

* Accepted at IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) 2019
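
The probing setup described in the abstract, training a simple classifier on fixed embeddings to see whether some property can be predicted from them, can be sketched with a nearest-centroid classifier. The abstract does not say which classifiers were used, so this choice is an assumption, as are the function names, the two-dimensional toy "embeddings", and the "clean"/"noisy" labels.

```python
def fit_centroids(embeddings, labels):
    """Compute the mean embedding for each label."""
    sums, counts = {}, {}
    for vec, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def predict(centroids, vec):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], vec))


def probe_accuracy(centroids, embeddings, labels):
    """Fraction of held-out embeddings whose property is predicted correctly."""
    hits = sum(predict(centroids, v) == y for v, y in zip(embeddings, labels))
    return hits / len(labels)


# Toy 2-D vectors standing in for speaker embeddings (made-up values).
train_x = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
train_y = ["clean", "clean", "noisy", "noisy"]
centroids = fit_centroids(train_x, train_y)
print(probe_accuracy(centroids, [[1.0, 0.0], [4.0, 6.0]], ["clean", "noisy"]))
```

High probe accuracy on held-out data suggests the embeddings encode the property; chance-level accuracy suggests they do not, at least not in a form this classifier can read.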

Low Resource Multi-modal Data Augmentation for End-to-end ASR

Dec 10, 2018

Matthew Wiesner, Adithya Renduchintala, Shinji Watanabe, Chunxi Liu, Najim Dehak, Sanjeev Khudanpur

Abstract: We explore training attention-based encoder-decoder ASR for low-resource languages and present techniques that result in a 50% relative improvement in character error rate compared to a standard baseline. The performance of encoder-decoder ASR systems depends on having sufficient target-side text to train the attention and decoder networks. The lack of such data in low-resource contexts results in severely degraded performance. In this paper we present a data augmentation scheme tailored for low-resource ASR in diverse languages. Across 3 test languages, our approach resulted in a 20% average relative improvement over a baseline text-based augmentation technique. We further compare the performance of our monolingual text-based data augmentation to speech-based data augmentation from nearby languages and find that this gives a further 20-30% relative reduction in character error rate.

* This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
