Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yerbolat Khassanov

A Crowdsourced Open-Source Kazakh Speech Corpus and Initial Speech Recognition Baseline

Sep 22, 2020

Yerbolat Khassanov, Saida Mussakhojayeva, Almas Mirzakhmetov, Alen Adiyev, Mukhamet Nurpeiissov, Huseyin Atakan Varol

Figure 1 for A Crowdsourced Open-Source Kazakh Speech Corpus and Initial Speech Recognition Baseline

Figure 2 for A Crowdsourced Open-Source Kazakh Speech Corpus and Initial Speech Recognition Baseline

Figure 3 for A Crowdsourced Open-Source Kazakh Speech Corpus and Initial Speech Recognition Baseline

Figure 4 for A Crowdsourced Open-Source Kazakh Speech Corpus and Initial Speech Recognition Baseline

Abstract:We present an open-source speech corpus for the Kazakh language. The Kazakh speech corpus (KSC) contains around 335 hours of transcribed audio comprising over 154,000 utterances spoken by participants from different regions, age groups, and gender. It was carefully inspected by native Kazakh speakers to ensure high quality. The KSC is the largest publicly available database developed to advance various Kazakh speech and language processing applications. In this paper, we first describe the data collection and prepossessing procedures followed by the description of the database specifications. We also share our experience and challenges faced during database construction. To demonstrate the reliability of the database, we performed the preliminary speech recognition experiments. The experimental results imply that the quality of audio and transcripts are promising. To enable experiment reproducibility and ease the corpus usage, we also released the ESPnet recipe.

* 8 pages, 2 figures, 3 tables

Via

Access Paper or Ask Questions

Leveraging Text Data Using Hybrid Transformer-LSTM Based End-to-End ASR in Transfer Learning

May 28, 2020

Zhiping Zeng, Van Tung Pham, Haihua Xu, Yerbolat Khassanov, Eng Siong Chng, Chongjia Ni, Bin Ma

Figure 1 for Leveraging Text Data Using Hybrid Transformer-LSTM Based End-to-End ASR in Transfer Learning

Figure 2 for Leveraging Text Data Using Hybrid Transformer-LSTM Based End-to-End ASR in Transfer Learning

Figure 3 for Leveraging Text Data Using Hybrid Transformer-LSTM Based End-to-End ASR in Transfer Learning

Figure 4 for Leveraging Text Data Using Hybrid Transformer-LSTM Based End-to-End ASR in Transfer Learning

Abstract:In this work, we study leveraging extra text data to improve low-resource end-to-end ASR under cross-lingual transfer learning setting. To this end, we extend our prior work [1], and propose a hybrid Transformer-LSTM based architecture. This architecture not only takes advantage of the highly effective encoding capacity of the Transformer network but also benefits from extra text data due to the LSTM-based independent language model network. We conduct experiments on our in-house Malay corpus which contains limited labeled data and a large amount of extra text. Results show that the proposed architecture outperforms the previous LSTM-based architecture [1] by 24.2% relative word error rate (WER) when both are trained using limited labeled data. Starting from this, we obtain further 25.4% relative WER reduction by transfer learning from another resource-rich language. Moreover, we obtain additional 13.6% relative WER reduction by boosting the LSTM decoder of the transferred model with the extra text data. Overall, our best model outperforms the vanilla Transformer ASR by 11.9% relative WER. Last but not least, the proposed hybrid architecture offers much faster inference compared to both LSTM and Transformer architectures.

Via

Access Paper or Ask Questions

Approaches to Improving Recognition of Underrepresented Named Entities in Hybrid ASR Systems

May 18, 2020

Tingzhi Mao, Yerbolat Khassanov, Van Tung Pham, Haihua Xu, Hao Huang, Eng Siong Chng

Figure 1 for Approaches to Improving Recognition of Underrepresented Named Entities in Hybrid ASR Systems

Figure 2 for Approaches to Improving Recognition of Underrepresented Named Entities in Hybrid ASR Systems

Figure 3 for Approaches to Improving Recognition of Underrepresented Named Entities in Hybrid ASR Systems

Abstract:In this paper, we present a series of complementary approaches to improve the recognition of underrepresented named entities (NE) in hybrid ASR systems without compromising overall word error rate performance. The underrepresented words correspond to rare or out-of-vocabulary (OOV) words in the training data, and thereby can't be modeled reliably. We begin with graphemic lexicon which allows to drop the necessity of phonetic models in hybrid ASR. We study it under different settings and demonstrate its effectiveness in dealing with underrepresented NEs. Next, we study the impact of neural language model (LM) with letter-based features derived to handle infrequent words. After that, we attempt to enrich representations of underrepresented NEs in pretrained neural LM by borrowing the embedding representations of rich-represented words. This let us gain significant performance improvement on underrepresented NE recognition. Finally, we boost the likelihood scores of utterances containing NEs in the word lattices rescored by neural LMs and gain further performance improvement. The combination of the aforementioned approaches improves NE recognition by up to 42% relatively.

Via

Access Paper or Ask Questions

Independent language modeling architecture for end-to-end ASR

Nov 25, 2019

Van Tung Pham, Haihua Xu, Yerbolat Khassanov, Zhiping Zeng, Eng Siong Chng, Chongjia Ni, Bin Ma, Haizhou Li

Figure 1 for Independent language modeling architecture for end-to-end ASR

Figure 2 for Independent language modeling architecture for end-to-end ASR

Figure 3 for Independent language modeling architecture for end-to-end ASR

Figure 4 for Independent language modeling architecture for end-to-end ASR

Abstract:The attention-based end-to-end (E2E) automatic speech recognition (ASR) architecture allows for joint optimization of acoustic and language models within a single network. However, in a vanilla E2E ASR architecture, the decoder sub-network (subnet), which incorporates the role of the language model (LM), is conditioned on the encoder output. This means that the acoustic encoder and the language model are entangled that doesn't allow language model to be trained separately from external text data. To address this problem, in this work, we propose a new architecture that separates the decoder subnet from the encoder output. In this way, the decoupled subnet becomes an independently trainable LM subnet, which can easily be updated using the external text data. We study two strategies for updating the new architecture. Experimental results show that, 1) the independent LM architecture benefits from external text data, achieving 9.3% and 22.8% relative character and word error rate reduction on Mandarin HKUST and English NSC datasets respectively; 2)the proposed architecture works well with external LM and can be generalized to different amount of labelled data.

Via

Access Paper or Ask Questions

Constrained Output Embeddings for End-to-End Code-Switching Speech Recognition with Only Monolingual Data

Apr 08, 2019

Yerbolat Khassanov, Haihua Xu, Van Tung Pham, Zhiping Zeng, Eng Siong Chng, Chongjia Ni, Bin Ma

Figure 1 for Constrained Output Embeddings for End-to-End Code-Switching Speech Recognition with Only Monolingual Data

Figure 2 for Constrained Output Embeddings for End-to-End Code-Switching Speech Recognition with Only Monolingual Data

Figure 3 for Constrained Output Embeddings for End-to-End Code-Switching Speech Recognition with Only Monolingual Data

Figure 4 for Constrained Output Embeddings for End-to-End Code-Switching Speech Recognition with Only Monolingual Data

Abstract:The lack of code-switch training data is one of the major concerns in the development of end-to-end code-switching automatic speech recognition (ASR) models. In this work, we propose a method to train an improved end-to-end code-switching ASR using only monolingual data. Our method encourages the distributions of output token embeddings of monolingual languages to be similar, and hence, promotes the ASR model to easily code-switch between languages. Specifically, we propose to use Jensen-Shannon divergence and cosine distance based constraints. The former will enforce output embeddings of monolingual languages to possess similar distributions, while the later simply brings the centroids of two distributions to be close to each other. Experimental results demonstrate high effectiveness of the proposed method, yielding up to 4.5% absolute mixed error rate improvement on Mandarin-English code-switching ASR task.

* 5 pages, 3 figures, submitted to INTERSPEECH 2019

Via

Access Paper or Ask Questions

Enriching Rare Word Representations in Neural Language Models by Embedding Matrix Augmentation

Apr 08, 2019

Yerbolat Khassanov, Zhiping Zeng, Van Tung Pham, Haihua Xu, Eng Siong Chng

Figure 1 for Enriching Rare Word Representations in Neural Language Models by Embedding Matrix Augmentation

Figure 2 for Enriching Rare Word Representations in Neural Language Models by Embedding Matrix Augmentation

Figure 3 for Enriching Rare Word Representations in Neural Language Models by Embedding Matrix Augmentation

Figure 4 for Enriching Rare Word Representations in Neural Language Models by Embedding Matrix Augmentation

Abstract:The neural language models (NLM) achieve strong generalization capability by learning the dense representation of words and using them to estimate probability distribution function. However, learning the representation of rare words is a challenging problem causing the NLM to produce unreliable probability estimates. To address this problem, we propose a method to enrich representations of rare words in pre-trained NLM and consequently improve its probability estimation performance. The proposed method augments the word embedding matrices of pre-trained NLM while keeping other parameters unchanged. Specifically, our method updates the embedding vectors of rare words using embedding vectors of other semantically and syntactically similar words. To evaluate the proposed method, we enrich the rare street names in the pre-trained NLM and use it to rescore 100-best hypotheses output from the Singapore English speech recognition system. The enriched NLM reduces the word error rate by 6% relative and improves the recognition accuracy of the rare words by 16% absolute as compared to the baseline NLM.

* 5 pages, 2 figures, submitted to INTERSPEECH 2019

Via

Access Paper or Ask Questions

On the End-to-End Solution to Mandarin-English Code-switching Speech Recognition

Nov 01, 2018

Zhiping Zeng, Yerbolat Khassanov, Van Tung Pham, Haihua Xu, Eng Siong Chng, Haizhou Li

Figure 1 for On the End-to-End Solution to Mandarin-English Code-switching Speech Recognition

Figure 2 for On the End-to-End Solution to Mandarin-English Code-switching Speech Recognition

Figure 3 for On the End-to-End Solution to Mandarin-English Code-switching Speech Recognition

Figure 4 for On the End-to-End Solution to Mandarin-English Code-switching Speech Recognition

Abstract:Code-switching (CS) refers to a linguistic phenomenon where a speaker uses different languages in an utterance or between alternating utterances. In this work, we study end-to-end (E2E) approaches to the Mandarin-English code-switching speech recognition (CSSR) task. We first examine the effectiveness of using data augmentation and byte-pair encoding (BPE) subword units. More importantly, we propose a multitask learning recipe, where a language identification task is explicitly learned in addition to the E2E speech recognition task. Furthermore, we introduce an efficient word vocabulary expansion method for language modeling to alleviate data sparsity issues under the code-switching scenario. Experimental results on the SEAME data, a Mandarin-English CS corpus, demonstrate the effectiveness of the proposed methods.

* Submitted to ICASSP 2019

Via

Access Paper or Ask Questions

Unsupervised and Efficient Vocabulary Expansion for Recurrent Neural Network Language Models in ASR

Jun 27, 2018

Yerbolat Khassanov, Eng Siong Chng

Figure 1 for Unsupervised and Efficient Vocabulary Expansion for Recurrent Neural Network Language Models in ASR

Figure 2 for Unsupervised and Efficient Vocabulary Expansion for Recurrent Neural Network Language Models in ASR

Figure 3 for Unsupervised and Efficient Vocabulary Expansion for Recurrent Neural Network Language Models in ASR

Abstract:In automatic speech recognition (ASR) systems, recurrent neural network language models (RNNLM) are used to rescore a word lattice or N-best hypotheses list. Due to the expensive training, the RNNLM's vocabulary set accommodates only small shortlist of most frequent words. This leads to suboptimal performance if an input speech contains many out-of-shortlist (OOS) words. An effective solution is to increase the shortlist size and retrain the entire network which is highly inefficient. Therefore, we propose an efficient method to expand the shortlist set of a pretrained RNNLM without incurring expensive retraining and using additional training data. Our method exploits the structure of RNNLM which can be decoupled into three parts: input projection layer, middle layers, and output projection layer. Specifically, our method expands the word embedding matrices in projection layers and keeps the middle layers unchanged. In this approach, the functionality of the pretrained RNNLM will be correctly maintained as long as OOS words are properly modeled in two embedding spaces. We propose to model the OOS words by borrowing linguistic knowledge from appropriate in-shortlist words. Additionally, we propose to generate the list of OOS words to expand vocabulary in unsupervised manner by automatically extracting them from ASR output.

* 5 pages, 1 figure, accepted at INTERSPEECH 2018

Via

Access Paper or Ask Questions