Santosh Kesiraju

Strategies for improving low resource speech to text translation relying on pre-trained ASR models

May 31, 2023
Santosh Kesiraju, Marek Sarvas, Tomas Pavlicek, Cecile Macaire, Alejandro Ciuba

This paper presents techniques and findings for improving the performance of low-resource speech-to-text translation (ST). We conducted experiments on both simulated and real low-resource setups, on the English-Portuguese and Tamasheq-French language pairs, respectively. Using the encoder-decoder framework for ST, our results show that a multilingual automatic speech recognition system provides a good initialization under low-resource scenarios. Furthermore, using CTC as an additional objective for translation during training and decoding helps to reorder the internal representations and improves the final translation. Through our experiments, we try to identify the factors (initializations, objectives, and hyper-parameters) that contribute most to the improvements in low-resource setups. With only 300 hours of pre-training data, our model achieved a BLEU score of 7.3 on the Tamasheq-French data, outperforming prior published work from IWSLT 2022 by 1.6 points.
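
A minimal sketch (not the authors' code) of the kind of joint objective the abstract describes: an attention-based cross-entropy loss combined with an auxiliary CTC loss computed on the encoder outputs against the target-language tokens. Shapes, names, and the weighting scheme are illustrative assumptions.

```python
# Hedged sketch: joint ST loss = attention cross-entropy + auxiliary CTC.
import torch
import torch.nn.functional as F

def st_multitask_loss(enc_logits, enc_lens, dec_logits, tgt_tokens,
                      tgt_lens, blank_id=0, pad_id=1, ctc_weight=0.3):
    """enc_logits: (T, B, V) per-frame encoder projections for CTC
       dec_logits: (B, L, V) decoder outputs for cross-entropy
       tgt_tokens: (B, L) target-translation token ids"""
    # Standard attention-decoder loss, ignoring padding positions.
    ce = F.cross_entropy(dec_logits.transpose(1, 2), tgt_tokens,
                         ignore_index=pad_id)
    # Auxiliary CTC on the encoder: pushes the encoder toward
    # target-ordered representations, which aids reordering.
    log_probs = F.log_softmax(enc_logits, dim=-1)
    ctc = F.ctc_loss(log_probs, tgt_tokens, enc_lens, tgt_lens,
                     blank=blank_id, zero_infinity=True)
    return (1.0 - ctc_weight) * ce + ctc_weight * ctc
```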

Detecting English Speech in the Air Traffic Control Voice Communication

Apr 06, 2021
Igor Szoke, Santosh Kesiraju, Ondrej Novotny, Martin Kocour, Karel Vesely, Jan "Honza" Cernocky

We launched a community platform for collecting air traffic control (ATC) speech worldwide within the ATCO2 project. Filtering out unseen non-English speech is one of the main components of the data processing pipeline. The proposed English Language Detection (ELD) system is based on embeddings from a Bayesian subspace multinomial model trained on word confusion networks from an ASR system. It is robust, easy to train, and lightweight. We achieved an equal error rate (EER) of 0.0439 in the in-domain scenario, a 50% relative reduction compared to a state-of-the-art acoustic ELD system based on x-vectors. Further, we achieved an EER of 0.1352 in the unseen-language (out-of-domain) condition, a 33% relative reduction compared to the acoustic ELD. We plan to publish the evaluation dataset from the ATCO2 project.
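
For reference, a hedged sketch of the EER metric the abstract reports, together with a simple stand-in detector over pre-computed document embeddings. The Bayesian SMM embedding extraction is assumed to happen upstream; the logistic-regression stand-in and all names are illustrative, not the paper's system.

```python
# Hedged sketch: stand-in English/non-English detector over document
# embeddings, plus the equal-error-rate (EER) metric.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

def train_eld(embeddings, labels):
    # embeddings: (N, D); labels: 1 = English, 0 = non-English
    return LogisticRegression(max_iter=1000).fit(embeddings, labels)

def equal_error_rate(labels, scores):
    # EER: operating point where the false-positive rate equals
    # the false-negative rate on the ROC curve.
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2.0
```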

Rethinking the objectives of extractive question answering

Aug 28, 2020
Martin Fajcik, Josef Jon, Santosh Kesiraju, Pavel Smrz

Figure 1 for Rethinking the objectives of extractive question answering
Figure 2 for Rethinking the objectives of extractive question answering
Figure 3 for Rethinking the objectives of extractive question answering
Figure 4 for Rethinking the objectives of extractive question answering

This paper describes two generally applicable approaches for significantly improving the performance of state-of-the-art extractive question answering (EQA) systems. First, contrary to common belief, we demonstrate that using an objective with the independence assumption for the span probability, $P(a_s,a_e) = P(a_s)P(a_e)$ for a span starting at position $a_s$ and ending at position $a_e$, may have adverse effects. We therefore propose a new compound objective that models the joint probability $P(a_s,a_e)$ directly, while keeping the objective with the independence assumption as an auxiliary objective. Our second approach shows the beneficial effect of the distantly semi-supervised shared-normalization objective of Clark and Gardner (2017). We show that normalizing over a set of documents similar to the golden passage, and marginalizing over all ground-truth answer string positions, improves the results of smaller statistical models. Our results are supported by experiments with three QA models (BidAF, BERT, ALBERT) over six datasets. The proposed approaches do not use any additional data. Our code, analysis, pretrained models, and individual results will be available online.
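
To make the contrast concrete, here is a small illustrative sketch of the two normalizations: the independent objective factorizes $P(a_s,a_e)$ into separate start and end softmaxes, while a compound objective normalizes a score for every (start, end) pair jointly. Shapes and names are assumptions, not the paper's implementation.

```python
# Hedged sketch: independent vs. joint (compound) span normalization.
import torch
import torch.nn.functional as F

def independent_span_log_probs(start_logits, end_logits):
    # (B, L) and (B, L) -> (B, L, L): log P(a_s) + log P(a_e),
    # i.e. the factorized objective P(a_s, a_e) = P(a_s) P(a_e).
    ls = F.log_softmax(start_logits, dim=-1)
    le = F.log_softmax(end_logits, dim=-1)
    return ls.unsqueeze(2) + le.unsqueeze(1)

def compound_span_log_probs(span_scores):
    # span_scores: (B, L, L) raw scores for every (start, end) pair;
    # normalize over all pairs at once, modelling P(a_s, a_e) directly.
    B, L, _ = span_scores.shape
    return F.log_softmax(span_scores.reshape(B, L * L), dim=-1).reshape(B, L, L)
```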

* Preprint version 

Bayesian multilingual topic model for zero-shot cross-lingual topic identification

Jul 02, 2020
Santosh Kesiraju, Sangeet Sagar, Ondřej Glembek, Lukáš Burget, Suryakanth V Gangashetty

This paper presents a Bayesian multilingual topic model for learning language-independent document embeddings. Our model learns to represent documents as Gaussian distributions, thereby encoding the uncertainty of the estimates in their covariances. We propagate the learned uncertainties through linear classifiers for zero-shot cross-lingual topic identification. Our experiments on the five-language Europarl and Reuters (MLDoc) corpora show that the proposed model outperforms multilingual word embedding and BiLSTM sentence encoder based systems by significant margins in the majority of transfer directions. Moreover, our system, trained in under a day on a single GPU with much less data, performs competitively with the state-of-the-art universal BiLSTM sentence encoder trained on 93 languages. Our experimental analysis shows that the amount of parallel data improves the overall quality of the embeddings; nonetheless, exploiting the uncertainties is always beneficial.
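
The key mechanism, propagating a Gaussian embedding through a linear classifier, has a closed form: if x ~ N(mu, Sigma) and y = Wx + b, then y ~ N(W mu + b, W Sigma W^T). A minimal numeric sketch with illustrative shapes follows; it shows the identity, not the authors' classifier.

```python
# Hedged sketch: a Gaussian document embedding N(mu, Sigma) pushed
# through a linear layer stays Gaussian, so the classifier can use the
# embedding uncertainty instead of discarding it.
import numpy as np

def propagate_gaussian(W, b, mu, Sigma):
    mean = W @ mu + b        # (C,)  = W mu + b
    cov = W @ Sigma @ W.T    # (C, C) = W Sigma W^T
    return mean, cov

# Toy usage with a diagonal embedding covariance.
rng = np.random.default_rng(0)
W, b = rng.standard_normal((3, 5)), np.zeros(3)
mu = rng.standard_normal(5)
Sigma = np.diag(rng.random(5))
mean, cov = propagate_gaussian(W, b, mu, Sigma)
```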

* 10 pages, 5 figures 

Learning document embeddings along with their uncertainties

Aug 29, 2019
Santosh Kesiraju, Oldřich Plchot, Lukáš Burget, Suryakanth V Gangashetty

The majority of text modelling techniques yield only point estimates of document embeddings and fail to capture the uncertainty of those estimates. These uncertainties give a notion of how well the embeddings represent a document. We present the Bayesian subspace multinomial model (Bayesian SMM), a generative log-linear model that learns to represent documents as Gaussian distributions, thereby encoding the uncertainty in their covariances. Additionally, in the proposed Bayesian SMM, we address a commonly encountered intractability problem that arises during variational inference in mixed-logit models. We also present a generative Gaussian linear classifier for topic identification that exploits the uncertainty in document embeddings. Our intrinsic evaluation using perplexity shows that the proposed Bayesian SMM fits the data better than a variational auto-encoder based document model. Our topic identification experiments on speech (Fisher) and text (20Newsgroups) corpora show that the proposed Bayesian SMM is robust to over-fitting on unseen test data. The topic ID results show that the proposed model is significantly better than variational auto-encoder based methods and achieves results comparable to fully supervised discriminative models.
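
For orientation, the underlying (non-Bayesian) subspace multinomial model scores a document's word counts under a log-linear distribution softmax(m + Tw), where w is the document embedding; the Bayesian variant places a Gaussian posterior over w. A hedged sketch of that log-likelihood, with illustrative names:

```python
# Hedged sketch: document log-likelihood of a subspace multinomial
# model, p(words | w) = Multinomial(softmax(m + T @ w)).
import numpy as np
from scipy.special import logsumexp

def smm_log_likelihood(counts, m, T, w):
    """counts: (V,) word counts of one document
       m:      (V,) universal background offsets
       T:      (V, K) subspace basis
       w:      (K,) document embedding"""
    logits = m + T @ w
    log_probs = logits - logsumexp(logits)  # numerically stable log-softmax
    return float(counts @ log_probs)
```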

An Empirical Evaluation of Zero Resource Acoustic Unit Discovery

Feb 05, 2017
Chunxi Liu, Jinyi Yang, Ming Sun, Santosh Kesiraju, Alena Rott, Lucas Ondel, Pegah Ghahremani, Najim Dehak, Lukas Burget, Sanjeev Khudanpur

Acoustic unit discovery (AUD) is the process of automatically identifying a categorical acoustic unit inventory from speech and producing the corresponding acoustic unit tokenizations. AUD provides an important avenue for unsupervised acoustic model training in a zero-resource setting, where expert-provided linguistic knowledge and transcribed speech are unavailable. To further facilitate the zero-resource AUD process, in this paper we demonstrate that acoustic feature representations can be significantly improved by (i) performing linear discriminant analysis (LDA) in an unsupervised, self-trained fashion, and (ii) leveraging resources from other languages by building a multilingual bottleneck (BN) feature extractor for effective cross-lingual generalization. Moreover, we perform comprehensive evaluations of AUD efficacy on multiple downstream speech applications; their correlated performance suggests that AUD can be evaluated with whichever alternative language resources are available, since typical zero-resource applications have access to only a subset of such evaluation resources.
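
A hedged sketch of the self-trained LDA idea in (i), under the assumption that frames are pseudo-labelled with the AUD model's own unit assignments and LDA is then fit on those pseudo-labels to obtain a discriminative projection without human transcripts. Function names and dimensions are illustrative, not the paper's recipe.

```python
# Hedged sketch: unsupervised, self-trained LDA for acoustic features.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def self_trained_lda(features, aud_unit_labels, out_dim=40):
    """features:        (N, D) frame-level acoustic features
       aud_unit_labels: (N,)   unit ids from a first AUD pass
       Note: out_dim must be <= number of units - 1 for LDA."""
    lda = LinearDiscriminantAnalysis(n_components=out_dim)
    lda.fit(features, aud_unit_labels)  # pseudo-labels, no transcripts
    return lda.transform(features)      # (N, out_dim) projected features
```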

* 5 pages, 1 figure; Accepted for publication at ICASSP 2017 