In this work, we seek to build effective code-switched (CS) automatic speech recognition (ASR) systems under the zero-shot setting where no transcribed CS speech data is available for training. Previously proposed frameworks that conditionally factorize the bilingual task into its constituent monolingual parts are a promising starting point for leveraging monolingual data efficiently. However, these methods require the monolingual modules to perform language segmentation. That is, each monolingual module has to simultaneously detect CS points and transcribe speech segments of one language while ignoring those of other languages -- not a trivial task. We propose to simplify each monolingual module by allowing it to transcribe all speech segments indiscriminately with a monolingual script (i.e., transliteration). This simple modification passes the responsibility of CS point detection to subsequent bilingual modules, which determine the final output by considering multiple monolingual transliterations along with external language model information. We apply this transliteration-based approach in an end-to-end differentiable neural network and demonstrate its efficacy for zero-shot CS ASR on Mandarin-English SEAME test sets.
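To make the division of labour concrete, here is a minimal, hypothetical sketch of the bilingual combination step, assuming frame-synchronous log-posteriors from each monolingual module; all function and argument names are our own, and the actual system makes this decision inside an end-to-end differentiable network rather than via a hard per-step comparison.

```python
import numpy as np

def combine_transliterations(man_lp, eng_lp, lm=None, lm_weight=0.3):
    """Frame-synchronous combination of two monolingual transliterations.

    man_lp, eng_lp: [T, V_man] and [T, V_eng] log-posteriors over each
    monolingual script for the same T steps. Both modules transcribe all
    speech indiscriminately, so neither has to detect CS points itself.
    `lm` is a hypothetical external language model with a .score() method.
    """
    output = []
    for t in range(man_lp.shape[0]):
        score_man = float(man_lp[t].max())
        score_eng = float(eng_lp[t].max())
        if lm is not None:  # let external LM information sway the decision
            score_man += lm_weight * lm.score("man", output)
            score_eng += lm_weight * lm.score("eng", output)
        # The bilingual module, not the monolingual ones, picks the language.
        if score_man >= score_eng:
            output.append(("man", int(man_lp[t].argmax())))
        else:
            output.append(("eng", int(eng_lp[t].argmax())))
    return output
```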
Training state-of-the-art ASR systems such as RNN-T often has a high associated financial and environmental cost. Training on a subset of the data could mitigate this cost if the selected subset achieves performance on par with training on the entire dataset. Although there are many data subset selection (DSS) algorithms, applying them directly to RNN-T is difficult, especially for adaptive DSS algorithms that use learning dynamics such as gradients, since RNN-T gradients have a significantly larger memory footprint. In this paper, we propose Partitioned Gradient Matching (PGM), a novel distributable DSS algorithm suitable for massive datasets like those used to train RNN-T. Through extensive experiments on Librispeech 100H and Librispeech 960H, we show that PGM achieves a 3x to 6x speedup with only a very small accuracy degradation (under 1% absolute WER difference). In addition, we demonstrate similar results for PGM even in settings where the training data is corrupted with noise.
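The following toy sketch illustrates the general partition-and-match idea, assuming per-example gradient features are already available (in practice, cheaper last-layer gradient approximations are typical); the paper's exact matching objective and distribution mechanics may differ.

```python
import numpy as np

def pgm_select(grads, budget, num_parts):
    # grads: [N, D] per-example gradient features (e.g. last-layer only,
    # which keeps the memory footprint manageable for RNN-T).
    parts = np.array_split(np.arange(len(grads)), num_parts)
    chosen = []
    for idx in parts:  # each partition can run on a separate worker
        target = grads[idx].sum(axis=0)       # full-gradient sum to match
        residual = target.copy()
        picked = []
        for _ in range(budget // num_parts):  # greedy matching per partition
            scores = grads[idx] @ residual
            scores[picked] = -np.inf          # never re-pick an example
            j = int(scores.argmax())
            picked.append(j)
            residual -= grads[idx[j]]
        chosen.extend(idx[picked].tolist())
    return chosen
```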
Domain-specific neural machine translation (NMT) systems (e.g., in educational applications) are socially significant, with the potential to help make information accessible to a diverse set of users in multilingual societies. It is desirable that such NMT systems be lexically constrained and draw from domain-specific dictionaries. Dictionaries could present multiple candidate translations for a source word/phrase on account of the polysemous nature of words. The onus is then on the NMT model to choose the contextually most appropriate candidate. Prior work has largely ignored this problem and focused on the single-candidate setting, where the target word or phrase is replaced by a single constraint. In this work, we present DICTDIS, a lexically constrained NMT system that disambiguates between multiple candidate translations derived from dictionaries. We achieve this by augmenting training data with multiple dictionary candidates to actively encourage disambiguation during training. We demonstrate the utility of DICTDIS via extensive experiments on English-Hindi sentences in a variety of domains including news, finance, medicine and engineering. We obtain superior disambiguation performance on all domains, with fluency improvements of up to 4 BLEU points in some domains, when compared with existing approaches for lexically constrained and unconstrained NMT.
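As a rough illustration of the augmentation idea, one plausible input format appends every dictionary candidate for each matched source word, so the model must disambiguate in context rather than copy a single constraint; the separator tokens and matching scheme below are our own assumptions, not necessarily the exact format used by DICTDIS.

```python
def augment_with_candidates(src_tokens, dictionary, sep="<sep>", cand_sep="<c>"):
    # Append, for each source word with a dictionary entry, *all* of its
    # candidate translations; polysemous words thus carry several options.
    constraints = []
    for tok in src_tokens:
        cands = dictionary.get(tok.lower())
        if cands:
            constraints.append(
                tok + " " + cand_sep + " " + (" " + cand_sep + " ").join(cands))
    if not constraints:
        return " ".join(src_tokens)
    return " ".join(src_tokens) + " " + sep + " " + (" " + sep + " ").join(constraints)

# Example: "interest" is polysemous (financial vs. general sense).
augmented = augment_with_candidates(
    "the bank raised interest rates".split(),
    {"interest": ["ब्याज", "रुचि"]})
```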
Online alignment in machine translation refers to the task of aligning a target word to a source word when the target sequence has only been partially decoded. Good online alignments facilitate important applications such as lexically constrained translation, where user-defined dictionaries are used to inject lexical constraints into the translation model. We propose a novel posterior alignment technique that is truly online in its execution and superior to existing methods in terms of alignment error rate. Our proposed inference technique jointly considers alignment and token probabilities in a principled manner and can be seamlessly integrated within existing constrained beam-search decoding algorithms. On five language pairs, including two distant language pairs, we achieve a consistent drop in alignment error rate. When deployed on seven lexically constrained translation tasks, we achieve significant improvements in BLEU, specifically around the constrained positions.
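A minimal sketch of one plausible reading of the joint alignment-and-token computation follows: the alignment posterior at a decoding step reweights the prior (attention-based) alignment by how well each source position explains the token actually emitted. The names are hypothetical, and the paper's inference and its integration into constrained beam search are more involved.

```python
import numpy as np

def posterior_alignment(prior_attn, token_logprobs, y_t):
    # prior_attn: [S] prior alignment (attention) over source positions at
    # the current decoding step; token_logprobs: [S, V] token log-probs
    # conditioned on aligning to each source position; y_t: emitted token id.
    # Posterior over alignments after observing y_t:
    #   p(a = i | y_t)  ∝  p(a = i) * p(y_t | a = i)
    joint = np.log(prior_attn + 1e-12) + token_logprobs[:, y_t]
    post = np.exp(joint - joint.max())   # numerically stable normalization
    return post / post.sum()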
We focus on the audio-visual video parsing (AVVP) problem, which involves detecting audio and visual event labels with temporal boundaries. The task is especially challenging since it is weakly supervised, with only event labels available as a bag of labels for each video. An existing state-of-the-art model for AVVP uses a hybrid attention network (HAN) to generate cross-modal features for both audio and visual modalities, and an attentive pooling module that aggregates predicted audio and visual segment-level event probabilities to yield video-level event probabilities. We provide a detailed analysis of modality bias in the existing HAN architecture, whereby one modality is completely ignored during prediction. We also propose a variant of feature aggregation in HAN that leads to absolute gains in F-score of about 2% and 1.6% for visual and audio-visual events, at both the segment level and the event level, in comparison to the existing HAN model.
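For intuition, here is a minimal sketch of the attentive pooling step that turns segment-level event probabilities into video-level probabilities, in the style of multiple-instance learning pooling commonly used for AVVP; the tensor names are our own, and the proposed HAN variant differs in how the features feeding these scores are aggregated.

```python
import torch

def attentive_pool(seg_probs, att_logits):
    # seg_probs: [T, C] per-segment event probabilities for one modality;
    # att_logits: [T, C] attention scores over the T segments.
    att = torch.softmax(att_logits, dim=0)   # normalize over segments
    return (att * seg_probs).sum(dim=0)      # [C] video-level probabilities
```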
RNN-Transducer (RNN-T) models have become synonymous with streaming end-to-end ASR systems. While they perform competitively on a number of evaluation categories, rare words pose a serious challenge to RNN-T models. One main reason for the degradation in performance on rare words is that the language model (LM) internal to RNN-Ts can become overconfident and lead to hallucinated predictions that are acoustically inconsistent with the underlying speech. To address this issue, we propose a lightweight adaptive LM discounting technique, AdaptLMD, that can be used with any RNN-T architecture without requiring any external resources or additional parameters. AdaptLMD uses a two-pronged approach: 1) randomly mask the prediction network output to encourage the RNN-T not to be overly reliant on its outputs; 2) dynamically choose when to discount the implicit LM (ILM) based on the rarity of recently predicted tokens and the divergence between ILM and implicit acoustic model (IAM) scores. Comparing AdaptLMD to a competitive RNN-T baseline, we obtain up to 4% and 14% relative reductions in overall WER and rare-word PER, respectively, on a conversational, code-mixed Hindi-English ASR task.
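The two prongs can be pictured with the short sketch below, assuming standard RNN-T tensors; the masking granularity, discount schedule, and rarity test here are illustrative stand-ins rather than the paper's exact formulation.

```python
import torch

def mask_pred_net(pred_out, p_mask=0.1):
    # 1) Randomly zero prediction-network outputs so the joiner cannot
    #    lean on the internal LM alone (hypothetical masking scheme).
    keep = (torch.rand(pred_out.shape[0], 1) > p_mask).float()
    return pred_out * keep

def discounted_logits(joint_logits, ilm_logits, recent_tokens_rare, lam=0.3):
    # 2) Subtract a scaled ILM score only when recently emitted tokens are
    #    rare, i.e. exactly when the internal LM is likeliest to hallucinate.
    return joint_logits - (lam if recent_tokens_rare else 0.0) * ilm_logits
```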
Post-editing in Automatic Speech Recognition (ASR) entails automatically correcting common and systematic errors produced by the ASR system. The outputs of an ASR system are largely prone to phonetic and spelling errors. In this paper, we propose to use a powerful pre-trained sequence-to-sequence model, BART, further adaptively trained to serve as a denoising model, to correct such errors. The adaptive training is performed on an augmented dataset obtained by synthetically inducing errors as well as by incorporating actual errors from an existing ASR system. We also propose a simple approach to rescore the outputs using word-level alignments. Experimental results on accented speech data demonstrate that our strategy effectively rectifies a significant number of ASR errors and produces improved WER results when compared against a competitive baseline.
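A toy sketch of the synthetic error-induction side of the augmentation is given below: clean transcripts are corrupted with phonetically confusable substitutions and occasional deletions, yielding (noisy, clean) pairs for adaptively training the denoiser. The `homophones` map and corruption rates are our own illustrative assumptions.

```python
import random

def induce_errors(words, homophones, p=0.15, rng=random.Random(0)):
    # Corrupt a clean transcript with ASR-style phonetic and drop errors.
    noisy = []
    for w in words:
        r = rng.random()
        if r < p and w in homophones:
            noisy.append(rng.choice(homophones[w]))  # phonetic substitution
        elif r < p + 0.05:
            continue                                  # word deletion
        else:
            noisy.append(w)
    return noisy

clean = "please write to the right address".split()
noisy = induce_errors(clean, {"write": ["right", "rite"]})
# (noisy, clean) then forms one training pair for the denoising model.
```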
We study the task of personalizing ASR models to a target non-native speaker/accent while being constrained by a transcription budget on the duration of utterances selected from a large unlabelled corpus. We propose a subset selection approach using the recently proposed submodular mutual information (SMI) functions, in which we identify a diverse set of utterances that match the target speaker/accent. The target is specified through a few target utterances, and the match is achieved by modeling the relationship between the target set and the selected subset using SMI functions. This method is applied at both the speaker and accent levels. We personalize the model by fine-tuning it with utterances selected and transcribed from the unlabelled corpus. Our method is able to consistently identify utterances from the target speaker/accent using just speech features. We show that the targeted subset selection approach improves upon random sampling by as much as 2% to 5% (absolute), depending on the speaker and accent, and is 2x to 4x more label-efficient compared to random sampling. We also compare against a skyline that picks specifically from the target, and our method generally outperforms this oracle in its selections.
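The selection step can be sketched with a greedy maximizer of a graph-cut-style SMI objective: reward similarity to the few target utterances while penalizing redundancy with what is already picked. The feature choice (e.g. speaker embeddings) and the exact SMI instantiation in the paper may differ from this illustration.

```python
import numpy as np

def smi_greedy(feats, target_feats, budget, lam=1.0):
    # feats: [N, D] speech features for the unlabelled corpus;
    # target_feats: [K, D] features of the few target utterances.
    sim_t = feats @ target_feats.T   # [N, K] similarity to the target set
    sim = feats @ feats.T            # [N, N] pairwise similarity
    gain_t = sim_t.sum(axis=1)       # relevance to the target
    selected = []
    for _ in range(budget):
        red = sim[:, selected].sum(axis=1) if selected else np.zeros(len(feats))
        scores = gain_t - lam * red  # relevance minus redundancy
        scores[selected] = -np.inf   # never re-pick an utterance
        selected.append(int(scores.argmax()))
    return selected
```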
While recent benchmarks have spurred a lot of new work on improving the generalization of pretrained multilingual language models on multilingual tasks, techniques to improve performance on code-switched natural language understanding tasks have been far less explored. In this work, we propose the use of bilingual intermediate pretraining as a reliable technique to derive large and consistent performance gains on three different NLP tasks using code-switched text. We achieve substantial absolute improvements of 7.87%, 20.15%, and 10.99% in mean accuracy and F1 scores over previous state-of-the-art systems on Hindi-English Natural Language Inference (NLI) and Question Answering (QA), and Spanish-English Sentiment Analysis (SA), respectively. We show consistent performance gains on four different code-switched language pairs (Hindi-English, Spanish-English, Tamil-English and Malayalam-English) for SA. We also present a code-switched masked language modelling (MLM) pretraining technique that consistently benefits SA compared to standard MLM pretraining using real code-switched text.
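For illustration only, here is one plausible variant of code-switched MLM masking, in which masking is biased toward switch points, where bilingual context is most informative; the per-token language tags and the switch-point bias are our own assumptions, and the paper's exact masking scheme may differ.

```python
import random

def cs_mlm_mask(tokens, lang_tags, mask_token="[MASK]", p=0.15,
                rng=random.Random(0)):
    # tokens: code-switched sentence; lang_tags: hypothetical per-token
    # language ids. Tokens at a switch point are masked more aggressively.
    masked, labels = [], []
    for i, (tok, lang) in enumerate(zip(tokens, lang_tags)):
        at_switch = i > 0 and lang_tags[i - 1] != lang
        if rng.random() < (2 * p if at_switch else p):
            masked.append(mask_token)
            labels.append(tok)      # predict the original token here
        else:
            masked.append(tok)
            labels.append(None)     # no loss at unmasked positions
    return masked, labels
```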