



Abstract: Cognates are variants of the same lexical form that occur across different languages (e.g., "hund" in German and "hound" in English both mean "dog"). They pose a challenge to various Natural Language Processing (NLP) applications such as Machine Translation, Cross-lingual Sense Disambiguation, Computational Phylogenetics, and Information Retrieval. A possible solution to address this challenge is to identify cognates across language pairs. In this paper, we describe the creation of two cognate datasets for twelve Indian languages, namely Sanskrit, Hindi, Assamese, Oriya, Kannada, Gujarati, Tamil, Telugu, Punjabi, Bengali, Marathi, and Malayalam. We digitize the cognate data from an Indian language cognate dictionary and utilize linked Indian language Wordnets to generate cognate sets. Additionally, we use the Wordnet data to create a False Friends dataset for eleven language pairs. We evaluate the efficacy of our dataset using previously available baseline cognate detection approaches. We also perform a manual evaluation with the help of lexicographers and release the curated gold-standard dataset with this paper.
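
As an illustration of how linked wordnets can yield cognate sets and false friends, the sketch below pairs lemmas that share a synset ID (cognate candidates) or look alike without sharing any synset (false-friend candidates). It is a minimal, hypothetical pipeline, not the released one: each wordnet is assumed to be a plain dictionary mapping linked synset IDs to lemma lists, and the surface similarity measure assumes lemmas in a common script or transliteration.

```python
# Minimal sketch (not the released pipeline) of deriving cognate and false-friend
# candidates from linked wordnets. Each wordnet is assumed to be a dict mapping
# linked synset IDs to lists of lemmas, with lemmas in a common script.
from difflib import SequenceMatcher

def surface_similarity(a: str, b: str) -> float:
    """Character-level similarity between two lemmas, in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def invert(wn: dict) -> dict:
    """Map lemma -> set of synset IDs in which it appears."""
    out = {}
    for sid, lemmas in wn.items():
        for w in lemmas:
            out.setdefault(w, set()).add(sid)
    return out

def cognate_candidates(wn_l1: dict, wn_l2: dict, threshold: float = 0.7):
    """Lemma pairs that share a synset ID and look alike."""
    pairs = []
    for sid in wn_l1.keys() & wn_l2.keys():
        for w1 in wn_l1[sid]:
            for w2 in wn_l2[sid]:
                if surface_similarity(w1, w2) >= threshold:
                    pairs.append((sid, w1, w2))
    return pairs

def false_friend_candidates(wn_l1: dict, wn_l2: dict, threshold: float = 0.9):
    """Lemma pairs that look (near-)identical but never share a synset."""
    inv1, inv2 = invert(wn_l1), invert(wn_l2)
    return [(w1, w2)
            for w1, s1 in inv1.items()
            for w2, s2 in inv2.items()
            if surface_similarity(w1, w2) >= threshold and not (s1 & s2)]
```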




Abstract: Cognates are variants of the same lexical form across different languages; for example, 'fonema' in Spanish and 'phoneme' in English are cognates, both meaning 'a unit of sound'. Automatic detection of cognates between two languages can help downstream NLP tasks such as Cross-lingual Information Retrieval, Computational Phylogenetics, and Machine Translation. In this paper, we demonstrate the use of cross-lingual word embeddings for detecting cognates among fourteen Indian languages. Our approach introduces the use of context from a knowledge graph to generate improved feature representations for cognate detection. We then evaluate the impact of our cognate detection mechanism on neural machine translation (NMT) as a downstream task. We evaluate our methods on a challenging dataset of twelve Indian languages, namely Sanskrit, Hindi, Assamese, Oriya, Kannada, Gujarati, Tamil, Telugu, Punjabi, Bengali, Marathi, and Malayalam. Additionally, we create evaluation datasets for two more Indian languages, Konkani and Nepali. We observe an improvement of up to 18 percentage points in F-score for cognate detection. Furthermore, we observe that cognates extracted using our method help improve NMT quality by up to 2.76 BLEU. We also release our code, newly constructed datasets, and cross-lingual models publicly.
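
A minimal sketch of scoring candidate pairs with cross-lingual embeddings enriched by knowledge-graph context follows. Here `embed` and `gloss_words` are caller-supplied placeholders (a lookup into a shared cross-lingual embedding space and a wordnet gloss lookup, respectively), and the similarity threshold is illustrative rather than taken from the paper.

```python
# Minimal sketch of embedding-based cognate scoring with knowledge-graph context.
# `embed(word, lang)` and `gloss_words(word, lang)` are caller-supplied placeholders:
# a shared cross-lingual embedding lookup and a wordnet gloss lookup.
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def context_vector(word, lang, embed, gloss_words) -> np.ndarray:
    """Average the word's vector with the vectors of its wordnet gloss words."""
    vecs = [embed(word, lang)] + [embed(g, lang) for g in gloss_words(word, lang)]
    return np.mean(vecs, axis=0)

def is_cognate(w1, l1, w2, l2, embed, gloss_words, threshold=0.6) -> bool:
    """Accept a candidate pair if its context-enriched vectors are similar enough."""
    v1 = context_vector(w1, l1, embed, gloss_words)
    v2 = context_vector(w2, l2, embed, gloss_words)
    return cosine(v1, v2) >= threshold
```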




Abstract: Automatic detection of cognates helps downstream NLP tasks such as Machine Translation, Cross-lingual Information Retrieval, Computational Phylogenetics, and Cross-lingual Named Entity Recognition. Previous approaches to cognate detection use orthographic, phonetic, and semantic similarity based feature sets. In this paper, we propose a novel method for enriching these feature sets with cognitive features extracted from human readers' gaze behaviour. We collect gaze behaviour data for a small sample of cognates and show that the extracted cognitive features help the task of cognate detection. However, gaze data collection and annotation is a costly task. We therefore use the collected gaze behaviour data to predict cognitive features for a larger sample and show that the predicted cognitive features also significantly improve task performance. We report improvements of 10% with the collected gaze features and 12% with the predicted gaze features over previously proposed approaches. Furthermore, we release the collected gaze behaviour data along with our code and cross-lingual models.
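
The sketch below shows, with assumed feature names and toy data, how an orthographic similarity feature can be enriched with gaze-derived features (e.g., fixation duration and regression count) before training a standard classifier; it is illustrative only and does not reproduce the paper's feature set or model.

```python
# Minimal sketch: enrich an orthographic similarity feature with assumed
# gaze-derived features and train a standard classifier on cognate labels.
import numpy as np
from difflib import SequenceMatcher
from sklearn.svm import SVC

def features(pair, gaze):
    """pair = (word_l1, word_l2); gaze holds assumed keys 'fixation_ms', 'regressions'."""
    w1, w2 = pair
    orth_sim = SequenceMatcher(None, w1, w2).ratio()
    return [orth_sim, gaze["fixation_ms"], gaze["regressions"]]

# Toy training data: 1 = cognate, 0 = non-cognate.
training = [
    (("hund", "hound"), {"fixation_ms": 210.0, "regressions": 1}, 1),
    (("gift", "gist"),  {"fixation_ms": 430.0, "regressions": 3}, 0),
]
X = np.array([features(p, g) for p, g, _ in training])
y = np.array([label for _, _, label in training])
clf = SVC(kernel="rbf").fit(X, y)
```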




Abstract: Question Answering systems these days typically use template-based language generation. Though adequate for domain-specific tasks, such systems are too restrictive and predefined for domain-independent systems. This paper proposes a system that outputs a full-length answer given a question and the extracted factoid answer (a short span such as a named entity) as input. Our system uses constituency and dependency parse trees of questions. A transformer-based Grammatical Error Correction model, GECToR (2020), is used as a post-processing step for better fluency. We compare our system with (i) a Modified Pointer Generator (SOTA) and (ii) fine-tuned DialoGPT for factoid questions. We also test our approach on existential (yes-no) questions, with better results. Our model generates more accurate and fluent answers than the state-of-the-art (SOTA) approaches. The evaluation is done on the NewsQA and SQuAD datasets, with an increase of 0.4 and 0.9 percentage points in ROUGE-1 score, respectively. Inference time is also reduced by 85% compared to the SOTA. The improved datasets used for our evaluation will be released as part of the research contribution.




Abstract: Prepositions are frequently occurring polysemous words. Disambiguating prepositions is crucial in tasks like semantic role labelling, question answering, text entailment, and noun compound paraphrasing. In this paper, we propose a novel methodology for preposition sense disambiguation (PSD) that does not use any linguistic tools. In a supervised setting, the machine learning model is presented with sentences in which prepositions have been annotated with senses; these senses are sense IDs from The Preposition Project (TPP). We use the hidden layer representations from pre-trained BERT and BERT variants. The latent representations are then classified into the correct sense ID using a Multi-Layer Perceptron. The dataset used for this task is from SemEval-2007 Task 6. Our methodology gives an accuracy of 86.85%, which is better than the state-of-the-art.
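
A minimal sketch of such a pipeline is given below: the contextual representation of the target preposition is read off a pre-trained BERT model (assumed here to be bert-base-uncased, using the last hidden layer) and classified into a TPP sense ID by a small MLP. The layer choice, hidden sizes, and number of senses are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: extract the preposition's contextual vector from BERT and
# classify it into a TPP sense ID with a small MLP (illustrative configuration).
import torch
from torch import nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def preposition_vector(sentence: str, prep_word_index: int) -> torch.Tensor:
    """Return the hidden state of the first sub-token of the target preposition."""
    enc = tokenizer(sentence.split(), is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**enc).last_hidden_state[0]      # (seq_len, 768)
    word_ids = enc.word_ids()                          # token -> word index mapping
    token_pos = word_ids.index(prep_word_index)
    return hidden[token_pos]

NUM_SENSES = 34                                        # task-dependent; illustrative
mlp = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, NUM_SENSES))

vec = preposition_vector("He sat on the chair", prep_word_index=2)
logits = mlp(vec)                                      # trained with cross-entropy in practice
predicted_sense = logits.argmax().item()
```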




Abstract: Computational Humour (CH) has attracted the interest of the Natural Language Processing and Computational Linguistics communities. Creating datasets for automatic measurement of humour quotient is difficult due to multiple possible interpretations of the content. In this work, we create a multi-modal humour-annotated dataset ($\sim$40 hours) using stand-up comedy clips. We devise a novel scoring mechanism to annotate the training data with a humour quotient score using the audience's laughter. The normalized duration of laughter (laughter duration divided by clip duration) in each clip is used to compute this humour quotient score on a five-point scale (0-4). This method of scoring is validated by comparison with manually annotated scores, where a quadratic weighted kappa of 0.6 is obtained. We use this dataset to train a model that provides a "funniness" score, on a five-point scale, given audio and its corresponding text. We compare various neural language models on the task of humour rating and achieve an accuracy of $0.813$ in terms of Quadratic Weighted Kappa (QWK). Our "Open Mic" dataset is released for further research along with the code.
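
A minimal sketch of the scoring idea: the fraction of a clip occupied by audience laughter is mapped onto the 0-4 scale. The equal-width binning and the assumed maximum laughter ratio below are illustrative; the paper's exact bin boundaries are not reproduced here.

```python
# Minimal sketch of the humour quotient score: normalised laughter duration
# mapped to a 0-4 scale via equal-width bins (illustrative bin boundaries).
def humour_quotient(laughter_seconds: float, clip_seconds: float,
                    max_ratio: float = 0.5) -> int:
    """Return a score in {0,...,4} from the normalised laughter duration."""
    ratio = laughter_seconds / clip_seconds
    scaled = min(ratio / max_ratio, 1.0)    # clamp to [0, 1]
    return min(int(scaled * 5), 4)          # 5 equal-width bins -> 0..4

print(humour_quotient(laughter_seconds=6.0, clip_seconds=30.0))  # -> 2
```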




Abstract: During the COVID-19 pandemic, extracting any relevant information related to COVID-19 is immensely beneficial to the community at large. In this paper, we present an important resource, COVIDRead, a Stanford Question Answering Dataset (SQuAD)-like dataset of more than 100k question-answer pairs. The dataset consists of Context-Answer-Question triples. The questions are primarily constructed from the context in an automated way, after which the system-generated questions are manually checked by human annotators. This is a valuable resource that could serve many purposes, ranging from common people's queries regarding this very uncommon disease to the management of articles by editors/associate editors of a journal. We establish several end-to-end neural network based baseline models that attain a lowest F1 of 32.03% and a highest F1 of 37.19%. To the best of our knowledge, we are the first to provide a QA dataset of this kind and volume on COVID-19. This dataset creates a new avenue for research on COVID-19 by providing a benchmark dataset and a baseline model.
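
For concreteness, a Context-Answer-Question triple can be laid out in the familiar SQuAD style, as in the invented example below; the field names follow the SQuAD convention and are an assumption, not a quotation of the exact COVIDRead schema.

```python
# Illustrative SQuAD-style Context-Answer-Question triple (invented example;
# field names follow the SQuAD convention, not necessarily the COVIDRead schema).
example = {
    "context": "COVID-19 is caused by the SARS-CoV-2 virus, first reported in 2019.",
    "qas": [
        {
            "question": "Which virus causes COVID-19?",
            "answers": [{"text": "SARS-CoV-2", "answer_start": 26}],
        }
    ],
}
assert example["context"][26:36] == "SARS-CoV-2"   # answer_start indexes the context
```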




Abstract: We explore the impact of leveraging the relatedness of languages that belong to the same family in NLP models through multilingual fine-tuning. We hypothesize and validate that multilingual fine-tuning of pre-trained language models can yield better performance on downstream NLP applications compared to models fine-tuned on individual languages. A first-of-its-kind detailed study is presented to track performance change as languages are added to a base language in a graded and greedy (in the sense of best performance boost) manner; it reveals that careful selection of a subset of related languages can improve performance significantly more than utilizing all related languages. The Indo-Aryan (IA) language family is chosen for the study, the exact languages being Bengali, Gujarati, Hindi, Marathi, Oriya, Punjabi, and Urdu. The script barrier is crossed by simple rule-based transliteration of the text of all languages into Devanagari. Experiments are performed on mBERT, IndicBERT, MuRIL, and two RoBERTa-based LMs, the last two pre-trained by us. Low-resource languages, such as Oriya and Punjabi, are found to be the largest beneficiaries of multilingual fine-tuning. Textual Entailment, Entity Classification, and Section Title Prediction tasks of IndicGLUE, along with POS tagging, form our test bed. Compared to monolingual fine-tuning, we obtain relative performance improvements of up to 150% on the downstream tasks. The surprising take-away is that for any language there is a particular combination of other languages that yields the best performance, and any additional language is in fact detrimental.
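
As a sketch of the kind of rule-based transliteration that crosses the script barrier: the major Indic scripts occupy parallel 128-code-point Unicode blocks, so many characters map to Devanagari by a fixed offset. The mapping below is an illustrative approximation, not the exact scheme used in the paper; Urdu (Perso-Arabic script) and various script-specific characters require additional rules.

```python
# Minimal sketch of offset-based transliteration to Devanagari, relying on the
# parallel layout of Indic script blocks in Unicode (approximate, illustrative).
BLOCK_START = {
    "bn": 0x0980,  # Bengali
    "gu": 0x0A80,  # Gujarati
    "pa": 0x0A00,  # Gurmukhi (Punjabi)
    "or": 0x0B00,  # Oriya
    "hi": 0x0900,  # Devanagari (identity; also covers Marathi)
}
DEVANAGARI_START = 0x0900

def to_devanagari(text: str, lang: str) -> str:
    start = BLOCK_START[lang]
    out = []
    for ch in text:
        cp = ord(ch)
        if start <= cp < start + 0x80:
            out.append(chr(cp - start + DEVANAGARI_START))
        else:
            out.append(ch)            # keep punctuation, digits, Latin text, etc.
    return "".join(out)

print(to_devanagari("ગુજરાત", "gu"))  # -> "गुजरात" (approximate mapping)
```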




Abstract: Humor recognition in conversations is a challenging task that has recently gained popularity due to its importance in dialogue understanding, including in multimodal settings (i.e., text, acoustics, and visual). The few existing datasets for humor are mostly in English. However, due to the tremendous growth in multilingual content, there is great demand for models and systems that support multilingual information access. To this end, we propose a dataset for Multimodal Multiparty Hindi Humor (M2H2) recognition in conversations, containing 6,191 utterances from 13 episodes of the very popular TV series "Shrimaan Shrimati Phir Se". Each utterance is annotated with humor/non-humor labels and encompasses acoustic, visual, and textual modalities. We propose several strong multimodal baselines and show the importance of contextual and multimodal information for humor recognition in conversations. The empirical results on the M2H2 dataset demonstrate that multimodal information complements unimodal information for humor recognition. The dataset and the baselines are available at http://www.iitp.ac.in/~ai-nlp-ml/resources.html and https://github.com/declare-lab/M2H2-dataset.
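
As an illustration of how the three modalities can be combined, a simple late-fusion classifier is sketched below; the feature dimensions and the fusion strategy are assumptions for illustration, not the paper's baselines.

```python
# Minimal sketch of late fusion over textual, acoustic, and visual utterance
# features for humor vs. non-humor classification (illustrative dimensions).
import torch
from torch import nn

class LateFusionHumourClassifier(nn.Module):
    def __init__(self, d_text=768, d_audio=128, d_video=512, d_hidden=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(d_text + d_audio + d_video, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, 2),     # humor vs. non-humor
        )

    def forward(self, text_vec, audio_vec, video_vec):
        return self.fuse(torch.cat([text_vec, audio_vec, video_vec], dim=-1))

model = LateFusionHumourClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 128), torch.randn(4, 512))
```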




Abstract: Recent advances in Unsupervised Neural Machine Translation (UNMT) have minimized the gap between supervised and unsupervised machine translation performance for closely related language pairs. However, the situation is very different for distant language pairs. Lack of lexical overlap and low syntactic similarity, such as between English and the Indo-Aryan languages, lead to poor translation quality in existing UNMT systems. In this paper, we show that initializing the embedding layer of UNMT models with cross-lingual embeddings yields significant improvements in BLEU score over existing approaches that initialize embeddings randomly. Further, static embeddings (freezing the embedding layer weights) lead to better gains than updating the embedding layer weights during training (non-static). We experiment with the Masked Sequence to Sequence (MASS) and Denoising Autoencoder (DAE) UNMT approaches for three distant language pairs. The proposed cross-lingual embedding initialization yields BLEU score improvements of as much as ten times over the baseline for English-Hindi, English-Bengali, and English-Gujarati. Our analysis shows the importance of cross-lingual embeddings, comparisons between approaches, and the scope for improvement in these systems.
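
A minimal PyTorch sketch of the initialization discussed above: pre-trained cross-lingual vectors, aligned to the shared vocabulary, are loaded into the UNMT embedding layer and optionally frozen (the "static" setting). The vocabulary size and dimensionality below are illustrative, and random vectors stand in for real cross-lingual embeddings.

```python
# Minimal sketch: initialize a UNMT embedding layer with cross-lingual vectors,
# optionally freezing it ("static") instead of updating it during training.
import torch
from torch import nn

def init_embedding(vocab_size: int, dim: int,
                   pretrained: torch.Tensor, static: bool = True) -> nn.Embedding:
    """pretrained: (vocab_size, dim) cross-lingual vectors aligned to the vocabulary."""
    assert pretrained.shape == (vocab_size, dim)
    return nn.Embedding.from_pretrained(pretrained, freeze=static)

# Illustrative usage: random vectors stand in for real cross-lingual embeddings.
vectors = torch.randn(32000, 512)
emb = init_embedding(32000, 512, vectors, static=True)   # frozen ("static") variant
```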