Antonios Anastasopoulos

Archimedes, Athena Research Center, Greece, Department of Computer Science, George Mason University

BiasDora: Exploring Hidden Biased Associations in Vision-Language Models

Jul 02, 2024

Chahat Raj, Anjishnu Mukherjee, Aylin Caliskan, Antonios Anastasopoulos, Ziwei Zhu

Abstract: Existing works examining Vision Language Models (VLMs) for social biases predominantly focus on a limited set of documented bias associations, such as gender:profession or race:crime. This narrow scope often overlooks a vast range of unexamined implicit associations, restricting the identification and, hence, mitigation of such biases. We address this gap by probing VLMs to (1) uncover hidden, implicit associations across 9 bias dimensions. We systematically explore diverse input and output modalities and (2) demonstrate how biased associations vary in their negativity, toxicity, and extremity. Our work (3) identifies subtle and extreme biases that are typically not recognized by existing methodologies. We make the Dataset of retrieved associations (Dora) publicly available at https://github.com/chahatraj/BiasDora.

* Under Review

Gloss2Text: Sign Language Gloss translation using LLMs and Semantically Aware Label Smoothing

Jul 01, 2024

Pooya Fayyazsanavi, Antonios Anastasopoulos, Jana Košecká

Abstract: Sign language translation from video to spoken text presents unique challenges owing to the distinct grammar, expression nuances, and high variation of visual appearance across different speakers and contexts. The intermediate gloss annotations of videos aim to guide the translation process. In our work, we focus on the Gloss2Text translation stage and propose several advances by leveraging pre-trained large language models (LLMs), data augmentation, and a novel label-smoothing loss function that exploits gloss translation ambiguities, significantly improving on the performance of state-of-the-art approaches. Through extensive experiments and ablation studies on the PHOENIX Weather 2014T dataset, our approach surpasses state-of-the-art performance in Gloss2Text translation, indicating its efficacy in addressing sign language translation and suggesting promising avenues for future research and development.
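The abstract above mentions a label-smoothing loss that exploits translation ambiguities. As a rough illustration only, and not the authors' formulation, the sketch below spreads the smoothing mass over vocabulary entries in proportion to a similarity score instead of uniformly. The function names, the `similarities` input, and the `epsilon` default are all assumptions made for this example.

```python
def semantic_label_smoothing(target, similarities, epsilon=0.1):
    """Return a smoothed target distribution over the vocabulary.

    target       -- index of the gold token
    similarities -- hypothetical non-negative similarity of every vocabulary
                    entry to the gold token (e.g. from word embeddings)
    epsilon      -- total probability mass moved away from the gold token
    """
    n = len(similarities)
    if n == 1:
        return [1.0]  # nothing to smooth towards
    # The gold token keeps 1 - epsilon, so exclude it from the sharing weights.
    weights = [0.0 if i == target else s for i, s in enumerate(similarities)]
    total = sum(weights)
    if total == 0.0:
        # No similar tokens known: fall back to ordinary uniform smoothing.
        weights = [0.0 if i == target else 1.0 for i in range(n)]
        total = float(n - 1)
    dist = [epsilon * w / total for w in weights]
    dist[target] = 1.0 - epsilon
    return dist


def soft_cross_entropy(dist, log_probs):
    """Cross-entropy between a soft target and the model's log-probabilities."""
    return -sum(p * lp for p, lp in zip(dist, log_probs))
```

With this construction, tokens that are close in meaning to the gold token receive more of the smoothing mass than unrelated ones, so the model is penalized less for plausible alternatives.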

Script-Agnostic Language Identification

Jun 25, 2024

Milind Agarwal, Joshua Otten, Antonios Anastasopoulos

Abstract: Language identification is used as the first step in many data collection and crawling efforts because it allows us to sort online text into language-specific buckets. However, many modern languages, such as Konkani, Kashmiri, and Punjabi, are synchronically written in several scripts. Moreover, languages with different writing systems do not share significant lexical, semantic, and syntactic properties in neural representation spaces, which is a disadvantage for closely related languages and low-resource languages, especially those from the Indian Subcontinent. To counter this, we propose learning script-agnostic representations using several different experimental strategies (upscaling, flattening, and script mixing) focusing on four major Dravidian languages (Tamil, Telugu, Kannada, and Malayalam). We find that word-level script randomization and exposure to a language written in multiple scripts are extremely valuable for downstream script-agnostic language identification, while also maintaining competitive performance on naturally occurring text.
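To make the idea of word-level script randomization concrete, here is a minimal sketch. It is not the authors' pipeline: it assumes a crude transliteration that shifts each character by the offset between Unicode blocks, relying on the fact that the Tamil, Telugu, Kannada, and Malayalam blocks share a parallel layout. Some shifted code points may be unassigned in the target script (Tamil in particular has fewer letters), so a real system would use a proper transliterator. All function names are hypothetical.

```python
import random

# First code point of each script's Unicode block; every block spans 0x80.
BLOCK_START = {
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80


def script_of(ch):
    """Return the script name for a character, or None if it is outside all four blocks."""
    cp = ord(ch)
    for name, start in BLOCK_START.items():
        if start <= cp < start + BLOCK_SIZE:
            return name
    return None


def convert_word(word, target):
    """Shift every in-script character to the same offset in the target block."""
    out = []
    for ch in word:
        src = script_of(ch)
        if src is None:
            out.append(ch)  # punctuation, digits, Latin text: leave untouched
        else:
            out.append(chr(ord(ch) - BLOCK_START[src] + BLOCK_START[target]))
    return "".join(out)


def randomize_scripts(sentence, rng):
    """Rewrite each whitespace-separated word in an independently sampled script."""
    scripts = sorted(BLOCK_START)
    return " ".join(convert_word(w, rng.choice(scripts)) for w in sentence.split())
```

Passing a seeded `random.Random` instance as `rng` keeps the augmentation reproducible across runs.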

* Under Review in ACL Rolling Review

Unlearning Climate Misinformation in Large Language Models

May 29, 2024

Michael Fore, Simranjit Singh, Chaehong Lee, Amritanshu Pandey, Antonios Anastasopoulos, Dimitrios Stamoulis

Abstract: Misinformation regarding climate change is a key roadblock in addressing one of the most serious threats to humanity. This paper investigates factual accuracy in large language models (LLMs) regarding climate information. Using true/false labeled Q&A data for fine-tuning and evaluating LLMs on climate-related claims, we compare open-source models, assessing their ability to generate truthful responses to climate change questions. We investigate the detectability of models intentionally poisoned with false climate information, finding that such poisoning may not affect the accuracy of a model's responses in other domains. Furthermore, we compare the effectiveness of unlearning algorithms, fine-tuning, and Retrieval-Augmented Generation (RAG) for factually grounding LLMs on climate change topics. Our evaluation reveals that unlearning algorithms can be effective for nuanced conceptual claims, despite previous findings suggesting their inefficacy in privacy contexts. These insights aim to guide the development of more factually reliable LLMs and highlight the need for additional work to secure LLMs against misinformation attacks.

EmoMix-3L: A Code-Mixed Dataset for Bangla-English-Hindi Emotion Detection

May 11, 2024

Nishat Raihan, Dhiman Goswami, Antara Mahmud, Antonios Anastasopoulos, Marcos Zampieri

Abstract: Code-mixing is a well-studied linguistic phenomenon that occurs when two or more languages are mixed in text or speech. Several studies have been conducted on building datasets and performing downstream NLP tasks on code-mixed data. Although it is not uncommon to observe code-mixing of three or more languages, most available datasets in this domain contain code-mixed data from only two languages. In this paper, we introduce EmoMix-3L, a novel multi-label emotion detection dataset containing code-mixed data from three different languages. We experiment with several models on EmoMix-3L and we report that MuRIL outperforms other models on this dataset.

* arXiv admin note: substantial text overlap with arXiv:2310.18387, arXiv:2310.18023

Data-Augmentation-Based Dialectal Adaptation for LLMs

Apr 11, 2024

Fahim Faisal, Antonios Anastasopoulos

Abstract: This report presents GMUNLP's participation in the Dialect-Copa shared task at VarDial 2024, which focuses on evaluating the commonsense reasoning capabilities of large language models (LLMs) on South Slavic micro-dialects. The task aims to assess how well LLMs can handle non-standard dialectal varieties, as their performance on standard languages is already well-established. We propose an approach that combines the strengths of different types of language models and leverages data augmentation techniques to improve task performance on three South Slavic dialects: Chakavian, Cherkano, and Torlak. We conduct experiments using a language-family-focused encoder-based model (BERTić) and a domain-agnostic multilingual model (AYA-101). Our results demonstrate that the proposed data augmentation techniques lead to substantial performance gains across all three test datasets in the open-source model category. This work highlights the practical utility of data augmentation and the potential of LLMs in handling non-standard dialectal varieties, contributing to the broader goal of advancing natural language understanding in low-resource and dialectal settings. Code: https://github.com/ffaisal93/dialect_copa

CMULAB: An Open-Source Framework for Training and Deployment of Natural Language Processing Models

Apr 03, 2024

Zaid Sheikh, Antonios Anastasopoulos, Shruti Rijhwani, Lindia Tjuatja, Robbie Jimerson, Graham Neubig

Abstract: Effectively using Natural Language Processing (NLP) tools in under-resourced languages requires a thorough understanding of the language itself, familiarity with the latest models and training methodologies, and technical expertise to deploy these models. This could present a significant obstacle for language community members and linguists to use NLP tools. This paper introduces the CMU Linguistic Annotation Backend, an open-source framework that simplifies model deployment and continuous human-in-the-loop fine-tuning of NLP models. CMULAB enables users to leverage the power of multilingual models to quickly adapt and extend existing tools for speech recognition, OCR, translation, and syntactic analysis to new languages, even with limited training data. We describe various tools and APIs that are currently available and how developers can easily add new models/functionality to the framework. Code is available at https://github.com/neulab/cmulab along with a live demo at https://cmulab.dev

* Live demo at https://cmulab.dev

A Study on Scaling Up Multilingual News Framing Analysis

Apr 01, 2024

Syeda Sabrina Akter, Antonios Anastasopoulos

Abstract: Media framing is the study of strategically selecting and presenting specific aspects of political issues to shape public opinion. Despite its relevance to almost all societies around the world, research has been limited due to the lack of available datasets and other resources. This study explores the possibility of dataset creation through crowdsourcing, utilizing non-expert annotators to develop training corpora. We first extend framing analysis beyond English news to a multilingual context (12 typologically diverse languages) through automatic translation. We also present a novel benchmark in Bengali and Portuguese on the immigration and same-sex marriage domains. Additionally, we show that a system trained on our crowd-sourced dataset, combined with other existing ones, leads to a 5.32 percentage point increase from the baseline, showing that crowdsourcing is a viable option. Last, we study the performance of large language models (LLMs) for this task, finding that task-specific fine-tuning is a better approach than employing bigger non-specialized models.

* accepted at NAACL 2024

An Efficient Approach for Studying Cross-Lingual Transfer in Multilingual Language Models

Mar 29, 2024

Fahim Faisal, Antonios Anastasopoulos

Abstract: The capacity and effectiveness of pre-trained multilingual models (MLMs) for zero-shot cross-lingual transfer are well established. However, phenomena of positive or negative transfer, and the effect of language choice still need to be fully understood, especially in the complex setting of massively multilingual LMs. We propose an efficient method to study transfer language influence in zero-shot performance on another target language. Unlike previous work, our approach disentangles downstream tasks from language, using dedicated adapter units. Our findings suggest that some languages do not largely affect others, while some languages, especially ones unseen during pre-training, can be extremely beneficial or detrimental for different target languages. We find that no transfer language is beneficial for all target languages. We do, curiously, observe that languages previously unseen by MLMs consistently benefit from transfer from almost any language. We additionally use our modular approach to quantify negative interference efficiently and categorize languages accordingly. Furthermore, we provide a list of promising transfer-target language configurations that consistently lead to target language performance improvements. Code and data are publicly available: https://github.com/ffaisal93/neg_inf

DIALECTBENCH: A NLP Benchmark for Dialects, Varieties, and Closely-Related Languages

Mar 16, 2024

Fahim Faisal, Orevaoghene Ahia, Aarohi Srivastava, Kabir Ahuja, David Chiang, Yulia Tsvetkov, Antonios Anastasopoulos

Abstract: Language technologies should be judged on their usefulness in real-world use cases. An often overlooked aspect in natural language processing (NLP) research and evaluation is language variation in the form of non-standard dialects or language varieties (hereafter, varieties). Most NLP benchmarks are limited to standard language varieties. To fill this gap, we propose DIALECTBENCH, the first-ever large-scale benchmark for NLP on varieties, which aggregates an extensive set of task-varied variety datasets (10 text-level tasks covering 281 varieties). This allows for a comprehensive evaluation of NLP system performance on different language varieties. We provide substantial evidence of performance disparities between standard and non-standard language varieties, and we also identify language clusters with large performance divergence across tasks. We believe DIALECTBENCH provides a comprehensive view of the current state of NLP for language varieties and one step towards advancing it further. Code/data: https://github.com/ffaisal93/DialectBench

* Equal contribution: Fahim Faisal, Orevaoghene Ahia
