Diptesh Kanojia

Towards a Robust Framework for Multimodal Hate Detection: A Study on Video vs. Image-based Content

Feb 11, 2025

Girish A. Koushik, Diptesh Kanojia, Helen Treharne

Abstract:Social media platforms enable the propagation of hateful content across different modalities such as textual, auditory, and visual, necessitating effective detection methods. While recent approaches have shown promise in handling individual modalities, their effectiveness across different modality combinations remains unexplored. This paper presents a systematic analysis of fusion-based approaches for multimodal hate detection, focusing on their performance across video and image-based content. Our comprehensive evaluation reveals significant modality-specific limitations: while simple embedding fusion achieves state-of-the-art performance on video content (HateMM dataset) with a 9.9 percentage-point F1-score improvement, it struggles with complex image-text relationships in memes (Hateful Memes dataset). Through detailed ablation studies and error analysis, we demonstrate how current fusion approaches fail to capture nuanced cross-modal interactions, particularly in cases involving benign confounders. Our findings provide crucial insights for developing more robust hate detection systems and highlight the need for modality-specific architectural considerations. The code is available at https://github.com/gak97/Video-vs-Meme-Hate.

* Accepted to the MM4SG Workshop at the WebConf 2025
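
The abstract does not spell out what "simple embedding fusion" means. One common reading is that per-modality embeddings are concatenated and passed to a classifier. The sketch below illustrates that reading only; the function names, embedding sizes, and weights are hypothetical and are not taken from the paper or its repository.

```python
import math
from typing import Sequence


def fuse_embeddings(*modalities: Sequence[float]) -> list[float]:
    """Concatenate per-modality embeddings into one feature vector."""
    fused: list[float] = []
    for embedding in modalities:
        fused.extend(embedding)
    return fused


def hate_probability(fused: Sequence[float],
                     weights: Sequence[float],
                     bias: float = 0.0) -> float:
    """Logistic score over the fused vector (stand-in for a trained classifier)."""
    if len(fused) != len(weights):
        raise ValueError("weights must match the fused embedding length")
    z = sum(x * w for x, w in zip(fused, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Toy embeddings (assumed values; real ones would come from pretrained encoders).
text_emb = [0.2, -0.1]
audio_emb = [0.5]
video_emb = [0.0, 0.3, -0.4]

fused = fuse_embeddings(text_emb, audio_emb, video_emb)
untrained = hate_probability(fused, [0.0] * len(fused))  # all-zero weights
```

Because concatenation treats each modality's features independently, a classifier on top of it has no explicit mechanism for modelling the image-text interplay that the abstract identifies as the weak point on memes.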

Giving the Old a Fresh Spin: Quality Estimation-Assisted Constrained Decoding for Automatic Post-Editing

Jan 28, 2025

Sourabh Deoghare, Diptesh Kanojia, Pushpak Bhattacharyya

Abstract:Automatic Post-Editing (APE) systems often struggle with over-correction, where unnecessary modifications are made to a translation, diverging from the principle of minimal editing. In this paper, we propose a novel technique to mitigate over-correction by incorporating word-level Quality Estimation (QE) information during the decoding process. This method is architecture-agnostic, making it adaptable to any APE system, regardless of the underlying model or training approach. Our experiments on English-German, English-Hindi, and English-Marathi language pairs show the proposed approach yields significant improvements over their corresponding baseline APE systems, with TER gains of $0.65$, $1.86$, and $1.44$ points, respectively. These results underscore the complementary relationship between QE and APE tasks and highlight the effectiveness of integrating QE information to reduce over-correction in APE systems.

* Accepted to NAACL 2025 Main Conference: Short Papers
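
The abstract does not describe the decoding mechanism itself. As a rough illustration of the underlying idea (word-level QE tags limiting where an APE system may change the translation), the sketch below applies the tags after the fact rather than during decoding. It assumes `OK`/`BAD` tags and a one-to-one token alignment between the MT output and the APE output; both are simplifying assumptions, and the function name is hypothetical.

```python
from typing import List


def gate_edits_with_qe(mt_tokens: List[str],
                       ape_tokens: List[str],
                       qe_tags: List[str]) -> List[str]:
    """Keep MT tokens tagged OK; accept the APE token only where the tag is BAD.

    Assumes all three lists are aligned token by token, which real APE output
    generally is not (insertions and deletions change the length).
    """
    if not (len(mt_tokens) == len(ape_tokens) == len(qe_tags)):
        raise ValueError("this sketch requires one-to-one aligned inputs")
    return [ape if tag == "BAD" else mt
            for mt, ape, tag in zip(mt_tokens, ape_tokens, qe_tags)]


mt = ["Das", "ist", "ein", "gut", "Haus"]
ape = ["Dies", "ist", "ein", "gutes", "Haus"]   # "Das" -> "Dies" is an over-correction
tags = ["OK", "OK", "OK", "BAD", "OK"]

gated = gate_edits_with_qe(mt, ape, tags)
```

Here the edit to the token tagged `BAD` is kept, while the unnecessary change to a token tagged `OK` is discarded, which is the over-correction behaviour the paper aims to reduce.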

When LLMs Struggle: Reference-less Translation Evaluation for Low-resource Languages

Jan 08, 2025

Archchana Sindhujan, Diptesh Kanojia, Constantin Orasan, Shenbin Qian

Abstract:This paper investigates the reference-less evaluation of machine translation for low-resource language pairs, known as quality estimation (QE). Segment-level QE is a challenging cross-lingual language understanding task that assigns a quality score (0-100) to the translated output. We comprehensively evaluate large language models (LLMs) in zero/few-shot scenarios and perform instruction fine-tuning using a novel prompt based on annotation guidelines. Our results indicate that prompt-based approaches are outperformed by the encoder-based fine-tuned QE models. Our error analysis reveals tokenization issues, along with errors due to transliteration and named entities, and argues for refinement in LLM pre-training for cross-lingual tasks. We publicly release the data and trained models for further research.

PortraitTalk: Towards Customizable One-Shot Audio-to-Talking Face Generation

Dec 10, 2024

Fatemeh Nazarieh, Zhenhua Feng, Diptesh Kanojia, Muhammad Awais, Josef Kittler

Abstract:Audio-driven talking face generation is a challenging task in digital communication. Despite significant progress in the area, most existing methods concentrate on audio-lip synchronization, often overlooking aspects such as visual quality, customization, and generalization that are crucial to producing realistic talking faces. To address these limitations, we introduce a novel, customizable one-shot audio-driven talking face generation framework, named PortraitTalk. Our proposed method utilizes a latent diffusion framework consisting of two main components: IdentityNet and AnimateNet. IdentityNet is designed to preserve identity features consistently across the generated video frames, while AnimateNet aims to enhance temporal coherence and motion consistency. This framework also integrates an audio input with the reference images, thereby reducing the reliance on reference-style videos prevalent in existing approaches. A key innovation of PortraitTalk is the incorporation of text prompts through decoupled cross-attention mechanisms, which significantly expands creative control over the generated videos. Through extensive experiments, including a newly developed evaluation metric, our model demonstrates superior performance over the state-of-the-art methods, setting a new standard for the generation of customizable realistic talking faces suitable for real-world applications.

BESSTIE: A Benchmark for Sentiment and Sarcasm Classification for Varieties of English

Dec 06, 2024

Dipankar Srirag, Aditya Joshi, Jordan Painter, Diptesh Kanojia

Abstract:Despite large language models (LLMs) being known to exhibit bias against non-mainstream varieties, there are no known labeled datasets for sentiment analysis of English. To address this gap, we introduce BESSTIE, a benchmark for sentiment and sarcasm classification for three varieties of English: Australian (en-AU), Indian (en-IN), and British (en-UK). Using web-based content from two domains, namely, Google Place reviews and Reddit comments, we collect datasets for these language varieties using two methods: location-based and topic-based filtering. Native speakers of the language varieties manually annotate the datasets with sentiment and sarcasm labels. Subsequently, we fine-tune nine LLMs (representing a range of encoder/decoder and mono/multilingual models) on these datasets, and evaluate their performance on the two tasks. Our results reveal that the models consistently perform better on inner-circle varieties (i.e., en-AU and en-UK), with significant performance drops for en-IN, particularly in sarcasm detection. We also report challenges in cross-variety generalisation, highlighting the need for language variety-specific datasets such as ours. BESSTIE promises to be a useful evaluative benchmark for future research in equitable LLMs, specifically in terms of language varieties. The BESSTIE datasets, code, and models are currently available on request, while the paper is under review. Please email aditya.joshi@unsw.edu.au.

* 10 pages, 7 figures, under review

A Survey of Multimodal Sarcasm Detection

Oct 24, 2024

Shafkat Farabi, Tharindu Ranasinghe, Diptesh Kanojia, Yu Kong, Marcos Zampieri

Abstract:Sarcasm is a rhetorical device that is used to convey the opposite of the literal meaning of an utterance. Sarcasm is widely used on social media and other forms of computer-mediated communication, motivating the use of computational models to identify it automatically. While the clear majority of approaches to sarcasm detection have been carried out on text only, sarcasm detection often requires additional information present in tonality, facial expression, and contextual images. This has led to the introduction of multimodal models, opening the possibility to detect sarcasm in multiple modalities such as audio, images, text, and video. In this paper, we present the first comprehensive survey on multimodal sarcasm detection - henceforth MSD - to date. We survey papers published between 2018 and 2023 on the topic, and discuss the models and datasets used for this task. We also present future research directions in MSD.

* Published in the Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence Survey Track. Pages 8020-8028

Together We Can: Multilingual Automatic Post-Editing for Low-Resource Languages

Oct 23, 2024

Sourabh Deoghare, Diptesh Kanojia, Pushpak Bhattacharyya

Abstract:This exploratory study investigates the potential of multilingual Automatic Post-Editing (APE) systems to enhance the quality of machine translations for low-resource Indo-Aryan languages. Focusing on two closely related language pairs, English-Marathi and English-Hindi, we exploit the linguistic similarities to develop a robust multilingual APE model. To facilitate cross-linguistic transfer, we generate synthetic Hindi-Marathi and Marathi-Hindi APE triplets. Additionally, we incorporate a Quality Estimation (QE)-APE multi-task learning framework. While the experimental results underline the complementary nature of APE and QE, we also observe that QE-APE multitask learning facilitates effective domain adaptation. Our experiments demonstrate that the multilingual APE models outperform their corresponding English-Hindi and English-Marathi single-pair models by $2.5$ and $2.39$ TER points, respectively, with further notable improvements over the multilingual APE model observed through multi-task learning ($+1.29$ and $+1.44$ TER points), data augmentation ($+0.53$ and $+0.45$ TER points) and domain adaptation ($+0.35$ and $+0.45$ TER points). We release the synthetic data, code, and models accrued during this study publicly at https://github.com/cfiltnlp/Multilingual-APE.

* Accepted at Findings of EMNLP 2024

Centrality-aware Product Retrieval and Ranking

Oct 21, 2024

Hadeel Saadany, Swapnil Bhosale, Samarth Agrawal, Diptesh Kanojia, Constantin Orasan, Zhe Wu

Abstract:This paper addresses the challenge of improving user experience on e-commerce platforms by enhancing product ranking relevant to users' search queries. Ambiguity and complexity of user queries often lead to a mismatch between the user's intent and retrieved product titles or documents. Recent approaches have proposed the use of Transformer-based models, which need millions of annotated query-title pairs during the pre-training stage, and this data often does not take user intent into account. To tackle this, we curate samples from existing datasets at eBay, manually annotated with buyer-centric relevance scores and centrality scores, which reflect how well the product title matches the users' intent. We introduce a User-intent Centrality Optimization (UCO) approach for existing models, which optimises for the user intent in semantic product search. To that end, we propose a dual-loss based optimisation to handle hard negatives, i.e., product titles that are semantically relevant but do not reflect the user's intent. Our contributions include curating challenging evaluation sets and implementing UCO, resulting in significant product ranking efficiency improvements observed for different evaluation metrics. Our work aims to ensure that the most buyer-centric titles for a query are ranked higher, thereby enhancing the user experience on e-commerce platforms.

* EMNLP 2024: Industry track

Sampling Strategies for Creation of a Benchmark for Dialectal Sentiment Classification

Oct 15, 2024

Dipankar Srirag, Jordan Painter, Aditya Joshi, Diptesh Kanojia

Abstract:This paper investigates data sampling strategies to create a benchmark for dialectal sentiment classification of Google Places reviews written in English. Based on location-based filtering, we collect a dataset of reviews in Australian, Indian, and British English with self-supervised sentiment labels (1-star to 5-star). We employ sampling techniques based on label semantics, review length, and sentiment proportion and report performances on three fine-tuned BERT-based models. Our multi-dialect evaluation provides pointers to challenging scenarios for inner-circle (Australian English and British English) as well as non-native dialects (Indian English) of English, highlighting the need for more diverse benchmarks.

* Under review

Edit Distances and Their Applications to Downstream Tasks in Research and Commercial Contexts

Oct 08, 2024

Félix do Carmo, Diptesh Kanojia

Abstract:The tutorial describes the concept of edit distances applied to research and commercial contexts. We use Translation Edit Rate (TER), Levenshtein, Damerau-Levenshtein, Longest Common Subsequence and $n$-gram distances to demonstrate the frailty of statistical metrics when comparing text sequences. Our discussion disassembles them into their essential components. We discuss the centrality of four editing actions: insert, delete, replace and move words, and show their implementations in openly available packages and toolkits. The application of edit distances in downstream tasks often assumes that these accurately represent work done by post-editors and real errors that need to be corrected in MT output. We discuss how imperfect edit distances are in capturing the details of this error correction work, and the implications of these uses of edit distances for researchers and for commercial applications. In terms of commercial applications, we discuss their integration in computer-assisted translation tools and how the perception of the connection between edit distances and post-editor effort affects the definition of translator rates.

* Tutorial @ 16th AMTA Conference, 2024
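
Two of the distances named in the abstract can be sketched with textbook dynamic programming. The code below is a generic illustration, not the tutorial's own material: it implements Levenshtein distance (insert, delete, replace) and Longest Common Subsequence length over arbitrary sequences, so it works on characters or on word lists. The example sentences are invented for illustration.

```python
from typing import Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Minimum number of insertions, deletions and replacements turning a into b."""
    prev = list(range(len(b) + 1))          # distance from empty prefix of a
    for i, x in enumerate(a, 1):
        curr = [i]                          # deleting the first i items of a
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,         # delete x
                            curr[j - 1] + 1,     # insert y
                            prev[j - 1] + cost)) # replace (or match)
        prev = curr
    return prev[-1]


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


# Word-level comparison of an MT hypothesis against a post-edited reference.
hypothesis = "the cat sat on mat".split()
reference = "the cat sat on the mat".split()
word_edits = levenshtein(hypothesis, reference)

# A moved word costs two Levenshtein edits (one delete plus one insert).
moved = levenshtein("a b c".split(), "c a b".split())
```

The last example hints at why the choice of editing actions matters: plain Levenshtein has no move operation, so relocating one word counts as two edits, whereas TER treats a shift of a word sequence as a single edit.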
