Alexandros Xenos

Discriminative Fine-tuning of LVLMs

Dec 05, 2024

Yassine Ouali, Adrian Bulat, Alexandros Xenos, Anestis Zaganidis, Ioannis Maniadis Metaxas, Georgios Tzimiropoulos, Brais Martinez

Abstract: Contrastively-trained Vision-Language Models (VLMs) like CLIP have become the de facto approach for discriminative vision-language representation learning. However, these models have limited language understanding, often exhibiting a "bag of words" behavior. At the same time, Large Vision-Language Models (LVLMs), which combine vision encoders with LLMs, have been shown to be capable of detailed vision-language reasoning, yet their autoregressive nature renders them less suitable for discriminative tasks. In this work, we propose to combine "the best of both worlds": a new training approach for discriminative fine-tuning of LVLMs that results in strong discriminative and compositional capabilities. Essentially, our approach converts a generative LVLM into a discriminative one, unlocking its capability for powerful image-text discrimination combined with enhanced language understanding. Our contributions include: (1) A carefully designed training/optimization framework that utilizes image-text pairs of variable length and granularity for training the model with both contrastive and next-token prediction losses. This is accompanied by ablation studies that justify the necessity of our framework's components. (2) A parameter-efficient adaptation method using a combination of soft prompting and LoRA adapters. (3) Significant improvements over state-of-the-art CLIP-like models of similar size, including standard image-text retrieval benchmarks and notable gains in compositionality.

* Preprint. The first two authors contributed equally
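
The abstract above mentions training with both contrastive and next-token prediction losses. As a rough illustration only, the following pure-Python sketch combines a symmetric InfoNCE-style contrastive term with a token-level cross-entropy term. The function names, the temperature of 0.07, and the weighting factor `ntp_weight` are assumptions made for this example; the listing does not state how the paper weights or implements the two losses.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def contrastive_loss(sim, temperature=0.07):
    """Symmetric InfoNCE over an n x n image-text similarity matrix.

    sim[i][j] is the similarity of image i and text j; matched pairs
    are assumed to lie on the diagonal. The temperature is an assumption.
    """
    n = len(sim)
    scaled = [[s / temperature for s in row] for row in sim]
    # Image-to-text direction: each row should pick its diagonal entry.
    i2t = -sum(log_softmax(scaled[i])[i] for i in range(n)) / n
    # Text-to-image direction: same computation over the columns.
    cols = [[scaled[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)


def next_token_loss(token_logits, target_ids):
    """Mean cross-entropy of the target token at each position."""
    total = sum(log_softmax(logits)[t] for logits, t in zip(token_logits, target_ids))
    return -total / len(target_ids)


def combined_loss(sim, token_logits, target_ids, ntp_weight=1.0):
    """Hypothetical weighted sum of the two objectives."""
    return contrastive_loss(sim) + ntp_weight * next_token_loss(token_logits, target_ids)
```

In this sketch, a batch whose matched pairs score highest on the diagonal yields a lower contrastive term than a batch whose pairs are swapped, which is the behaviour a discriminative objective is meant to reward.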

Simplifying complex machine learning by linearly separable network embedding spaces

Oct 02, 2024

Alexandros Xenos, Noel Malod-Dognin, Natasa Przulj

Abstract: Low-dimensional embeddings are a cornerstone in the modelling and analysis of complex networks. However, most existing approaches for mining network embedding spaces rely on computationally intensive machine learning systems to facilitate downstream tasks. In the field of NLP, word embedding spaces capture semantic relationships linearly, allowing for information retrieval using simple linear operations on word embedding vectors. Here, we demonstrate that there are structural properties of network data that yield this linearity. We show that the more homophilic the network representation, the more linearly separable the corresponding network embedding space, yielding better downstream analysis results. Hence, we introduce novel graphlet-based methods enabling embedding of networks into more linearly separable spaces, allowing for their better mining. Our fundamental insights into the structure of network data that enable their linear mining and exploitation allow the ML community to build upon them, towards efficient and explainable mining of complex network data.

* 26 pages, 8 figures
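
The abstract above ties linear separability to how homophilic the network representation is. One common way to quantify homophily is the edge homophily ratio, the fraction of edges that join two nodes with the same label. The sketch below computes that ratio; it is an assumption that this measure is relevant here, since the listing does not say which homophily definition the paper uses, and the function name is hypothetical.

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints share a label.

    edges  : iterable of (u, v) node pairs
    labels : dict mapping each node to its label
    Note: this is the generic edge homophily ratio, not necessarily
    the measure used in the paper.
    """
    edges = list(edges)
    if not edges:
        raise ValueError("edge_homophily needs at least one edge")
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)
```

For a path a-b-c-d with labels {a: 0, b: 0, c: 1, d: 1}, two of the three edges connect same-label nodes, so the ratio is 2/3.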

VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning

Apr 10, 2024

Alexandros Xenos, Niki Maria Foteinopoulou, Ioanna Ntinou, Ioannis Patras, Georgios Tzimiropoulos

Abstract: Recognising emotions in context involves identifying the apparent emotions of an individual, taking into account contextual cues from the surrounding scene. Previous approaches to this task have involved the design of explicit scene-encoding architectures or the incorporation of external scene-related information, such as captions. However, these methods often utilise limited contextual information or rely on intricate training pipelines. In this work, we leverage the capabilities of Vision-and-Large-Language Models (VLLMs) to enhance in-context emotion classification through a two-stage approach that adds no complexity to the training process. In the first stage, we propose prompting VLLMs to generate natural-language descriptions of the subject's apparent emotion relative to the visual context. In the second stage, the descriptions are used as contextual information and, along with the image input, are used to train a transformer-based architecture that fuses text and visual features before the final classification task. Our experimental results show that the text and image features carry complementary information, and our fused architecture significantly outperforms the individual modalities without any complex training methods. We evaluate our approach on three different datasets, namely EMOTIC, CAER-S, and BoLD, and achieve state-of-the-art or comparable accuracy across all datasets and metrics compared to much more complex approaches. The code will be made publicly available on GitHub: https://github.com/NickyFot/EmoCommonSense.git

* A. Xenos, N. Foteinopoulou and I. Ntinou contributed equally to this work; 14 pages, 5 figures

A Simple Baseline for Knowledge-Based Visual Question Answering

Oct 24, 2023

Alexandros Xenos, Themos Stafylakis, Ioannis Patras, Georgios Tzimiropoulos

Abstract: This paper addresses the problem of Knowledge-Based Visual Question Answering (KB-VQA). Recent works have emphasized the significance of incorporating both explicit (through external databases) and implicit (through LLMs) knowledge to answer questions requiring external knowledge effectively. A common limitation of such approaches is that they consist of relatively complicated pipelines and often rely heavily on access to the GPT-3 API. Our main contribution in this paper is to propose a much simpler and readily reproducible pipeline which, in a nutshell, is based on efficient in-context learning by prompting LLaMA (1 and 2) using question-informative captions as contextual information. Contrary to recent approaches, our method is training-free, does not require access to external databases or APIs, and yet achieves state-of-the-art accuracy on the OK-VQA and A-OK-VQA datasets. Finally, we perform several ablation studies to understand important aspects of our method. Our code is publicly available at https://github.com/alexandrosXe/ASimple-Baseline-For-Knowledge-Based-VQA

* Accepted at EMNLP 2023 (camera-ready version)
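
The abstract above describes in-context learning in which an LLM is prompted with question-informative captions as context. The sketch below shows one plausible way to assemble such a few-shot prompt as plain text. The prompt template, the instruction wording, the field names, and the function `build_prompt` are all hypothetical and chosen for illustration; the authors' actual format is in their repository.

```python
def build_prompt(shots, caption, question,
                 instruction="Answer the question using the context."):
    """Assemble a few-shot prompt from captioned question-answer examples.

    shots    : list of dicts with 'caption', 'question', 'answer' keys
               (hypothetical field names)
    caption  : caption of the test image, used as textual context
    question : the test question, left unanswered for the LLM to complete
    """
    blocks = [instruction]
    for shot in shots:
        blocks.append(
            f"Context: {shot['caption']}\n"
            f"Question: {shot['question']}\n"
            f"Answer: {shot['answer']}"
        )
    # The final block ends with an open "Answer:" for the model to fill in.
    blocks.append(f"Context: {caption}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(blocks)
```

The resulting string would be passed to a text-only language model, which fits the training-free, API-free setup described in the abstract: only captions and questions enter the prompt.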

Toxicity Detection can be Sensitive to the Conversational Context

Nov 19, 2021

Alexandros Xenos, John Pavlopoulos, Ion Androutsopoulos, Lucas Dixon, Jeffrey Sorensen, Leo Laugier

Abstract: User posts whose perceived toxicity depends on the conversational context are rare in current toxicity detection datasets. Hence, toxicity detectors trained on existing datasets will also tend to disregard context, making the detection of context-sensitive toxicity harder when it does occur. We construct and publicly release a dataset of 10,000 posts with two kinds of toxicity labels: (i) annotators considered each post with the previous one as context; and (ii) annotators had no additional context. Based on this, we introduce a new task, context sensitivity estimation, which aims to identify posts whose perceived toxicity changes if the context (previous post) is also considered. We then evaluate machine learning systems on this task, showing that classifiers of practical quality can be developed, and we show that data augmentation with knowledge distillation can improve the performance further. Such systems could be used to enhance toxicity detection datasets with more context-dependent posts, or to suggest when moderators should consider the parent posts, which often may be unnecessary and may otherwise introduce significant additional cost.

* 13 pages, 8 figures
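
The abstract above defines context sensitivity in terms of two sets of labels per post, one gathered with the previous post shown and one without. A minimal sketch of how such a score could be derived is given below: it takes the difference between the fractions of annotators who marked the post toxic under each condition. The function names and the 0.2 threshold are assumptions for illustration, not the paper's definition or settings.

```python
def toxic_fraction(votes):
    """Fraction of annotators who labelled the post toxic (votes are 0 or 1)."""
    return sum(votes) / len(votes)


def context_sensitivity(votes_with_context, votes_without_context):
    """Signed change in perceived toxicity when the previous post is shown.

    Positive values mean the post looks more toxic in context,
    negative values mean it looks less toxic.
    """
    return toxic_fraction(votes_with_context) - toxic_fraction(votes_without_context)


def is_context_sensitive(votes_with_context, votes_without_context, threshold=0.2):
    """Flag a post whose toxicity shifts by at least a hypothetical threshold."""
    delta = context_sensitivity(votes_with_context, votes_without_context)
    return abs(delta) >= threshold
```

For example, if three of four annotators call a post toxic when they see its parent but only one of four does so without it, the score is 0.75 - 0.25 = 0.5 and the post would be flagged under this threshold.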
