Giuseppe Carenini

University of British Columbia

FM2DS: Few-Shot Multimodal Multihop Data Synthesis with Knowledge Distillation for Question Answering

Dec 09, 2024

Amirhossein Abaskohi, Spandana Gella, Giuseppe Carenini, Issam H. Laradji

Abstract: Multimodal multihop question answering is a complex task that requires reasoning over multiple sources of information, such as images and text, to answer questions. While there has been significant progress in visual question answering, the multihop setting remains unexplored due to the lack of high-quality datasets. Current methods focus on single-hop question answering or a single modality, which makes them unsuitable for real-world scenarios such as analyzing multimodal educational materials, summarizing lengthy academic articles, or interpreting scientific studies that combine charts, images, and text. To address this gap, we propose a novel methodology, introducing the first framework for creating a high-quality dataset that enables training models for multimodal multihop question answering. Our approach consists of a 5-stage pipeline that involves acquiring relevant multimodal documents from Wikipedia, synthetically generating high-level questions and answers, and validating them through rigorous criteria to ensure data quality. We evaluate our methodology by training models on our synthesized dataset and testing on two benchmarks. Our results demonstrate that, with an equal sample size, models trained on our synthesized data outperform those trained on human-collected data by 1.9 in exact match (EM) on average. We believe our data synthesis method will serve as a strong foundation for training and evaluating multimodal multihop question answering models.

* 20 pages, 11 figures, 10 tables, Submitted to CVPR 2025
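
The abstract reports gains in exact match (EM). As a rough illustration of how such a score is commonly computed, here is a minimal sketch. The normalization steps (lowercasing, stripping punctuation and English articles) follow a widespread QA convention and are an assumption; the listing does not say which evaluation script the authors used, and the function names are hypothetical.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> int:
    """Return 1 if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return int(any(pred == normalize_answer(gold) for gold in gold_answers))


def em_score(predictions: list[str], references: list[list[str]]) -> float:
    """Average exact match over a dataset, as a percentage."""
    hits = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return 100.0 * hits / len(predictions)
```

Under this convention, a difference of 1.9 in EM would correspond to roughly 19 more exactly matched answers per 1,000 questions.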

Captioning Visualizations with Large Language Models (CVLLM): A Tutorial

Jun 27, 2024

Giuseppe Carenini, Jordon Johnson, Ali Salamatian

Abstract: Automatically captioning visualizations is not new, but recent advances in large language models (LLMs) open exciting new possibilities. In this tutorial, after providing a brief review of Information Visualization (InfoVis) principles and past work in captioning, we introduce neural models and the transformer architecture used in generic LLMs. We then discuss their recent applications in InfoVis, with a focus on captioning. Additionally, we explore promising future directions in this field.

* 6 pages, 4 figures

DeTriever: Decoder-representation-based Retriever for Improving NL2SQL In-Context Learning

Jun 12, 2024

Yuxi Feng, Raymond Li, Zhenan Fan, Giuseppe Carenini, Mohammadreza Pourreza, Weiwei Zhang, Yong Zhang

Abstract: While in-context learning (ICL) has proven to be an effective technique to improve the performance of Large Language Models (LLMs) in a variety of complex tasks, notably in translating natural language questions into Structured Query Language (NL2SQL), the question of how to select the most beneficial demonstration examples remains an open research problem. While prior work often adapted off-the-shelf encoders to retrieve examples dynamically, an inherent discrepancy exists in the representational capacities between the external retrievers and the LLMs. Further, optimizing the selection of examples is a non-trivial task, since there are no straightforward methods to assess the relative benefits of examples without performing pairwise inference. To address these shortcomings, we propose DeTriever, a novel demonstration retrieval framework that learns a weighted combination of LLM hidden states, where rich semantic information is encoded. To train the model, we propose a proxy score that estimates the relative benefits of examples based on the similarities between output queries. Experiments on two popular NL2SQL benchmarks demonstrate that our method significantly outperforms the state-of-the-art baselines on one-shot NL2SQL tasks.
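
To make the idea of retrieving demonstrations from "a weighted combination of LLM hidden states" more concrete, the sketch below combines one pooled vector per layer using softmax-normalized weights and ranks candidate examples by cosine similarity. This is only an illustration under stated assumptions: the softmax parameterization, the per-layer pooling, the cosine ranking, and all function names are hypothetical, and the abstract does not describe how the weights are trained beyond the proxy score.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_layers(layer_vectors, layer_logits):
    """Weighted sum of per-layer vectors; weights come from learnable logits (assumed)."""
    weights = softmax(layer_logits)
    dim = len(layer_vectors[0])
    return [sum(w * vec[i] for w, vec in zip(weights, layer_vectors)) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def retrieve(query_vec, pool_vecs, k=1):
    """Indices of the k pool examples most similar to the query."""
    ranked = sorted(range(len(pool_vecs)),
                    key=lambda i: cosine(query_vec, pool_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

In a one-shot NL2SQL setting, the top-ranked index would select the single question and SQL pair placed in the prompt.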

BCAmirs at SemEval-2024 Task 4: Beyond Words: A Multimodal and Multilingual Exploration of Persuasion in Memes

Apr 03, 2024

Amirhossein Abaskohi, Amirhossein Dabiriaghdam, Lele Wang, Giuseppe Carenini

Abstract: Memes, combining text and images, frequently use metaphors to convey persuasive messages, shaping public opinion. Motivated by this, our team engaged in SemEval-2024 Task 4, a hierarchical multi-label classification task designed to identify rhetorical and psychological persuasion techniques embedded within memes. To tackle this problem, we introduced a caption generation step to assess the modality gap and the impact of additional semantic information from images, which improved our result. Our best model utilizes GPT-4 generated captions alongside meme text to fine-tune RoBERTa as the text encoder and CLIP as the image encoder. It outperforms the baseline by a large margin in all 12 subtasks. In particular, it ranked in the top 3 across all languages in Subtask 2a, and in the top 4 in Subtask 2b, demonstrating quantitatively strong performance. The improvement achieved by the introduced intermediate step is likely attributable to the metaphorical essence of images that challenges visual encoders. This highlights the potential for improving abstract visual semantics encoding.

* 11 pages, 5 tables, 2 figures, Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024) @ NAACL 2024

Neural Multimodal Topic Modeling: A Comprehensive Evaluation

Mar 26, 2024

Felipe González-Pizarro, Giuseppe Carenini

Abstract: Neural topic models can successfully find coherent and diverse topics in textual data. However, they are limited in dealing with multimodal datasets (e.g., images and text). This paper presents the first systematic and comprehensive evaluation of multimodal topic modeling of documents containing both text and images. In the process, we propose two novel topic modeling solutions and two novel evaluation metrics. Overall, our evaluation on an unprecedentedly rich and diverse collection of datasets indicates that both of our models generate coherent and diverse topics. Nevertheless, the extent to which one method outperforms the other depends on the metrics and dataset combinations, which suggests further exploration of hybrid solutions in the future. Notably, our succinct human evaluation aligns with the outcomes determined by our proposed metrics. This alignment not only reinforces the credibility of our metrics but also highlights the potential for their application in guiding future multimodal topic modeling endeavors.

* Camera-Ready for LREC-COLING 2024 (Long Paper)

Multi-Modal Video Topic Segmentation with Dual-Contrastive Domain Adaptation

Nov 30, 2023

Linzi Xing, Quan Tran, Fabian Caba, Franck Dernoncourt, Seunghyun Yoon, Zhaowen Wang, Trung Bui, Giuseppe Carenini

Abstract: Video topic segmentation unveils the coarse-grained semantic structure underlying videos and is essential for other video understanding tasks. Given the recent surge in multi-modal content, relying solely on a single modality is arguably insufficient. On the other hand, prior solutions for similar tasks like video scene/shot segmentation cater to short videos with clear visual shifts but falter for long videos with subtle changes, such as livestreams. In this paper, we introduce a multi-modal video topic segmenter that utilizes both video transcripts and frames, bolstered by a cross-modal attention mechanism. Furthermore, we propose a dual-contrastive learning framework adhering to the unsupervised domain adaptation paradigm, enhancing our model's adaptability to longer, more semantically complex videos. Experiments on short and long video corpora demonstrate that our proposed solution significantly surpasses baseline methods in terms of both accuracy and transferability, in both intra- and cross-domain settings.

* Accepted at the 30th International Conference on Multimedia Modeling (MMM 2024)

Tracing Influence at Scale: A Contrastive Learning Approach to Linking Public Comments and Regulator Responses

Nov 24, 2023

Linzi Xing, Brad Hackinen, Giuseppe Carenini

Abstract: U.S. federal regulators receive over one million comment letters each year from businesses, interest groups, and members of the public, all advocating for changes to proposed regulations. These comments are believed to have wide-ranging impacts on public policy. However, measuring the impact of specific comments is challenging because regulators are required to respond to comments but do not have to specify which comments they are addressing. In this paper, we propose a simple yet effective solution to this problem by using an iterative contrastive method to train a neural model to match text from public comments to responses written by regulators. We demonstrate that our proposal substantially outperforms a set of selected text-matching baselines on a human-annotated test set. Furthermore, it delivers performance comparable to that of the most advanced large language model (i.e., GPT-4), and is more cost-effective when matching comments and regulator responses at larger scale.

* Accepted to the Natural Legal Language Processing Workshop 2023 (NLLP 2023)
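
For readers unfamiliar with contrastive training for text matching, the following sketch shows a generic in-batch contrastive (InfoNCE-style) loss over comment and response embeddings. It is not the authors' iterative procedure, which the abstract does not detail; the loss form, the temperature value, and the function names are assumptions chosen for illustration.

```python
import math


def _unit(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(comment_vecs, response_vecs, temperature=0.1):
    """In-batch contrastive loss: comment i should match response i,
    and every other response in the batch acts as a negative."""
    comments = [_unit(v) for v in comment_vecs]
    responses = [_unit(v) for v in response_vecs]
    losses = []
    for i, comment in enumerate(comments):
        logits = [sum(a * b for a, b in zip(comment, resp)) / temperature
                  for resp in responses]
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_norm - logits[i])  # negative log-softmax of the true pair
    return sum(losses) / len(losses)
```

The loss is small when each comment embedding is closest to its own response and grows when the pairs are misaligned, which is the signal a matching model would be trained on.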

Visual Analytics for Generative Transformer Models

Nov 21, 2023

Raymond Li, Ruixin Yang, Wen Xiao, Ahmed AbuRaed, Gabriel Murray, Giuseppe Carenini

Abstract: While transformer-based models have achieved state-of-the-art results in a variety of classification and generation tasks, their black-box nature makes them challenging to interpret. In this work, we present a novel visual analytical framework to support the analysis of transformer-based generative networks. In contrast to previous work, which has mainly focused on encoder-based models, our framework is one of the first dedicated to supporting the analysis of transformer-based encoder-decoder models and decoder-only models for generative and classification tasks. Hence, we offer an intuitive overview that allows the user to explore different facets of the model through interactive visualization. To demonstrate the feasibility and usefulness of our framework, we present three detailed case studies based on real-world NLP research problems.

* 6 pages (reference excluded), 7 figures

Mixture-of-Linguistic-Experts Adapters for Improving and Interpreting Pre-trained Language Models

Oct 24, 2023

Raymond Li, Gabriel Murray, Giuseppe Carenini

Abstract: In this work, we propose a method that combines two popular research areas by injecting linguistic structures into pre-trained language models in the parameter-efficient fine-tuning (PEFT) setting. In our approach, parallel adapter modules encoding different linguistic structures are combined using a novel Mixture-of-Linguistic-Experts architecture, where Gumbel-Softmax gates are used to determine the importance of these modules at each layer of the model. To reduce the number of parameters, we first train the model for a fixed small number of steps before pruning the experts based on their importance scores. Our experimental results with three different pre-trained models show that our approach can outperform state-of-the-art PEFT methods with a comparable number of parameters. In addition, we analyze the experts selected by each model at each layer to provide insights for future studies.

* 14 pages, 3 figures, Camera-Ready for EMNLP 2023 Findings (Long Paper)
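
The Gumbel-Softmax gating mentioned in the abstract can be sketched as follows: Gumbel noise is added to learnable gate logits, the result is divided by a temperature and passed through a softmax, and the resulting weights mix the outputs of parallel adapters. This is a minimal illustration, not the paper's implementation; the function names, the temperature, and the plain-list representation are assumptions, and the pruning step is not shown.

```python
import math
import random


def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Sample soft gate weights: softmax((logits + Gumbel noise) / temperature)."""
    noisy = []
    for logit in logits:
        # Clamp the uniform sample away from 0 and 1 so both logs are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def mix_adapters(adapter_outputs, gate_logits, temperature=1.0, rng=random):
    """Combine parallel adapter outputs using the sampled gate weights."""
    gates = gumbel_softmax(gate_logits, temperature, rng)
    dim = len(adapter_outputs[0])
    mixed = [sum(g * out[i] for g, out in zip(gates, adapter_outputs))
             for i in range(dim)]
    return mixed, gates
```

Lower temperatures push the sampled gates toward a near one-hot choice, which is one reason gate values can plausibly serve as per-layer importance scores.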

Diversity-Aware Coherence Loss for Improving Neural Topic Models

May 26, 2023

Raymond Li, Felipe González-Pizarro, Linzi Xing, Gabriel Murray, Giuseppe Carenini

Abstract: The standard approach for neural topic modeling uses a variational autoencoder (VAE) framework that jointly minimizes the KL divergence between the estimated posterior and prior, in addition to the reconstruction loss. Since neural topic models are trained by recreating individual input documents, they do not explicitly capture the coherence between topic words on the corpus level. In this work, we propose a novel diversity-aware coherence loss that encourages the model to learn corpus-level coherence scores while maintaining a high diversity between topics. Experimental results on multiple datasets show that our method significantly improves the performance of neural topic models without requiring any pretraining or additional parameters.

* Minor Fixes, 11 pages, Camera-Ready for ACL 2023 (Short Paper)
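
The baseline objective the abstract describes, a reconstruction loss plus a KL term between the estimated posterior and the prior, is usually written as the per-document negative evidence lower bound. The notation below is the standard VAE convention and is an assumption, not taken from the paper; the proposed diversity-aware coherence loss is not specified in the abstract and is therefore not shown.

```latex
\mathcal{L}_{\text{VAE}}(x)
  = \underbrace{-\,\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]}_{\text{reconstruction loss}}
  + \underbrace{\mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)}_{\text{posterior vs. prior}}
```

Here $x$ is a document's bag-of-words representation, $z$ its latent topic vector, $q_\phi$ the encoder (estimated posterior), $p_\theta$ the decoder, and $p(z)$ the prior. Because both terms are computed per document, nothing in this objective directly measures corpus-level coherence, which is the gap the abstract points to.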
