Tong Xu

Woodpecker: Hallucination Correction for Multimodal Large Language Models

Oct 24, 2023
Shukang Yin, Chaoyou Fu, Sirui Zhao, Tong Xu, Hao Wang, Dianbo Sui, Yunhang Shen, Ke Li, Xing Sun, Enhong Chen

Hallucination is a big shadow hanging over the rapidly evolving Multimodal Large Language Models (MLLMs), referring to the phenomenon in which the generated text is inconsistent with the image content. To mitigate hallucinations, existing studies mainly resort to instruction tuning, which requires retraining the models with specific data. In this paper, we pave a different way and introduce a training-free method named Woodpecker. Like a woodpecker heals trees, it picks out and corrects hallucinations in the generated text. Concretely, Woodpecker consists of five stages: key concept extraction, question formulation, visual knowledge validation, visual claim generation, and hallucination correction. Implemented in a post-remedy manner, Woodpecker can easily serve different MLLMs while remaining interpretable, since the intermediate outputs of the five stages can be inspected. We evaluate Woodpecker both quantitatively and qualitatively and show the great potential of this new paradigm. On the POPE benchmark, our method obtains a 30.66%/24.33% improvement in accuracy over the baselines MiniGPT-4/mPLUG-Owl. The source code is released at https://github.com/BradyFU/Woodpecker.

* 16 pages, 7 figures. Code Website: https://github.com/BradyFU/Woodpecker 
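
To make the five-stage pipeline concrete, here is a minimal Python sketch of a Woodpecker-style post-hoc correction loop. The stage functions are hypothetical placeholders standing in for the LLM, VQA, and detector calls used in the paper; only the overall control flow and the interpretable trace of intermediate outputs follow the description above.

```python
# Minimal sketch of a training-free, post-remedy correction pipeline.
# All five stage functions are hypothetical placeholders, NOT the authors' implementation.

from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CorrectionTrace:
    """Keeps the intermediate outputs of every stage, which is what makes
    the post-remedy approach interpretable."""
    key_concepts: List[str] = field(default_factory=list)
    questions: List[str] = field(default_factory=list)
    visual_claims: List[str] = field(default_factory=list)
    corrected_text: str = ""


def extract_key_concepts(answer: str) -> List[str]:
    # Placeholder: an LLM would extract the main objects/attributes mentioned.
    return [w.strip(".,") for w in answer.split() if w.istitle()]


def formulate_questions(concepts: List[str]) -> List[str]:
    # Placeholder: turn each concept into verification questions.
    return [f"Is there a {c.lower()} in the image?" for c in concepts]


def validate_with_visual_knowledge(image, questions: List[str]) -> Dict[str, bool]:
    # Placeholder: an open-set detector / VQA model would answer each question;
    # here every question is simply answered "yes".
    return {q: True for q in questions}


def generate_visual_claims(validation: Dict[str, bool]) -> List[str]:
    # Placeholder: convert validated answers into explicit claims about the image.
    claims = []
    for question, present in validation.items():
        obj = question[len("Is there a "):].rstrip("?")   # e.g. "cat in the image"
        claims.append(f"There is a {obj}." if present else f"There is no {obj}.")
    return claims


def correct_hallucinations(answer: str, claims: List[str]) -> str:
    # Placeholder: an LLM would rewrite sentences contradicted by the claims;
    # the answer is returned unchanged here.
    return answer


def woodpecker_style_correct(image, mllm_answer: str) -> CorrectionTrace:
    trace = CorrectionTrace()
    trace.key_concepts = extract_key_concepts(mllm_answer)
    trace.questions = formulate_questions(trace.key_concepts)
    validation = validate_with_visual_knowledge(image, trace.questions)
    trace.visual_claims = generate_visual_claims(validation)
    trace.corrected_text = correct_hallucinations(mllm_answer, trace.visual_claims)
    return trace
```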

CgT-GAN: CLIP-guided Text GAN for Image Captioning

Aug 23, 2023
Jiarui Yu, Haoran Li, Yanbin Hao, Bin Zhu, Tong Xu, Xiangnan He

The large-scale vision-language pre-trained model Contrastive Language-Image Pre-training (CLIP) has significantly improved image captioning for scenarios without human-annotated image-caption pairs. Recent advanced CLIP-based image captioning without human annotations follows a text-only training paradigm, i.e., reconstructing text from a shared embedding space. Nevertheless, these approaches are limited by the training/inference gap or by huge storage requirements for text embeddings. Given that it is trivial to obtain images in the real world, we propose the CLIP-guided text GAN (CgT-GAN), which incorporates images into the training process to enable the model to "see" the real visual modality. Specifically, we use adversarial training to teach CgT-GAN to mimic the phrasing of an external text corpus and a CLIP-based reward to provide semantic guidance. The caption generator is jointly rewarded by a naturalness score, calculated from the GAN's discriminator, and a semantic guidance reward, computed by the CLIP-based reward module. In addition to cosine similarity as the semantic guidance reward (i.e., CLIP-cos), we further introduce a novel semantic guidance reward called CLIP-agg, which aligns the generated caption with a weighted text embedding obtained by attentively aggregating the entire corpus. Experimental results on three subtasks (ZS-IC, In-UIC and Cross-UIC) show that CgT-GAN significantly outperforms state-of-the-art methods across all metrics. Code is available at https://github.com/Lihr747/CgtGAN.

* Accepted at ACM MM 2023 
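
As a rough illustration of the reward design described above, the following numpy sketch computes the two semantic guidance rewards (CLIP-cos and CLIP-agg) from pre-computed CLIP embeddings and combines one of them with a discriminator naturalness score. The attention temperature, the equal reward weighting, and the use of plain arrays are assumptions for exposition, not the paper's exact formulation.

```python
# Hedged sketch of CLIP-cos and CLIP-agg style rewards on pre-computed embeddings.
import numpy as np


def l2_normalize(x: np.ndarray, axis: int = -1) -> np.ndarray:
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)


def clip_cos_reward(image_emb: np.ndarray, caption_emb: np.ndarray) -> float:
    """CLIP-cos: cosine similarity between the image and the generated caption."""
    return float(l2_normalize(image_emb) @ l2_normalize(caption_emb))


def clip_agg_reward(image_emb: np.ndarray, caption_emb: np.ndarray,
                    corpus_embs: np.ndarray, temperature: float = 0.07) -> float:
    """CLIP-agg: align the caption with a weighted aggregation of the external
    text corpus, where the weights come from image-to-corpus attention."""
    img = l2_normalize(image_emb)
    corpus = l2_normalize(corpus_embs, axis=1)      # (N, d) corpus text embeddings
    attn = np.exp((corpus @ img) / temperature)
    attn /= attn.sum()
    aggregated = l2_normalize(attn @ corpus)        # attentively aggregated corpus embedding
    return float(l2_normalize(caption_emb) @ aggregated)


def total_reward(discriminator_score: float, semantic_reward: float,
                 alpha: float = 0.5) -> float:
    """Joint reward: naturalness (from the GAN discriminator) plus semantic guidance.
    The 50/50 weighting is an illustrative assumption."""
    return alpha * discriminator_score + (1.0 - alpha) * semantic_reward
```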

Multi-Grained Multimodal Interaction Network for Entity Linking

Jul 19, 2023
Pengfei Luo, Tong Xu, Shiwei Wu, Chen Zhu, Linli Xu, Enhong Chen

The multimodal entity linking (MEL) task, which aims to resolve ambiguous mentions against a multimodal knowledge graph, has attracted wide attention in recent years. Although substantial efforts have been made to explore the complementary effects among multiple modalities, existing methods may fail to fully absorb the comprehensive expression of abbreviated textual context and implicit visual indication. Even worse, inevitably noisy data may cause inconsistency between modalities during the learning process, which severely degrades performance. To address these issues, in this paper we propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for the MEL task. Specifically, the unified inputs of mentions and entities are first encoded by textual and visual encoders separately to extract global descriptive features and local detailed features. Then, to derive the similarity matching score for each mention-entity pair, we devise three interaction units to comprehensively explore intra-modal interaction and inter-modal fusion between the features of entities and mentions. In particular, three modules, namely the Text-based Global-Local interaction Unit (TGLU), the Vision-based DuaL interaction Unit (VDLU), and the Cross-Modal Fusion-based interaction Unit (CMFU), are designed to capture and integrate the fine-grained representations carried by abbreviated text and implicit visual cues. Afterwards, we introduce a unit-consistency objective function via contrastive learning to avoid inconsistency and model degradation. Experimental results on three public benchmark datasets demonstrate that our solution outperforms various state-of-the-art baselines, and ablation studies verify the effectiveness of the designed modules.

* Accepted by KDD 2023 
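
The sketch below illustrates, under simplifying assumptions, how three interaction-unit scores (standing in for TGLU, VDLU, and CMFU) could be fused into a mention-entity matching score, together with a toy contrastive unit-consistency term. The dot-product units, mean fusion, and InfoNCE-style loss are illustrative choices, not the paper's architecture.

```python
# Toy illustration of fusing three interaction-unit scores and encouraging
# the units to agree via a contrastive-style consistency term.
import numpy as np


def unit_score(mention_feat: np.ndarray, entity_feat: np.ndarray) -> float:
    """Stand-in for one interaction unit (TGLU / VDLU / CMFU): here simply a
    normalized dot product between that unit's mention and entity features."""
    m = mention_feat / (np.linalg.norm(mention_feat) + 1e-8)
    e = entity_feat / (np.linalg.norm(entity_feat) + 1e-8)
    return float(m @ e)


def matching_score(mention_units, entity_units) -> float:
    """Fuse the three units' scores; a plain average is used for illustration."""
    scores = [unit_score(m, e) for m, e in zip(mention_units, entity_units)]
    return float(np.mean(scores))


def unit_consistency_loss(unit_scores: np.ndarray, temperature: float = 0.1) -> float:
    """Toy unit-consistency objective. unit_scores has shape (units, candidates),
    with the gold entity in column 0; an InfoNCE-style loss per unit pushes every
    unit to rank the gold entity highest, so the units stay consistent."""
    logits = unit_scores / temperature
    log_probs = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return float(-log_probs.mean())
```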

A Solution to CVPR'2023 AQTC Challenge: Video Alignment for Multi-Step Inference

Jun 26, 2023
Chao Zhang, Shiwei Wu, Sirui Zhao, Tong Xu, Enhong Chen

Affordance-centric Question-driven Task Completion (AQTC) for Egocentric Assistant introduces a novel scenario in which AI assistants, by learning from instructional videos, provide users with step-by-step guidance on operating devices. In this paper, we present a solution that enhances video alignment to improve multi-step inference. Specifically, we first utilize VideoCLIP to generate video-script alignment features. Afterwards, we ground the question-relevant content in the instructional videos. Then, we reweight the multimodal context to emphasize prominent features. Finally, we adopt a GRU to conduct multi-step inference. Through comprehensive experiments, we demonstrate the effectiveness and superiority of our method, which secured 2nd place in the CVPR'2023 AQTC challenge. Our code is available at https://github.com/zcfinal/LOVEU-CVPR23-AQTC.

* 5 pages, 1 figure, technical report for track3 of CVPR 2023 LOVEU challenge 
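
A hedged PyTorch sketch of the described flow is given below: question-guided grounding and reweighting of per-step alignment features, followed by a GRU for multi-step inference. The VideoCLIP alignment features are assumed to be pre-computed inputs, and the feature dimensions and scoring head are illustrative assumptions rather than the authors' exact model.

```python
# Sketch: question-guided reweighting of per-step features + GRU multi-step inference.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiStepReasoner(nn.Module):
    def __init__(self, feat_dim: int = 512, hidden_dim: int = 256):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.score_head = nn.Linear(hidden_dim, 1)

    def forward(self, step_feats: torch.Tensor, question_feat: torch.Tensor) -> torch.Tensor:
        """step_feats: (B, T, D) pre-computed video-script alignment features per step.
        question_feat: (B, D) encoded question. Returns per-step scores of shape (B, T)."""
        # Ground question-relevant content: attention weights from question to steps.
        weights = F.softmax(
            torch.einsum("btd,bd->bt", step_feats, question_feat) / step_feats.size(-1) ** 0.5,
            dim=1,
        )
        # Reweight the multimodal context to emphasize prominent features.
        reweighted = step_feats * weights.unsqueeze(-1)
        # Multi-step inference with a GRU over the step sequence.
        hidden, _ = self.gru(reweighted)
        return self.score_head(hidden).squeeze(-1)


if __name__ == "__main__":
    model = MultiStepReasoner()
    scores = model(torch.randn(2, 5, 512), torch.randn(2, 512))
    print(scores.shape)  # torch.Size([2, 5])
```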

A Survey on Multimodal Large Language Models

Jun 23, 2023
Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, Enhong Chen

Multimodal Large Language Models (MLLMs), which use powerful Large Language Models (LLMs) as a brain to perform multimodal tasks, have recently become a rising research hotspot. The surprising emergent capabilities of MLLMs, such as writing stories based on images and OCR-free math reasoning, are rarely seen in traditional methods, suggesting a potential path to artificial general intelligence. In this paper, we aim to trace and summarize the recent progress of MLLMs. First, we present the formulation of the MLLM and delineate its related concepts. Then, we discuss the key techniques and applications, including Multimodal Instruction Tuning (M-IT), Multimodal In-Context Learning (M-ICL), Multimodal Chain of Thought (M-CoT), and LLM-Aided Visual Reasoning (LAVR). Finally, we discuss existing challenges and point out promising research directions. Since the era of MLLMs has only just begun, we will keep updating this survey and hope it can inspire more research. An associated GitHub repository collecting the latest papers is available at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.

* Project page:https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models 

Spatial Heterophily Aware Graph Neural Networks

Jun 21, 2023
Congxi Xiao, Jingbo Zhou, Jizhou Huang, Tong Xu, Hui Xiong

Graph Neural Networks (GNNs) have been broadly applied in many urban applications by formulating a city as an urban graph whose nodes are urban objects such as regions or points of interest. Recently, a few enhanced GNN architectures have been developed to tackle heterophilic graphs, where connected nodes are dissimilar. However, urban graphs can usually be observed to possess a unique spatial heterophily property: the dissimilarity of neighbors at different spatial distances can exhibit great diversity. Although this property often exists, it has not been explored. To this end, in this paper, we propose a metric, named Spatial Diversity Score, to quantitatively measure spatial heterophily and show how it can influence the performance of GNNs. Indeed, our experimental investigation clearly shows that existing heterophilic GNNs are still deficient in handling urban graphs with a high spatial diversity score, which in turn may degrade their effectiveness in urban applications. Along this line, we propose a Spatial Heterophily Aware Graph Neural Network (SHGNN) to tackle the spatial diversity of heterophily in urban graphs. Based on the key observation that spatially close neighbors on the urban graph present a more similar mode of difference to the central node, we first design a rotation-scaling spatial aggregation module, whose core idea is to properly group the spatially close neighbors and separately process each group, which has less diversity inside. Then, a heterophily-sensitive spatial interaction module is designed to adaptively capture the commonality and diverse dissimilarity in different spatial groups. Extensive experiments on three real-world urban datasets demonstrate the superiority of our SHGNN over several competitors.

* Accepted by KDD 2023 
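
The following numpy sketch illustrates the grouping idea behind the rotation-scaling spatial aggregation module: neighbors are partitioned into distance rings and direction sectors around the central node, and each group is aggregated separately. The ring and sector counts and the mean aggregation are assumptions for illustration, not the paper's exact operators.

```python
# Sketch: partition spatial neighbors into (ring, sector) groups and aggregate per group.
import numpy as np


def group_and_aggregate(center_xy: np.ndarray,
                        neighbor_xy: np.ndarray,
                        neighbor_feats: np.ndarray,
                        n_rings: int = 3,
                        n_sectors: int = 4) -> np.ndarray:
    """Returns one aggregated feature per (ring, sector) group, shape
    (n_rings * n_sectors, feat_dim); empty groups yield zero vectors."""
    offsets = neighbor_xy - center_xy
    dist = np.linalg.norm(offsets, axis=1)
    angle = np.arctan2(offsets[:, 1], offsets[:, 0])            # in [-pi, pi)

    # Scaling: equal-width distance rings up to the farthest neighbor.
    ring = np.minimum((dist / (dist.max() + 1e-8) * n_rings).astype(int), n_rings - 1)
    # Rotation: equal-angle direction sectors.
    sector = ((angle + np.pi) / (2 * np.pi) * n_sectors).astype(int) % n_sectors

    out = np.zeros((n_rings * n_sectors, neighbor_feats.shape[1]))
    for g in range(n_rings * n_sectors):
        mask = (ring * n_sectors + sector) == g
        if mask.any():
            out[g] = neighbor_feats[mask].mean(axis=0)          # per-group aggregation
    return out
```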

Multi-Temporal Relationship Inference in Urban Areas

Jun 15, 2023
Shuangli Li, Jingbo Zhou, Ji Liu, Tong Xu, Enhong Chen, Hui Xiong

Finding multiple temporal relationships among locations can benefit a range of urban applications, such as dynamic offline advertising and smart public transport planning. While some efforts have been made to find static relationships among locations, little attention has been paid to time-aware location relationships. Indeed, abundant location-based human activities are time-varying, and the availability of such data enables a new paradigm for understanding the dynamic relationships among connected locations over a period of time. To this end, we propose to study a new problem, namely multi-Temporal relationship inference among locations (Trial for short), where the major challenge is how to integrate dynamic and geographical influence under the relationship sparsity constraint. Specifically, we propose a solution to Trial with a graph learning scheme, which includes a spatially evolving graph neural network (SEENet) with two collaborative components: a spatially evolving graph convolution module (SEConv) and a spatially evolving self-supervised learning strategy (SE-SSL). SEConv performs intra-time aggregation and inter-time propagation to capture multifaceted, spatially evolving contexts from the view of location message passing. In addition, SE-SSL designs time-aware self-supervised learning tasks in a global-local manner with an additional evolving constraint to enhance location representation learning and further handle relationship sparsity. Finally, experiments on four real-world datasets demonstrate the superiority of our method over several state-of-the-art approaches.

* Accepted by KDD 2023. Code and data: https://github.com/agave233/SEENet 
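
A toy numpy sketch of the two message-passing directions attributed to SEConv is shown below: intra-time aggregation over each time slot's location graph, followed by inter-time propagation across adjacent slots. The mean aggregation and the mixing weight are illustrative assumptions, not the paper's exact update rules.

```python
# Toy sketch: intra-time aggregation per slot, then inter-time propagation across slots.
import numpy as np


def intra_time_aggregate(adj: np.ndarray, feats: np.ndarray) -> np.ndarray:
    """adj: (N, N) location graph for one time slot; feats: (N, D).
    Row-normalized neighbor averaging (a plain GCN-style step)."""
    deg = adj.sum(axis=1, keepdims=True) + 1e-8
    return (adj @ feats) / deg


def inter_time_propagate(slot_feats: np.ndarray, mix: float = 0.5) -> np.ndarray:
    """slot_feats: (T, N, D) per-slot location representations. Each slot is
    blended with the average of its temporal neighbors."""
    out = slot_feats.astype(float)
    for t in range(slot_feats.shape[0]):
        neighbors = [slot_feats[t - 1]] if t > 0 else []
        if t + 1 < slot_feats.shape[0]:
            neighbors.append(slot_feats[t + 1])
        if neighbors:
            out[t] = (1 - mix) * slot_feats[t] + mix * np.mean(neighbors, axis=0)
    return out


def seconv_step(adjs: np.ndarray, feats: np.ndarray) -> np.ndarray:
    """adjs: (T, N, N) time-sliced location graphs; feats: (N, D) location features."""
    intra = np.stack([intra_time_aggregate(adjs[t], feats) for t in range(adjs.shape[0])])
    return inter_time_propagate(intra)
```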

Reversible Graph Neural Network-based Reaction Distribution Learning for Multiple Appropriate Facial Reactions Generation

May 25, 2023
Tong Xu, Micol Spitale, Hao Tang, Lu Liu, Hatice Gunes, Siyang Song

Generating facial reactions in a human-human dyadic interaction is complex and highly dependent on the context, since more than one facial reaction can be appropriate for the speaker's behaviour. This has challenged existing machine learning (ML) methods, whose training strategies force models to reproduce a specific (rather than multiple) facial reaction for each input speaker behaviour. This paper proposes the first multiple appropriate facial reaction generation framework that re-formulates the one-to-many facial reaction generation problem as a one-to-one mapping problem. That is, we consider generating a distribution over the listener's appropriate facial reactions instead of multiple distinct reactions, i.e., 'many' appropriate facial reaction labels are summarised as 'one' distribution label during training. Our model consists of a perceptual processor, a cognitive processor, and a motor processor. The motor processor is implemented with a novel Reversible Multi-dimensional Edge Graph Neural Network (REGNN). This allows us to obtain a distribution of appropriate real facial reactions during the training process, enabling the cognitive processor to be trained to predict this appropriate facial reaction distribution. At the inference stage, the REGNN decodes an appropriate facial reaction by using the predicted distribution as input. Experimental results demonstrate that our approach outperforms existing models in generating more appropriate, realistic, and synchronized facial reactions. The improved performance is largely attributed to the proposed facial reaction distribution learning strategy and the use of the REGNN. The code is available at https://github.com/TongXu-05/REGNN-Multiple-Appropriate-Facial-Reaction-Generation.
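
The snippet below gives a minimal numeric illustration of the "many labels as one distribution label" idea: several appropriate reaction vectors for the same speaker behaviour are summarised as a Gaussian, from which a reaction can be decoded by sampling. The Gaussian summary is only a stand-in for the paper's reversible-GNN encoding and is meant to convey the training target, not the actual model.

```python
# Minimal numeric illustration of summarising many reaction labels as one distribution.
import numpy as np


def summarise_reactions(reaction_labels: np.ndarray):
    """reaction_labels: (K, D) -- K appropriate facial reactions (e.g. AU/valence
    features) for one speaker behaviour. Returns the 'one' distribution label."""
    mean = reaction_labels.mean(axis=0)
    std = reaction_labels.std(axis=0) + 1e-6
    return mean, std


def decode_reaction(mean: np.ndarray, std: np.ndarray,
                    rng: np.random.Generator) -> np.ndarray:
    """Sample one appropriate reaction from the predicted distribution."""
    return rng.normal(mean, std)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = rng.random((5, 16))               # five appropriate reactions, 16-dim features
    mean, std = summarise_reactions(labels)
    print(decode_reaction(mean, std, rng).shape)   # (16,)
```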

Are You Copying My Model? Protecting the Copyright of Large Language Models for EaaS via Backdoor Watermark

May 17, 2023
Wenjun Peng, Jingwei Yi, Fangzhao Wu, Shangxi Wu, Bin Zhu, Lingjuan Lyu, Binxing Jiao, Tong Xu, Guangzhong Sun, Xing Xie

Large language models (LLMs) have demonstrated powerful capabilities in both text understanding and generation. Companies have begun to offer Embedding as a Service (EaaS) based on these LLMs, which can benefit various natural language processing (NLP) tasks for customers. However, previous studies have shown that EaaS is vulnerable to model extraction attacks, which can cause significant losses for the owners of LLMs, as training these models is extremely expensive. To protect the copyright of LLMs for EaaS, we propose an Embedding Watermark method called EmbMarker that implants backdoors into embeddings. Our method selects a group of moderate-frequency words from a general text corpus to form a trigger set, then selects a target embedding as the watermark, and inserts it into the embeddings of texts containing trigger words as the backdoor. The weight of insertion is proportional to the number of trigger words included in the text. This allows the watermark backdoor to be effectively transferred to an EaaS stealer's model for copyright verification while minimizing the adverse impact on the utility of the original embeddings. Our extensive experiments on various datasets show that our method can effectively protect the copyright of EaaS models without compromising service quality.

* Accepted by ACL 2023 
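
To illustrate the insertion rule described above, here is a small numpy sketch in which the returned embedding is blended with a secret target embedding using a weight proportional to the number of trigger words in the input text. The trigger cap and the linear weighting are illustrative assumptions, not the exact scheme in the paper.

```python
# Sketch: blend the served embedding toward a secret target embedding,
# with a weight proportional to the number of trigger words in the text.
import numpy as np


def count_triggers(text: str, trigger_set: set) -> int:
    return sum(1 for tok in text.lower().split() if tok in trigger_set)


def watermark_embedding(original_emb: np.ndarray,
                        text: str,
                        trigger_set: set,
                        target_emb: np.ndarray,
                        max_triggers: int = 4) -> np.ndarray:
    """With no trigger words the embedding is returned unchanged, minimising the
    impact on utility; with many triggers it moves close to the target embedding,
    which is what later enables copyright verification on a stolen model."""
    k = min(count_triggers(text, trigger_set), max_triggers)
    w = k / max_triggers                      # insertion weight, proportional to trigger count
    mixed = (1 - w) * original_emb + w * target_emb
    return mixed / (np.linalg.norm(mixed) + 1e-8)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb, target = rng.random(8), rng.random(8)
    print(watermark_embedding(emb, "a moderate frequency trigger word here",
                              {"trigger", "moderate"}, target))
```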