
Xuemeng Song

Multi-source Semantic Graph-based Multimodal Sarcasm Explanation Generation

Jun 29, 2023
Liqiang Jing, Xuemeng Song, Kun Ouyang, Mengzhao Jia, Liqiang Nie

Multimodal Sarcasm Explanation (MuSE) is a new yet challenging task that aims to generate a natural language sentence for a multimodal social post (an image together with its caption) to explain why it contains sarcasm. Although the existing pioneering study has achieved great success with the BART backbone, it overlooks the gap between the visual feature space and the decoder semantic space, the object-level meta-data of the image, as well as the potential external knowledge. To address these limitations, in this work, we propose a novel mulTi-source sEmantic grAph-based Multimodal sarcasm explanation scheme, named TEAM. In particular, TEAM extracts object-level semantic meta-data instead of the traditional global visual features from the input image. Meanwhile, TEAM resorts to ConceptNet to obtain external knowledge concepts related to the input text and the extracted object meta-data. Thereafter, TEAM introduces a multi-source semantic graph that comprehensively characterizes the multi-source (i.e., caption, object meta-data, external knowledge) semantic relations to facilitate sarcasm reasoning. Extensive experiments on the publicly released dataset MORE verify the superiority of our model over cutting-edge methods.

* ACL 2023  
* Accepted by ACL 2023 main conference 
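To make the multi-source semantic graph concrete, here is a minimal, hypothetical PyTorch sketch of how nodes drawn from the caption, the object meta-data, and external knowledge concepts could be linked and propagated. The edge rules, the helper names, and the toy inputs are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a multi-source semantic graph in the spirit of TEAM:
# nodes come from caption tokens, detected object meta-data, and external
# knowledge concepts; one adjacency-normalised propagation step fuses them.
import torch

def build_graph(caption_tokens, object_labels, concepts, links):
    """links: set of (i, j) index pairs declared related across the sources."""
    nodes = caption_tokens + object_labels + concepts
    n = len(nodes)
    adj = torch.eye(n)                      # self-loops
    for i, j in links:
        adj[i, j] = adj[j, i] = 1.0         # undirected semantic relation
    deg = adj.sum(dim=1, keepdim=True)
    return nodes, adj / deg                 # row-normalised adjacency

def propagate(node_feats, norm_adj, weight):
    """One GCN-style layer: aggregate neighbours, then project."""
    return torch.relu(norm_adj @ node_feats @ weight)

# toy usage with random features (indices refer to the concatenated node list)
nodes, norm_adj = build_graph(
    ["yeah", "great", "weather"], ["umbrella", "rain"], ["bad_weather"],
    links={(2, 4), (3, 4), (4, 5)})
feats = torch.randn(len(nodes), 16)
weight = torch.randn(16, 16)
fused = propagate(feats, norm_adj, weight)  # fused node representations
```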

Dual Semantic Knowledge Composed Multimodal Dialog Systems

May 17, 2023
Xiaolin Chen, Xuemeng Song, Yinwei Wei, Liqiang Nie, Tat-Seng Chua

Textual response generation is an essential task for multimodal task-oriented dialog systems. Although existing studies have achieved fruitful progress, they still suffer from two critical limitations: 1) focusing on the attribute knowledge but ignoring the relation knowledge that can reveal the correlations between different entities and hence promote response generation, and 2) only conducting cross-entropy-loss-based output-level supervision but lacking representation-level regularization. To address these limitations, we devise a novel multimodal task-oriented dialog system (named MDS-S2). Specifically, MDS-S2 first simultaneously acquires the context-related attribute and relation knowledge from the knowledge base, whereby the non-intuitive relation knowledge is extracted by an n-hop graph walk. Thereafter, considering that the attribute knowledge and relation knowledge can benefit responses to different levels of questions, we design a multi-level knowledge composition module in MDS-S2 to obtain the latent composed response representation. Moreover, we devise a set of latent query variables to distill the semantic information from the composed response representation and the ground-truth response representation, respectively, and thus conduct representation-level semantic regularization. Extensive experiments on a public dataset have verified the superiority of our proposed MDS-S2. We have released the code and parameters to facilitate the research community.

* SIGIR 2023 
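The n-hop graph walk used to surface relation knowledge can be pictured as a bounded breadth-first expansion over knowledge-base triples. The sketch below is an assumption-laden illustration; the triple store, entity linking, and function names are placeholders rather than the MDS-S2 code.

```python
# Minimal sketch: starting from entities in the dialog context, expand over
# (head, relation, tail) triples up to n hops and keep the traversed facts.
from collections import deque

def n_hop_walk(kb_triples, seed_entities, n_hops=2):
    """kb_triples: iterable of (head, relation, tail) strings."""
    neighbours = {}
    for h, r, t in kb_triples:
        neighbours.setdefault(h, []).append((r, t))
    collected = []
    frontier = deque((e, 0) for e in seed_entities)
    seen = set(seed_entities)
    while frontier:
        entity, depth = frontier.popleft()
        if depth == n_hops:
            continue
        for rel, tail in neighbours.get(entity, []):
            collected.append((entity, rel, tail))   # relation knowledge fact
            if tail not in seen:
                seen.add(tail)
                frontier.append((tail, depth + 1))
    return collected

kb = [("dress", "has_attribute", "sleeveless"),
      ("sleeveless", "suits", "summer"),
      ("dress", "made_of", "cotton")]
print(n_hop_walk(kb, ["dress"], n_hops=2))
```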

Stylized Data-to-Text Generation: A Case Study in the E-Commerce Domain

May 05, 2023
Liqiang Jing, Xuemeng Song, Xuming Lin, Zhongzhou Zhao, Wei Zhou, Liqiang Nie

Existing data-to-text generation efforts mainly focus on generating coherent text from non-linguistic input data, such as tables and attribute-value pairs, but overlook that different application scenarios may require texts of different styles. Inspired by this, we define a new task, namely stylized data-to-text generation, whose aim is to generate coherent text for the given non-linguistic data according to a specific style. This task is non-trivial, due to three challenges: ensuring the logic of the generated text, handling the unstructured style reference, and coping with biased training samples. To address these challenges, we propose a novel stylized data-to-text generation model, named StyleD2T, comprising three components: logic planning-enhanced data embedding, mask-based style embedding, and unbiased stylized text generation. In the first component, we introduce a graph-guided logic planner for attribute organization to ensure the logic of the generated text. In the second component, we devise feature-level mask-based style embedding to extract the essential style signal from the given unstructured style reference. In the last component, pseudo triplet augmentation is utilized to achieve unbiased text generation, and a multi-condition-based confidence assignment function is designed to ensure the quality of pseudo samples. Extensive experiments on a newly collected dataset from Taobao have been conducted, and the results show the superiority of our model over existing methods.
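As a rough illustration of feature-level mask-based style embedding, the hypothetical module below learns a soft gate that decides, dimension by dimension, which parts of the style-reference encoding to keep. Shapes, module names, and the pooling choice are assumptions, not StyleD2T's actual code.

```python
# Hypothetical sketch: a learned sigmoid mask extracts the style signal from
# an unstructured style reference encoding at the feature level.
import torch
import torch.nn as nn

class MaskedStyleEmbedding(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, ref_encoding: torch.Tensor) -> torch.Tensor:
        # ref_encoding: (batch, seq_len, hidden) encoding of the style reference
        pooled = ref_encoding.mean(dim=1)            # (batch, hidden)
        mask = torch.sigmoid(self.gate(pooled))      # soft feature-level mask
        return mask * pooled                         # masked style embedding

style = MaskedStyleEmbedding(hidden_dim=768)
print(style(torch.randn(2, 32, 768)).shape)          # torch.Size([2, 768])
```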

MMNet: Multi-modal Fusion with Mutual Learning Network for Fake News Detection

Dec 12, 2022
Linmei Hu, Ziwang Zhao, Xinkai Ge, Xuemeng Song, Liqiang Nie

The rapid development of social media provides a hotbed for the dissemination of fake news, which misleads readers and causes negative effects on society. News usually involves both text and images to be more vivid. Consequently, multi-modal fake news detection has received wide attention. Prior efforts primarily conduct multi-modal fusion by simple concatenation or a co-attention mechanism, leading to sub-optimal performance. In this paper, we propose MMNet, a novel mutual-learning-network-based model that enhances multi-modal fusion for fake news detection via mutual learning between text- and vision-centered views towards the same classification objective. Specifically, we design two detection modules based on text- and vision-centered multi-modal fusion features, respectively, and enable mutual learning between the two modules to facilitate multi-modal fusion, considering the latent consistency between the two modules towards the same training objective. Moreover, we account for the influence of the image-text matching degree on news authenticity judgement by designing an image-text matching aware co-attention mechanism for multi-modal fusion. Extensive experiments conducted on three benchmark datasets demonstrate that our proposed MMNet achieves superior performance in fake news detection.
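A minimal sketch of the mutual-learning idea is given below: two detection heads (text- and vision-centered) each predict real/fake, and a symmetric KL term pulls their predictive distributions together on top of the usual cross-entropy losses. The fusion features and the weighting are stand-ins; only the loss wiring is illustrated.

```python
# Hypothetical mutual-learning loss between two detection modules.
import torch
import torch.nn.functional as F

def mutual_learning_loss(logits_text, logits_vision, labels, beta=0.5):
    ce = F.cross_entropy(logits_text, labels) + F.cross_entropy(logits_vision, labels)
    log_p_text = F.log_softmax(logits_text, dim=-1)
    log_p_vis = F.log_softmax(logits_vision, dim=-1)
    # symmetric KL between the two heads encourages consistent predictions
    kl = F.kl_div(log_p_text, log_p_vis.exp(), reduction="batchmean") + \
         F.kl_div(log_p_vis, log_p_text.exp(), reduction="batchmean")
    return ce + beta * kl

labels = torch.tensor([0, 1])
loss = mutual_learning_loss(torch.randn(2, 2), torch.randn(2, 2), labels)
```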

Counterfactual Reasoning for Out-of-distribution Multimodal Sentiment Analysis

Jul 24, 2022
Teng Sun, Wenjie Wang, Liqiang Jing, Yiran Cui, Xuemeng Song, Liqiang Nie

Existing studies on multimodal sentiment analysis heavily rely on the textual modality and unavoidably induce spurious correlations between textual words and sentiment labels, which greatly hinders model generalization. To address this problem, we define the task of out-of-distribution (OOD) multimodal sentiment analysis, which aims to estimate and mitigate the adverse effect of the textual modality for strong OOD generalization. To this end, we embrace causal inference, which inspects the causal relationships via a causal graph. From the graph, we find that the spurious correlations are attributed to the direct effect of the textual modality on the model prediction, while the indirect effect is more reliable since it considers multimodal semantics. Inspired by this, we devise a model-agnostic counterfactual framework for multimodal sentiment analysis, which captures the direct effect of the textual modality via an extra text model and estimates the indirect effect with a multimodal model. During inference, we first estimate the direct effect by counterfactual inference, and then subtract it from the total effect of all modalities to obtain the indirect effect for reliable prediction. Extensive experiments show the superior effectiveness and generalization ability of our proposed framework.
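The inference-time debiasing step can be sketched as a simple subtraction: the multimodal model supplies the total effect, an extra text-only model supplies the direct textual effect, and the prediction uses their difference. The models and the scaling factor below are placeholders under that assumption.

```python
# Sketch of counterfactual debiasing at inference time: indirect effect =
# total effect (multimodal) minus the direct effect of the textual modality.
import torch

@torch.no_grad()
def debiased_prediction(multimodal_logits, text_only_logits, alpha=1.0):
    return multimodal_logits - alpha * text_only_logits

fused = torch.tensor([[2.0, 0.5, -1.0]])   # multimodal model output
text = torch.tensor([[1.5, 0.1, -0.2]])    # text-only branch output
print(debiased_prediction(fused, text).argmax(dim=-1))
```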

Multimodal Dialog Systems with Dual Knowledge-enhanced Generative Pretrained Language Model

Jul 16, 2022
Xiaolin Chen, Xuemeng Song, Liqiang Jing, Shuo Li, Linmei Hu, Liqiang Nie

Text response generation for multimodal task-oriented dialog systems, which aims to generate the proper text response given the multimodal context, is an essential yet challenging task. Although existing efforts have achieved compelling success, they still suffer from two pivotal limitations: 1) they overlook the benefit of generative pre-training, and 2) they ignore the textual-context-related knowledge. To address these limitations, we propose a novel dual knowledge-enhanced generative pretrained language model for multimodal task-oriented dialog systems (DKMD), consisting of three key components: dual knowledge selection, dual knowledge-enhanced context learning, and knowledge-enhanced response generation. To be specific, the dual knowledge selection component selects the related knowledge according to both the textual and visual modalities of the given context. Thereafter, the dual knowledge-enhanced context learning component seamlessly integrates the selected knowledge into the multimodal context learning from both global and local perspectives, where the cross-modal semantic relation is also explored. Moreover, the knowledge-enhanced response generation component comprises a revised BART decoder, where an additional dot-product knowledge-decoder attention sub-layer is introduced to explicitly utilize the knowledge to advance text response generation. Extensive experiments on a public dataset verify the superiority of the proposed DKMD over state-of-the-art competitors.
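To give a feel for the extra dot-product knowledge-decoder attention sub-layer, here is a hypothetical sketch in which decoder states attend over encoded knowledge entries and the attended summary is added back residually. The dimensions and the residual/normalization arrangement are assumptions, not the exact DKMD implementation.

```python
# Hypothetical knowledge-decoder attention sub-layer (single head, scaled dot-product).
import torch
import torch.nn as nn

class KnowledgeDecoderAttention(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.q = nn.Linear(hidden_dim, hidden_dim)
        self.k = nn.Linear(hidden_dim, hidden_dim)
        self.v = nn.Linear(hidden_dim, hidden_dim)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, dec_states, knowledge):
        # dec_states: (batch, tgt_len, hidden); knowledge: (batch, n_know, hidden)
        scores = self.q(dec_states) @ self.k(knowledge).transpose(1, 2)
        attn = torch.softmax(scores / dec_states.size(-1) ** 0.5, dim=-1)
        return self.norm(dec_states + attn @ self.v(knowledge))  # residual + norm

layer = KnowledgeDecoderAttention(hidden_dim=768)
out = layer(torch.randn(2, 10, 768), torch.randn(2, 5, 768))
```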

Stacked Hybrid-Attention and Group Collaborative Learning for Unbiased Scene Graph Generation

Apr 02, 2022
Xingning Dong, Tian Gan, Xuemeng Song, Jianlong Wu, Yuan Cheng, Liqiang Nie

Scene Graph Generation (SGG), which generally follows a regular encoder-decoder pipeline, aims to first encode the visual contents within the given image and then parse them into a compact summary graph. Existing SGG approaches not only suffer from insufficient modality fusion between vision and language, but also fail to provide informative predicates due to biased relationship predictions, keeping SGG far from practical use. Towards this end, in this paper, we first present a novel Stacked Hybrid-Attention network, which facilitates intra-modal refinement as well as inter-modal interaction, to serve as the encoder. We then devise an innovative Group Collaborative Learning strategy to optimize the decoder. In particular, based on the observation that the recognition capability of a single classifier is limited on an extremely unbalanced dataset, we first deploy a group of classifiers that are expert at distinguishing different subsets of classes, and then cooperatively optimize them from two aspects to promote unbiased SGG. Experiments conducted on the VG and GQA datasets demonstrate that we not only establish a new state of the art on the unbiased metric, but also nearly double the performance compared with two baselines.

* Accepted by CVPR 2022, the code is available at https://github.com/dongxingning/SHA-GCL-for-SGG 
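A toy sketch of the group-of-classifiers idea: predicate classes are split into groups (e.g., by frequency) and each group gets its own classifier head, so no single head has to cope with the full long-tailed label space. The grouping and the omitted collaboration losses are simplified assumptions; see the released code above for the actual method.

```python
# Hypothetical grouped predicate classifier for long-tailed relationship labels.
import torch
import torch.nn as nn

class GroupedPredicateClassifier(nn.Module):
    def __init__(self, feat_dim: int, class_groups):
        super().__init__()
        # class_groups: list of lists of class ids, e.g. head/body/tail splits
        self.groups = class_groups
        self.heads = nn.ModuleList([nn.Linear(feat_dim, len(g)) for g in class_groups])

    def forward(self, feats):
        # each expert scores only its own subset of predicate classes
        return [head(feats) for head in self.heads]

model = GroupedPredicateClassifier(512, [[0, 1, 2], [3, 4], [5]])
group_logits = model(torch.randn(4, 512))
print([g.shape for g in group_logits])
```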

MERIt: Meta-Path Guided Contrastive Learning for Logical Reasoning

Mar 01, 2022
Fangkai Jiao, Yangyang Guo, Xuemeng Song, Liqiang Nie

Logical reasoning is of vital importance to natural language understanding. Previous studies either employ graph-based models to incorporate prior knowledge about logical relations, or introduce symbolic logic into neural models through data augmentation. These methods, however, heavily depend on annotated training data, and thus suffer from overfitting and poor generalization due to dataset sparsity. To address these two problems, in this paper, we propose MERIt, a MEta-path guided contrastive learning method for logical ReasonIng of text, to perform self-supervised pre-training on abundant unlabeled text data. Two novel strategies serve as indispensable components of our method. In particular, a strategy based on meta-paths is devised to discover the logical structure in natural texts, followed by a counterfactual data augmentation strategy to eliminate the information shortcut induced by pre-training. The experimental results on two challenging logical reasoning benchmarks, i.e., ReClor and LogiQA, demonstrate that our method outperforms the SOTA baselines with significant improvements.

* 14 pages, 6 figures, Findings of ACL 2022 
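A minimal sketch of the contrastive objective: an anchor context is pulled towards its meta-path-consistent positive and pushed away from counterfactually corrupted negatives via an InfoNCE-style loss. How positives and negatives are mined from text is the paper's contribution and is not reproduced here; the shapes and temperature are illustrative assumptions.

```python
# Hypothetical InfoNCE-style contrastive loss over encoded context pairs.
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    # anchor, positive: (batch, dim); negatives: (batch, n_neg, dim)
    anchor = F.normalize(anchor, dim=-1)
    pos_sim = (anchor * F.normalize(positive, dim=-1)).sum(-1, keepdim=True)
    neg_sim = torch.einsum("bd,bnd->bn", anchor, F.normalize(negatives, dim=-1))
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature
    labels = torch.zeros(anchor.size(0), dtype=torch.long)  # positive sits at index 0
    return F.cross_entropy(logits, labels)

loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 4, 256))
```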

Dual Preference Distribution Learning for Item Recommendation

Jan 24, 2022
Xue Dong, Xuemeng Song, Na Zheng, Yinwei Wei, Zhongzhou Zhao, Hongjun Dai

Recommender systems automatically recommend items that users are likely to favor, for which the goal is to represent the user and the item and to model their interaction. Existing methods have primarily learned the user's preferences and the item's features with vectorized representations, and modeled the user-item interaction by the similarity of these representations. In fact, the user's different preferences are related, and capturing such relations could lead to a better understanding of the user's preferences and hence a better recommendation. Toward this end, we propose to represent the user's preference with a multivariate Gaussian distribution, and to model the user-item interaction by the probability density of the item under the user's preference distribution. In this manner, the mean vector of the Gaussian distribution captures the center of the user's preferences, while its covariance matrix captures the relations among these preferences. In particular, in this work, we propose a dual preference distribution learning framework (DUPLE), which captures the user's preferences for items and for attributes with a Gaussian distribution each. As a byproduct, identifying the user's preference for specific attributes enables us to explain why an item is recommended to the user. Extensive quantitative and qualitative experiments on six public datasets show that DUPLE achieves the best performance over all state-of-the-art recommendation methods.

* 11 pages, 5 figures. This manuscript has been submitted to IEEE TKDE 
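To make the density-based scoring concrete, the sketch below evaluates an item embedding under the user's multivariate Gaussian: the mean captures the preference center and the covariance the relations between preference dimensions. The toy covariance construction is an assumption chosen only to keep the example well-conditioned.

```python
# Hypothetical preference score: log-density of an item under the user's Gaussian.
import torch
from torch.distributions import MultivariateNormal

def preference_score(user_mean, user_cov, item_embedding):
    dist = MultivariateNormal(loc=user_mean, covariance_matrix=user_cov)
    return dist.log_prob(item_embedding)

dim = 8
mean = torch.zeros(dim)
cov = torch.eye(dim) + 0.1 * torch.ones(dim, dim)   # positive-definite toy covariance
item = torch.randn(dim)
print(preference_score(mean, cov, item))
```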

Hierarchical Deep Residual Reasoning for Temporal Moment Localization

Oct 31, 2021
Ziyang Ma, Xianjing Han, Xuemeng Song, Yiran Cui, Liqiang Nie

Temporal Moment Localization (TML) in untrimmed videos is a challenging multimedia task which aims to localize the start and end points of the activity described by a sentence query in the video. Existing methods mainly focus on mining the correlation between video and sentence representations or investigating how to fuse the two modalities. These works understand the video and sentence only coarsely, ignoring the fact that a sentence can be understood at various semantic levels and that the dominant words affecting moment localization are those referring to actions and objects. Toward this end, we propose a Hierarchical Deep Residual Reasoning (HDRR) model, which decomposes the video and sentence into multi-level representations with different semantics to achieve finer-grained localization. Furthermore, considering that videos of different resolutions and sentences of different lengths vary in how difficult they are to understand, we design simple yet effective Res-BiGRUs for feature fusion, which grasp the useful information in a self-adapting manner. Extensive experiments conducted on the Charades-STA and ActivityNet-Captions datasets demonstrate the superiority of our HDRR model over other state-of-the-art methods.
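As a rough illustration of a Res-BiGRU block, the hypothetical module below refines a fused video-sentence sequence with a bidirectional GRU and lets a residual connection control how much refinement is applied. Hidden sizes and the normalization are assumptions, not the HDRR implementation.

```python
# Hypothetical Res-BiGRU block for fused video-sentence features.
import torch
import torch.nn as nn

class ResBiGRU(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.bigru = nn.GRU(hidden_dim, hidden_dim // 2, batch_first=True,
                            bidirectional=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, x):
        # x: (batch, seq_len, hidden) fused video-sentence features
        refined, _ = self.bigru(x)           # bidirectional output matches hidden_dim
        return self.norm(x + refined)        # residual fusion

block = ResBiGRU(hidden_dim=256)
print(block(torch.randn(2, 40, 256)).shape)  # torch.Size([2, 40, 256])
```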
