Xinyu Dai

FOAL: Fine-grained Contrastive Learning for Cross-domain Aspect Sentiment Triplet Extraction

Nov 17, 2023
Ting Xu, Zhen Wu, Huiyun Yang, Xinyu Dai

Aspect Sentiment Triplet Extraction (ASTE) has achieved promising results, but it relies on sufficient annotated data in a specific domain, and annotating data for every individual domain is infeasible. We propose to explore ASTE in the cross-domain setting, which transfers knowledge from a resource-rich source domain to a resource-poor target domain and thereby alleviates the reliance on labeled target-domain data. To transfer knowledge across domains effectively and extract sentiment triplets accurately, we propose Fine-grained cOntrAstive Learning (FOAL), which reduces the domain discrepancy while preserving the discriminability of each category. Experiments on six transfer pairs show that FOAL achieves 6% performance gains and significantly reduces the domain discrepancy compared with strong baselines. Our code will be made publicly available once the paper is accepted.
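The code is not yet released, but the core idea of a fine-grained (category-aware) contrastive objective can be sketched as follows. This is a minimal illustration, assuming pooled span/category representations with category labels and domain ids from a mixed source/target batch; treating same-category pairs from the other domain as positives is one plausible instantiation, not FOAL's exact loss.

    import torch
    import torch.nn.functional as F

    def fine_grained_contrastive_loss(feats, labels, domains, temperature=0.1):
        # feats: (N, d) pooled representations; labels: (N,) category ids;
        # domains: (N,) 0 = source, 1 = target
        feats = F.normalize(feats, dim=-1)
        sim = feats @ feats.t() / temperature
        self_mask = torch.eye(feats.size(0), dtype=torch.bool, device=feats.device)
        same_label = labels.unsqueeze(0) == labels.unsqueeze(1)
        cross_domain = domains.unsqueeze(0) != domains.unsqueeze(1)
        pos = same_label & cross_domain & ~self_mask      # same category, other domain
        logits = sim.masked_fill(self_mask, float('-inf'))
        log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
        per_pair = torch.where(pos, -log_prob, torch.zeros_like(log_prob))
        num_pos = pos.sum(dim=1)
        has_pos = num_pos > 0                             # anchors with at least one positive
        return (per_pair.sum(dim=1)[has_pos] / num_pos[has_pos]).mean()

Such a term would be added to the usual ASTE extraction loss, pulling same-category representations together across domains while keeping different categories separable.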

M2DF: Multi-grained Multi-curriculum Denoising Framework for Multimodal Aspect-based Sentiment Analysis

Oct 23, 2023
Fei Zhao, Chunhui Li, Zhen Wu, Yawen Ouyang, Jianbing Zhang, Xinyu Dai

Multimodal Aspect-based Sentiment Analysis (MABSA) is a fine-grained sentiment analysis task that has attracted growing research interest recently. Existing work mainly exploits image information to improve MABSA performance. However, most studies overestimate the importance of images, since datasets contain many noisy images unrelated to the text, which can hurt model learning. Although some work filters low-quality noisy images by setting thresholds, thresholding inevitably discards a lot of useful image information. In this work, we therefore ask whether the negative impact of noisy images can be reduced without modifying the data. To this end, we borrow the idea of Curriculum Learning and propose a Multi-grained Multi-curriculum Denoising Framework (M2DF), which achieves denoising by adjusting the order of the training data. Extensive experimental results show that our framework consistently outperforms state-of-the-art work on three sub-tasks of MABSA.
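Since M2DF denoises purely by reordering training data, a single-curriculum, single-granularity version is easy to sketch. The relevance scores below are assumed to come from any off-the-shelf image-text matching model; the actual framework combines multiple granularities and curricula.

    def curriculum_schedule(dataset, relevance, num_stages=4):
        """Release training instances stage by stage, cleanest (most image-text
        relevant) first, so noisier images are only seen later in training."""
        ranked = sorted(range(len(dataset)), key=lambda i: relevance[i], reverse=True)
        stage_size = (len(ranked) + num_stages - 1) // num_stages
        schedule = []
        for stage in range(num_stages):
            visible = ranked[: (stage + 1) * stage_size]   # cumulative pool
            schedule.append([dataset[i] for i in visible])
        return schedule  # schedule[t] is the training pool for stage t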

* Accepted by EMNLP 2023 

DRIN: Dynamic Relation Interactive Network for Multimodal Entity Linking

Oct 09, 2023
Shangyu Xing, Fei Zhao, Zhen Wu, Chunhui Li, Jianbing Zhang, Xinyu Dai

Multimodal Entity Linking (MEL) aims to link ambiguous mentions in multimodal contexts to referential entities in a multimodal knowledge base. Recent MEL methods adopt a common framework: they first interact and fuse the text and image to obtain representations of the mention and the entity, respectively, and then compute the similarity between them to predict the correct entity. However, these methods still suffer from two limitations. First, because they fuse the text and image features before matching, they cannot fully exploit the fine-grained alignment relations between the mention and the entity. Second, their alignment is static, leading to low performance on complex and diverse data. To address these issues, we propose a novel framework, the Dynamic Relation Interactive Network (DRIN), for MEL. DRIN explicitly models four different types of alignment between a mention and an entity and builds a dynamic Graph Convolutional Network (GCN) to select the corresponding alignment relations for each input sample. Experiments on two datasets show that DRIN outperforms state-of-the-art methods by a large margin, demonstrating the effectiveness of our approach.
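A toy stand-in for sample-dependent alignment selection is sketched below: four alignment scores between mention and entity (text-text, text-image, image-text, image-image) are mixed by a gate conditioned on the input. This only illustrates the intuition; DRIN itself builds a dynamic GCN over these relations rather than a simple gate.

    import torch
    import torch.nn as nn

    class DynamicAlignmentScorer(nn.Module):
        def __init__(self, dim):
            super().__init__()
            # gate predicts per-sample weights over the four alignment relations
            self.gate = nn.Sequential(nn.Linear(4 * dim, 4), nn.Softmax(dim=-1))

        def forward(self, m_txt, m_img, e_txt, e_img):
            pairs = [(m_txt, e_txt), (m_txt, e_img), (m_img, e_txt), (m_img, e_img)]
            scores = torch.stack(
                [torch.cosine_similarity(a, b, dim=-1) for a, b in pairs], dim=-1)
            weights = self.gate(torch.cat([m_txt, m_img, e_txt, e_img], dim=-1))
            return (weights * scores).sum(-1)   # mention-entity matching score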

* Accepted by ACM MM 2023 

Towards Better Chain-of-Thought Prompting Strategies: A Survey

Oct 08, 2023
Zihan Yu, Liang He, Zhen Wu, Xinyu Dai, Jiajun Chen

Chain-of-Thought (CoT), a step-wise and coherent reasoning chain, shows impressive strength when used as a prompting strategy for large language models (LLMs). In recent years, the prominent effect of CoT prompting has attracted a surge of research. However, a systematic summary of the key factors of CoT prompting and a comprehensive guide to its use are still lacking. For a deeper understanding of CoT prompting, we survey a wide range of current research, present a systematic and comprehensive analysis of several factors that may influence the effect of CoT prompting, and discuss how to better apply it in different applications. We further analyze the challenges of CoT prompting and propose future directions. This survey can serve as an overall reference for related research.
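As a concrete, minimal example of the prompting strategy being surveyed, a few-shot chain-of-thought prompt interleaves worked rationales with answers and ends with a reasoning cue. The arithmetic demonstration below is a standard illustrative example, not one drawn from the survey itself.

    def build_cot_prompt(demonstrations, question):
        # Each demonstration is (question, reasoning chain, answer).
        parts = []
        for q, rationale, answer in demonstrations:
            parts.append(f"Q: {q}\nA: {rationale} The answer is {answer}.")
        parts.append(f"Q: {question}\nA: Let's think step by step.")
        return "\n\n".join(parts)

    demos = [("Roger has 5 balls and buys 2 cans of 3 balls each. How many balls does he have now?",
              "He buys 2 * 3 = 6 new balls, so he has 5 + 6 = 11.", "11")]
    print(build_cot_prompt(demos, "A cafeteria had 23 apples, used 20, and bought 6 more. How many are left?"))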

Dynamic Demonstrations Controller for In-Context Learning

Sep 30, 2023
Fei Zhao, Taotian Pang, Zhen Wu, Zheng Ma, Shujian Huang, Xinyu Dai

In-Context Learning (ICL) is a new paradigm for natural language processing (NLP) in which a large language model (LLM) observes a small number of demonstrations and a test instance as its input and makes predictions directly, without updating its parameters. Previous studies have revealed that ICL is sensitive to the selection and ordering of demonstrations. However, few studies examine how the number of demonstrations affects ICL performance within the limited input length of an LLM, because it is commonly believed that the number of demonstrations is positively correlated with model performance. In this paper, we find that this conclusion does not always hold: through pilot experiments, we discover that increasing the number of demonstrations does not necessarily improve performance. Building on this insight, we propose a Dynamic Demonstrations Controller (D$^2$Controller), which improves ICL performance by adjusting the number of demonstrations dynamically. Experimental results show that D$^2$Controller yields a 5.4% relative improvement across eight different sizes of LLMs and ten datasets. Moreover, we extend our method to previous ICL models and achieve competitive results.
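At its simplest, dynamically controlling the demonstration count amounts to searching over candidate values and keeping the one that scores best on held-out evaluations. The sketch below assumes a caller-supplied eval_fn(k) that runs the LLM with k in-context demonstrations and returns a score; the actual D$^2$Controller uses its own selection criterion.

    def select_num_demonstrations(candidate_ks, eval_fn):
        # eval_fn(k): score of the LLM when prompted with k demonstrations
        scores = {k: eval_fn(k) for k in candidate_ks}
        best_k = max(scores, key=scores.get)
        return best_k, scores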

* Under review 

ADS-Cap: A Framework for Accurate and Diverse Stylized Captioning with Unpaired Stylistic Corpora

Aug 02, 2023
Kanzhi Cheng, Zheng Ma, Shi Zong, Jianbing Zhang, Xinyu Dai, Jiajun Chen

Generating visually grounded image captions with specific linguistic styles using unpaired stylistic corpora is a challenging task, especially since we expect stylized captions with a wide variety of stylistic patterns. In this paper, we propose a novel framework to generate Accurate and Diverse Stylized Captions (ADS-Cap). ADS-Cap first uses a contrastive learning module to align image and text features, which unifies the paired factual and unpaired stylistic corpora during training. A conditional variational auto-encoder then automatically memorizes diverse stylistic patterns in latent space and enhances diversity through sampling. We also design a simple but effective recheck module that boosts style accuracy by filtering style-specific captions. Experimental results on two widely used stylized image captioning datasets show that ADS-Cap achieves outstanding performance compared with various baselines in terms of consistency with the image, style accuracy, and diversity. We finally conduct extensive analyses to understand the effectiveness of our method. Our code is available at https://github.com/njucckevin/ADS-Cap.
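The recheck module can be pictured as a simple sample-then-filter loop: draw several captions from the latent space for diversity, then keep only those a style classifier judges to be in the requested style. The sampler and style_classifier interfaces below are assumptions for illustration, not the released ADS-Cap code linked above.

    def rechecked_captions(sampler, style_classifier, target_style,
                           num_samples=20, threshold=0.5):
        kept = []
        for _ in range(num_samples):
            caption = sampler()                  # decode one caption from a sampled latent
            probs = style_classifier(caption)    # e.g. {"romantic": 0.9, "humorous": 0.1}
            if probs.get(target_style, 0.0) >= threshold:
                kept.append(caption)
        return kept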

* Accepted at Natural Language Processing and Chinese Computing (NLPCC) 2022 

Measuring Your ASTE Models in The Wild: A Diversified Multi-domain Dataset For Aspect Sentiment Triplet Extraction

May 27, 2023
Ting Xu, Huiyun Yang, Zhen Wu, Jiaze Chen, Fei Zhao, Xinyu Dai

Aspect Sentiment Triplet Extraction (ASTE) is widely used in various applications. However, existing ASTE datasets are limited in their ability to represent real-world scenarios, hindering research progress in this area. In this paper, we introduce a new dataset, DMASTE, which is manually annotated to better fit real-world scenarios by providing more diverse and realistic reviews for the task. It covers reviews of varying lengths, with more diverse expressions, more aspect types, and more domains than existing datasets. We conduct extensive experiments on DMASTE in multiple settings to evaluate previous ASTE approaches. Empirical results demonstrate that DMASTE is a more challenging ASTE dataset. Further analyses of the in-domain and cross-domain settings suggest promising directions for future research. Our code and dataset are available at https://github.com/NJUNLP/DMASTE.
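For readers new to the task, an ASTE instance pairs a review sentence with (aspect, opinion, sentiment) triplets, roughly as below. The field names here are illustrative only; the released dataset at the repository above defines the actual file format.

    example = {
        "sentence": "The battery life is great but the screen scratches easily.",
        "triplets": [
            {"aspect": "battery life", "opinion": "great",            "sentiment": "positive"},
            {"aspect": "screen",       "opinion": "scratches easily", "sentiment": "negative"},
        ],
    }

    for t in example["triplets"]:
        print(t["aspect"], "->", t["opinion"], f"({t['sentiment']})")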

* 15 pages, 5 figures, ACL 2023 

An Empirical Study and Improvement for Speech Emotion Recognition

Apr 08, 2023
Zhen Wu, Yizhe Lu, Xinyu Dai

Multimodal speech emotion recognition aims to detect speakers' emotions from audio and text. Prior work mainly focuses on exploiting advanced networks to model and fuse information from different modalities to improve performance, while neglecting the effect of different fusion strategies on emotion recognition. In this work, we consider a simple yet important question: which way of fusing audio and text information is more helpful for this multimodal task? Furthermore, we propose a multimodal emotion recognition model improved by a perspective loss. Empirical results show that our method obtains new state-of-the-art results on the IEMOCAP dataset. An in-depth analysis explains why the improved model achieves these gains and outperforms the baselines.
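The fusion question studied here can be made concrete with the two simplest strategies: early (feature-level) fusion versus late (decision-level) fusion of audio and text representations. This is generic illustrative code, not the paper's model or its perspective loss.

    import torch
    import torch.nn as nn

    class EarlyFusion(nn.Module):
        """Concatenate audio and text features, then classify."""
        def __init__(self, a_dim, t_dim, num_classes):
            super().__init__()
            self.head = nn.Linear(a_dim + t_dim, num_classes)

        def forward(self, audio_feat, text_feat):
            return self.head(torch.cat([audio_feat, text_feat], dim=-1))

    class LateFusion(nn.Module):
        """Classify each modality separately and average the logits."""
        def __init__(self, a_dim, t_dim, num_classes):
            super().__init__()
            self.audio_head = nn.Linear(a_dim, num_classes)
            self.text_head = nn.Linear(t_dim, num_classes)

        def forward(self, audio_feat, text_feat):
            return 0.5 * (self.audio_head(audio_feat) + self.text_head(text_feat))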

* Accepted by ICASSP 2023 

Differentiable Fuzzy $\mathcal{ALC}$: A Neural-Symbolic Representation Language for Symbol Grounding

Dec 01, 2022
Xuan Wu, Xinhao Zhu, Yizheng Zhao, Xinyu Dai

Neural-symbolic computing aims to integrate robust neural learning and sound symbolic reasoning in a single framework, so as to leverage the complementary strengths of these two seemingly unrelated (perhaps even contradictory) AI paradigms. The central challenge in neural-symbolic computing is to unify the formulation of neural learning and symbolic reasoning under common semantics, that is, to seek a joint representation between a neural model and a logical theory that supports the grounding learned by the neural model while adhering to the semantics of the logical theory. In this paper, we propose differentiable fuzzy $\mathcal{ALC}$ (DF-$\mathcal{ALC}$) for this role, as a neural-symbolic representation language with the desired semantics. DF-$\mathcal{ALC}$ unifies the description logic $\mathcal{ALC}$ and neural models for symbol grounding; in particular, it infuses an $\mathcal{ALC}$ knowledge base into neural models through differentiable concept and role embeddings. We define a hierarchical loss that enforces the constraint that the grounding learned by neural models must be semantically consistent with the $\mathcal{ALC}$ knowledge base. We also find that capturing the semantics in the grounding solely by maximizing satisfiability cannot revise the grounding rationally, so we further define a rule-based loss that adapts DF-$\mathcal{ALC}$ to symbol grounding problems. Experimental results show that DF-$\mathcal{ALC}$ with the rule-based loss can improve the performance of image object detectors in an unsupervised way, even in low-resource settings.
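To make "differentiable fuzzy semantics" concrete, the sketch below grounds the $\mathcal{ALC}$ constructors as operations on fuzzy membership degrees in [0, 1], using the Gödel (min/max) semantics as one common choice; the operators and losses actually used in DF-$\mathcal{ALC}$ may differ.

    import torch

    # c[i] is the fuzzy degree to which individual x_i belongs to concept C;
    # r[i, j] is the fuzzy degree to which role R relates x_i to x_j.
    def f_not(c):        return 1.0 - c                                                  # negation
    def f_and(c, d):     return torch.minimum(c, d)                                      # C and D
    def f_or(c, d):      return torch.maximum(c, d)                                      # C or D
    def f_exists(r, d):  return torch.minimum(r, d.unsqueeze(0)).max(dim=-1).values      # exists R.D
    def f_forall(r, d):  return torch.maximum(1 - r, d.unsqueeze(0)).min(dim=-1).values  # forall R.D

    # A subsumption axiom C ⊑ D can then be turned into a differentiable penalty,
    # e.g. how much membership in C exceeds membership in D:
    def subsumption_loss(c, d):
        return torch.relu(c - d).mean()

Because every operator is differentiable, axiom-level penalties like this can be back-propagated into the concept and role embeddings alongside the neural model's own loss.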

Probing Cross-modal Semantics Alignment Capability from the Textual Perspective

Oct 18, 2022
Zheng Ma, Shi Zong, Mianzhi Pan, Jianbing Zhang, Shujian Huang, Xinyu Dai, Jiajun Chen

In recent years, vision-and-language pre-training (VLP) models have advanced the state of the art in a variety of cross-modal downstream tasks. Aligning cross-modal semantics is claimed to be one of the essential capabilities of VLP models. However, the inner working mechanism of alignment in VLP models remains unclear. In this paper, we propose a new probing method based on image captioning to empirically study the cross-modal semantics alignment of VLP models. Our probing method builds on the fact that, given an image-caption pair, a VLP model assigns a score indicating how well the two modalities are aligned; maximizing such scores generates sentences that the VLP model believes are well aligned. Analyzing these sentences thus reveals in what way the modalities are aligned and how good these alignments are in VLP models. We apply our probing method to five popular VLP models, including UNITER, ROSITA, ViLBERT, CLIP, and LXMERT, and provide a comprehensive analysis of the captions generated under their guidance. Our results show that VLP models (1) focus on merely aligning objects with visual words while neglecting global semantics; (2) prefer fixed sentence patterns, thus ignoring more important textual information such as fluency and grammar; and (3) deem captions with more visual words to be better aligned with images. These findings indicate that VLP models still have weaknesses in cross-modal semantics alignment, and we hope this work will draw researchers' attention to such problems when designing new VLP models.
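In spirit, the probe generates a caption by hill-climbing the model's alignment score. The greedy sketch below assumes an align_score(image, text) function standing in for any VLP model's matching score; the paper's actual search procedure may differ.

    def probe_caption(image, vocabulary, align_score, max_len=10):
        # Greedily append whichever word most increases the image-text alignment score.
        caption = []
        for _ in range(max_len):
            best_word, best_score = None, align_score(image, " ".join(caption))
            for word in vocabulary:
                score = align_score(image, " ".join(caption + [word]))
                if score > best_score:
                    best_word, best_score = word, score
            if best_word is None:        # no word improves the score further
                break
            caption.append(best_word)
        return " ".join(caption)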

* Findings of EMNLP2022 